WIP: dynahash replacement for buffer table
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch. I took the basic infrastructure
from before and used it to replace the buffer table. Patch is
attached.
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.
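To make that a bit more concrete, here's a minimal, self-contained sketch of
the reader/writer protocol: readers walk a sorted bucket chain with no lock,
and writers splice new nodes in with compare-and-swap. This is just an
illustration, not code from the patch; it uses plain pointers and made-up
names (Node, lookup, insert), whereas chash.c works with 32-bit arena offsets,
a mark bit for deletes, explicit memory barriers, and hazard-pointer-style
protection of scans against garbage collection, none of which appear here.

#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

/* Simplified node; the real patch stores 32-bit arena offsets, not pointers. */
typedef struct Node
{
	struct Node *next;
	uint32_t	hashcode;
	uint32_t	key;
	uint32_t	value;
} Node;

/*
 * Lock-free lookup: walk the bucket chain without taking any lock.  The
 * chain is kept sorted by hash code, so the scan can stop early.
 */
static bool
lookup(Node **bucket, uint32_t hashcode, uint32_t key, uint32_t *value)
{
	Node	   *n;

	for (n = *bucket; n != NULL; n = n->next)
	{
		if (n->hashcode > hashcode)
			break;
		if (n->hashcode == hashcode && n->key == key)
		{
			*value = n->value;
			return true;
		}
	}
	return false;
}

/*
 * Insert via compare-and-swap: find the unique legal position in the sorted
 * chain, point the new node at its successor, and CAS the predecessor's
 * next pointer.  If the CAS fails, someone else changed the chain; rescan
 * and retry.
 */
static bool
insert(Node **bucket, Node *new_node)
{
	for (;;)
	{
		Node	  **prev = bucket;
		Node	   *cur = *prev;

		while (cur != NULL &&
			   (cur->hashcode < new_node->hashcode ||
				(cur->hashcode == new_node->hashcode &&
				 cur->key < new_node->key)))
		{
			prev = &cur->next;
			cur = cur->next;
		}

		if (cur != NULL && cur->hashcode == new_node->hashcode &&
			cur->key == new_node->key)
			return false;		/* duplicate key; insert fails */

		new_node->next = cur;
		if (__sync_bool_compare_and_swap(prev, cur, new_node))
			return true;
		/* Chain changed under us; start over from the bucket head. */
	}
}

Deletion in the real code is the part that needs more care: it first CASes a
mark bit into the victim's next pointer and only then unlinks it, and the
unlinked node goes onto a garbage list rather than straight back to a free
list, so that concurrent unlocked scans never step on reused memory.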
I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts. However, my results weren't really reproducible, and I haven't
done comprehensive testing yet. What was really bizarre was that I
couldn't pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain. But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.
This patch contains assorted leftovers and is grotty in various ways,
but I'm sharing it anyway just to get it out there.
git branch also available at:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment: chash-buftable-v1.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/Makefile b/contrib/Makefile
index b37d0dd..0b91ac1 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -20,6 +20,7 @@ SUBDIRS = \
earthdistance \
file_fdw \
fuzzystrmatch \
+ hashtest \
hstore \
intagg \
intarray \
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 0000000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 0000000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 0000000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-------------------------------------------------------------------------
+ * hashtest.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+Datum chash_insert_test(PG_FUNCTION_ARGS);
+Datum chash_search_test(PG_FUNCTION_ARGS);
+Datum chash_delete_test(PG_FUNCTION_ARGS);
+Datum chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum chash_collision_test(PG_FUNCTION_ARGS);
+Datum dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum dynahash_search_test(PG_FUNCTION_ARGS);
+Datum dynahash_delete_test(PG_FUNCTION_ARGS);
+Datum dynahash_concurrent_test(PG_FUNCTION_ARGS);
+Datum dynahash_collision_test(PG_FUNCTION_ARGS);
+static void hashtest_shmem_startup(void);
+
+PG_FUNCTION_INFO_V1(chash_insert_test);
+PG_FUNCTION_INFO_V1(chash_search_test);
+PG_FUNCTION_INFO_V1(chash_delete_test);
+PG_FUNCTION_INFO_V1(chash_concurrent_test);
+PG_FUNCTION_INFO_V1(chash_collision_test);
+PG_FUNCTION_INFO_V1(dynahash_insert_test);
+PG_FUNCTION_INFO_V1(dynahash_search_test);
+PG_FUNCTION_INFO_V1(dynahash_delete_test);
+PG_FUNCTION_INFO_V1(dynahash_concurrent_test);
+PG_FUNCTION_INFO_V1(dynahash_collision_test);
+
+typedef struct
+{
+ uint32 key;
+ uint32 val;
+} hentry;
+
+static CHashDescriptor cdesc = {
+ "hashtest-chash", /* name */
+ 1048576, /* capacity */
+ sizeof(hentry), /* element size */
+ sizeof(uint32) /* key size */
+};
+
+#define DYNAHASH_PARTITIONS 16
+
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+static CHashTable chash;
+static HTAB *dynahash;
+static LWLockId dynahash_lock[DYNAHASH_PARTITIONS];
+static ClientAuthentication_hook_type original_client_auth_hook = NULL;
+
+static void hashtest_client_auth_hook(Port *port, int status);
+static void chash_write_stats_to_log(int code, Datum dummy);
+
+#define dynahash_get_lock(hashcode) \
+ (dynahash_lock[(hashcode) % DYNAHASH_PARTITIONS])
+
+void
+_PG_init(void)
+{
+ Size cs;
+ Size ds;
+
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = hashtest_shmem_startup;
+ chash = CHashBootstrap(&cdesc);
+ cs = CHashEstimateSize(chash);
+ RequestAddinShmemSpace(cs);
+ ds = hash_estimate_size(cdesc.capacity, cdesc.element_size);
+ RequestAddinShmemSpace(ds);
+ elog(LOG, "chash: %u bytes; dynahash: %u bytes", (unsigned) cs,
+ (unsigned) ds);
+ RequestAddinLWLocks(DYNAHASH_PARTITIONS);
+ original_client_auth_hook = ClientAuthentication_hook;
+ ClientAuthentication_hook = hashtest_client_auth_hook;
+
+}
+
+static void
+hashtest_client_auth_hook(Port *port, int status)
+{
+ if (original_client_auth_hook)
+ original_client_auth_hook(port, status);
+ on_proc_exit(chash_write_stats_to_log, (Datum) 0);
+}
+
+static void
+chash_write_stats_to_log(int code, Datum dummy)
+{
+ uint64 stats[CHS_NumberOfStatistics];
+ CHashStatisticsType i;
+ StringInfoData buf;
+
+ CHashStatistics(chash, stats);
+ initStringInfo(&buf);
+
+ for (i = 0; i < CHS_NumberOfStatistics; ++i)
+ {
+ if (stats[i] == 0)
+ continue;
+ appendStringInfo(&buf, UINT64_FORMAT " %s; ", stats[i],
+ CHashStatisticsNames[i]);
+ }
+
+ if (buf.len > 1)
+ {
+ buf.data[buf.len-2] = '\0';
+ elog(LOG, "chash statistics: %s", buf.data);
+ }
+}
+
+static void
+hashtest_shmem_startup(void)
+{
+ HASHCTL info;
+ uint32 i;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Initialize concurrent hash table. */
+ chash = CHashInitialize(chash, &cdesc);
+
+ /* Initialize shared dynahash table. */
+ info.keysize = cdesc.key_size;
+ info.entrysize = cdesc.element_size;
+ info.hash = tag_hash;
+ info.num_partitions = DYNAHASH_PARTITIONS;
+
+ dynahash = ShmemInitHash("hashtest-dynahash",
+ cdesc.capacity, cdesc.capacity,
+ &info,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+
+ for (i = 0; i < DYNAHASH_PARTITIONS; ++i)
+ dynahash_lock[i] = LWLockAssign();
+}
+
+Datum
+chash_insert_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ e.val = i * 31;
+ ok = CHashInsert(chash, &e);
+ if (!ok)
+ elog(LOG, "insert %u: failed", i);
+ ok = CHashInsert(chash, &e);
+ if (ok)
+ elog(LOG, "insert %u: worked twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_search_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ ok = CHashSearch(chash, &e);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (e.val != e.key * 31)
+ elog(LOG, "search %u: found %u", i, e.val);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_delete_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ ok = CHashDelete(chash, &e);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ ok = CHashDelete(chash, &e);
+ if (ok)
+ elog(LOG, "delete %u: found twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_concurrent_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+ uint32 seed = MyProcPid << 16;
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = seed | i;
+ e.val = MyProcPid;
+ ok = CHashInsert(chash, &e);
+ if (!ok)
+ elog(LOG, "insert %u: found", i);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = seed | i;
+ e.val = 0;
+ ok = CHashSearch(chash, &e);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "search %u: not found", i);
+ while (!CHashSearch(chash, &e))
+ ++retry;
+ elog(LOG, "search %u: eventually found it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ if (e.val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u", i, (unsigned) MyProcPid, e.val);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = seed | i;
+ ok = CHashDelete(chash, &e);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "delete %u: not found", i);
+ while (!CHashDelete(chash, &e))
+ ++retry;
+ elog(LOG, "delete %u: eventually deleted it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_collision_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ /* Don't stack-allocate this. */
+ static bool mine[10000];
+
+ memset(mine, 0, 10000 * sizeof(bool));
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ e.val = MyProcPid;
+ ok = CHashInsert(chash, &e);
+ if (ok)
+ mine[i] = true;
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ if (!mine[i])
+ continue;
+ e.key = i;
+ ok = CHashSearch(chash, &e);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (e.val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u",
+ i, (unsigned) MyProcPid, e.val);
+ ok = CHashDelete(chash, &e);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+static bool
+dynahash_insert(uint32 key, uint32 val)
+{
+ bool found;
+ uint32 hashcode;
+ hentry *e;
+ LWLockId lockid;
+
+ hashcode = get_hash_value(dynahash, (void *) &key);
+ lockid = dynahash_get_lock(hashcode);
+ LWLockAcquire(lockid, LW_EXCLUSIVE);
+ e = hash_search_with_hash_value(dynahash, (void *) &key,
+ hashcode, HASH_ENTER, &found);
+ if (!found)
+ e->val = val;
+ LWLockRelease(lockid);
+
+ return !found;
+}
+
+static bool
+dynahash_search(uint32 key, uint32 *val)
+{
+ uint32 hashcode;
+ hentry *e;
+ LWLockId lockid;
+
+ hashcode = get_hash_value(dynahash, (void *) &key);
+ lockid = dynahash_get_lock(hashcode);
+ LWLockAcquire(lockid, LW_SHARED);
+ e = hash_search_with_hash_value(dynahash, (void *) &key,
+ hashcode, HASH_FIND, NULL);
+ if (e)
+ *val = e->val;
+ LWLockRelease(lockid);
+
+ return e != NULL;
+}
+
+static bool
+dynahash_delete(uint32 key)
+{
+ uint32 hashcode;
+ hentry *e;
+ LWLockId lockid;
+
+ hashcode = get_hash_value(dynahash, (void *) &key);
+ lockid = dynahash_get_lock(hashcode);
+ LWLockAcquire(lockid, LW_EXCLUSIVE);
+ e = hash_search_with_hash_value(dynahash, (void *) &key,
+ hashcode, HASH_REMOVE, NULL);
+ LWLockRelease(lockid);
+
+ return e != NULL;
+}
+
+Datum
+dynahash_insert_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_insert(i, i * 31);
+ if (!ok)
+ elog(LOG, "insert %u: failed", i);
+ ok = dynahash_insert(i, i * 31);
+ if (ok)
+ elog(LOG, "insert %u: worked twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_search_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+ uint32 val;
+
+ ok = dynahash_search(i, &val);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (val != i * 31)
+ elog(LOG, "search %u: found %u", i, val);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_delete_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_delete(i);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ ok = dynahash_delete(i);
+ if (ok)
+ elog(LOG, "delete %u: found twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_concurrent_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ uint32 val;
+ uint32 seed = MyProcPid << 16;
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_insert(seed | i, MyProcPid);
+ if (!ok)
+ elog(LOG, "insert %u: found", i);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_search(seed | i, &val);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "search %u: not found", i);
+ while (!dynahash_search(seed | i, &val))
+ ++retry;
+ elog(LOG, "search %u: eventually found it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ if (val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u",
+ i, (unsigned) MyProcPid, val);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_delete(seed | i);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "delete %u: not found", i);
+ while (!dynahash_delete(seed | i))
+ ++retry;
+ elog(LOG, "delete %u: eventually deleted it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_collision_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ uint32 val;
+
+ /* Don't stack-allocate this. */
+ static bool mine[10000];
+
+ memset(mine, 0, 10000 * sizeof(bool));
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_insert(i, MyProcPid);
+ if (ok)
+ mine[i] = true;
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ if (!mine[i])
+ continue;
+ ok = dynahash_search(i, &val);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u",
+ i, (unsigned) MyProcPid, val);
+ ok = dynahash_delete(i);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ }
+
+ PG_RETURN_VOID();
+}
diff --git a/contrib/hashtest/hashtest.control b/contrib/hashtest/hashtest.control
new file mode 100644
index 0000000..b8e0f01
--- /dev/null
+++ b/contrib/hashtest/hashtest.control
@@ -0,0 +1,4 @@
+comment = 'hash testing code'
+default_version = '1.0'
+module_pathname = '$libdir/hashtest'
+relocatable = true
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index 7a38f2f..092cf8f 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -21,8 +21,10 @@
*/
#include "postgres.h"
+#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
+#include "utils/chash.h"
/* entry for buffer lookup hashtable */
@@ -32,8 +34,13 @@ typedef struct
int id; /* Associated buffer ID */
} BufferLookupEnt;
-static HTAB *SharedBufHash;
-
+static CHashDescriptor SharedBufDescriptor = {
+ "buffer lookup table",
+ 0,
+ sizeof(BufferLookupEnt),
+ sizeof(BufferTag)
+};
+static CHashTable SharedBufHash;
/*
* Estimate space needed for mapping hashtable
@@ -42,7 +49,13 @@ static HTAB *SharedBufHash;
Size
BufTableShmemSize(int size)
{
- return hash_estimate_size(size, sizeof(BufferLookupEnt));
+ if (SharedBufHash == NULL)
+ {
+ SharedBufDescriptor.capacity = size;
+ SharedBufHash = CHashBootstrap(&SharedBufDescriptor);
+ }
+
+ return CHashEstimateSize(SharedBufHash);
}
/*
@@ -52,59 +65,29 @@ BufTableShmemSize(int size)
void
InitBufTable(int size)
{
- HASHCTL info;
-
- /* assume no locking is needed yet */
-
- /* BufferTag maps to Buffer */
- info.keysize = sizeof(BufferTag);
- info.entrysize = sizeof(BufferLookupEnt);
- info.hash = tag_hash;
- info.num_partitions = NUM_BUFFER_PARTITIONS;
-
- SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
- size, size,
- &info,
- HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
-}
-
-/*
- * BufTableHashCode
- * Compute the hash code associated with a BufferTag
- *
- * This must be passed to the lookup/insert/delete routines along with the
- * tag. We do it like this because the callers need to know the hash code
- * in order to determine which buffer partition to lock, and we don't want
- * to do the hash computation twice (hash_any is a bit slow).
- */
-uint32
-BufTableHashCode(BufferTag *tagPtr)
-{
- return get_hash_value(SharedBufHash, (void *) tagPtr);
+ if (SharedBufHash == NULL || !IsUnderPostmaster)
+ {
+ Assert(SharedBufDescriptor.capacity == 0 ||
+ SharedBufDescriptor.capacity == size);
+ SharedBufDescriptor.capacity = size;
+ SharedBufHash = CHashInitialize(SharedBufHash, &SharedBufDescriptor);
+ }
}
/*
* BufTableLookup
* Lookup the given BufferTag; return buffer ID, or -1 if not found
- *
- * Caller must hold at least share lock on BufMappingLock for tag's partition
*/
int
-BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
+BufTableLookup(BufferTag *tagPtr)
{
- BufferLookupEnt *result;
-
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- (void *) tagPtr,
- hashcode,
- HASH_FIND,
- NULL);
+ BufferLookupEnt ent;
- if (!result)
+ ent.key = *tagPtr;
+ if (!CHashSearch(SharedBufHash, &ent))
return -1;
- return result->id;
+ return ent.id;
}
/*
@@ -118,27 +101,20 @@ BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
* Caller must hold exclusive lock on BufMappingLock for tag's partition
*/
int
-BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
+BufTableInsert(BufferTag *tagPtr, int buf_id)
{
- BufferLookupEnt *result;
- bool found;
+ BufferLookupEnt ent;
+
+ ent.key = *tagPtr;
+ ent.id = buf_id;
Assert(buf_id >= 0); /* -1 is reserved for not-in-table */
Assert(tagPtr->blockNum != P_NEW); /* invalid tag */
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- (void *) tagPtr,
- hashcode,
- HASH_ENTER,
- &found);
-
- if (found) /* found something already in the table */
- return result->id;
-
- result->id = buf_id;
+ if (CHashInsert(SharedBufHash, &ent))
+ return -1;
- return -1;
+ return ent.id;
}
/*
@@ -148,17 +124,8 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
* Caller must hold exclusive lock on BufMappingLock for tag's partition
*/
void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr)
{
- BufferLookupEnt *result;
-
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- (void *) tagPtr,
- hashcode,
- HASH_REMOVE,
- NULL);
-
- if (!result) /* shouldn't happen */
+ if (!CHashDelete(SharedBufHash, tagPtr))
elog(ERROR, "shared buffer hash table corrupted");
}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 45d1d61..437deb9 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -429,22 +429,14 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
else
{
BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
int buf_id;
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
forkNum, blockNum);
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
/* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
+ buf_id = BufTableLookup(&newTag);
/* If not in buffers, initiate prefetch */
if (buf_id < 0)
@@ -822,11 +814,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
bool *foundPtr)
{
BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
BufFlags oldFlags;
int buf_id;
volatile BufferDesc *buf;
@@ -835,29 +823,31 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
/* create a tag so we can lookup the buffer */
INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
/* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
+start:
+ buf_id = BufTableLookup(&newTag);
if (buf_id >= 0)
{
+ BufferDesc *foundbuf;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * buffer pool.
*/
- buf = &BufferDescriptors[buf_id];
+ foundbuf = &BufferDescriptors[buf_id];
- valid = PinBuffer(buf, strategy);
+ valid = PinBuffer(foundbuf, strategy);
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
+ /* Check whether someone recycled the buffer before we pinned it. */
+ if (!BUFFERTAGS_EQUAL(newTag, foundbuf->tag))
+ {
+ UnpinBuffer(foundbuf, true);
+ goto start;
+ }
*foundPtr = TRUE;
+ /* Check to see if the correct data has been loaded into the buffer. */
if (!valid)
{
/*
@@ -867,7 +857,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(foundbuf, true))
{
/*
* If we get here, previous attempts to read the buffer must
@@ -877,15 +867,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
- return buf;
+ return foundbuf;
}
- /*
- * Didn't find it in the buffer pool. We'll have to initialize a new
- * buffer. Remember to unlock the mapping lock while doing the work.
- */
- LWLockRelease(newPartitionLock);
-
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -986,42 +970,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
if (oldFlags & BM_TAG_VALID)
{
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
+ /* Save old tag. */
oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* these just keep the compiler quiet about uninit variables */
- oldHash = 0;
- oldPartitionLock = 0;
}
/*
@@ -1031,32 +981,34 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Note that we have not yet removed the hashtable entry for the old
* tag.
*/
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+enter:
+ buf_id = BufTableInsert(&newTag, buf->buf_id);
if (buf_id >= 0)
{
+ BufferDesc *foundbuf;
+
/*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
+ * We've got a collision, apparently: it looks like someone else
+ * did what we were about to do. We can handle this as if we had
+ * found the buffer in the pool in the first place, but we must
+ * recheck the buffer tag after pinning it, because it could still
+ * get renamed under us.
+ */
+ foundbuf = &BufferDescriptors[buf_id];
+ valid = PinBuffer(foundbuf, strategy);
+ if (!BUFFERTAGS_EQUAL(newTag, foundbuf->tag))
+ {
+ UnpinBuffer(foundbuf, true);
+ goto enter;
+ }
+
+ /*
+ * Collision confirmed. Give up the buffer we were planning to
+ * use.
*/
UnpinBuffer(buf, true);
- /* Can give up that buffer's mapping partition lock now */
- if ((oldFlags & BM_TAG_VALID) &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = &BufferDescriptors[buf_id];
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
*foundPtr = TRUE;
if (!valid)
@@ -1068,7 +1020,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(foundbuf, true))
{
/*
* If we get here, previous attempts to read the buffer
@@ -1078,7 +1030,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
- return buf;
+ return foundbuf;
}
/*
@@ -1097,11 +1049,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
break;
UnlockBufHdr(buf);
- BufTableDelete(&newTag, newHash);
- if ((oldFlags & BM_TAG_VALID) &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- LWLockRelease(newPartitionLock);
+ BufTableDelete(&newTag);
UnpinBuffer(buf, true);
}
@@ -1124,13 +1072,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf);
if (oldFlags & BM_TAG_VALID)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
-
- LWLockRelease(newPartitionLock);
+ BufTableDelete(&oldTag);
/*
* Buffer contents are currently invalid. Try to get the io_in_progress
@@ -1166,42 +1108,11 @@ static void
InvalidateBuffer(volatile BufferDesc *buf)
{
BufferTag oldTag;
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
BufFlags oldFlags;
/* Save the original buffer tag before dropping the spinlock */
oldTag = buf->tag;
- UnlockBufHdr(buf);
-
- /*
- * Need to compute the old tag's hashcode and partition lock ID. XXX is it
- * worth storing the hashcode in BufferDesc so we need not recompute it
- * here? Probably not.
- */
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
-retry:
-
- /*
- * Acquire exclusive mapping lock in preparation for changing the buffer's
- * association.
- */
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-
- /* Re-lock the buffer header */
- LockBufHdr(buf);
-
- /* If it's changed while we were waiting for lock, do nothing */
- if (!BUFFERTAGS_EQUAL(buf->tag, oldTag))
- {
- UnlockBufHdr(buf);
- LWLockRelease(oldPartitionLock);
- return;
- }
-
/*
* We assume the only reason for it to be pinned is that someone else is
* flushing the page out. Wait for them to finish. (This could be an
@@ -1211,15 +1122,21 @@ retry:
* yet done StartBufferIO, WaitIO will fall through and we'll effectively
* be busy-looping here.)
*/
- if (buf->refcount != 0)
+ while (buf->refcount != 0)
{
UnlockBufHdr(buf);
- LWLockRelease(oldPartitionLock);
/* safety check: should definitely not be our *own* pin */
if (GetPrivateRefCount(buf->buf_id) > 0)
elog(ERROR, "buffer is pinned in InvalidateBuffer");
WaitIO(buf);
- goto retry;
+ LockBufHdr(buf);
+
+ /* If it's changed while we were waiting for lock, do nothing */
+ if (!BUFFERTAGS_EQUAL(buf->tag, oldTag))
+ {
+ UnlockBufHdr(buf);
+ return;
+ }
}
/*
@@ -1237,12 +1154,7 @@ retry:
* Remove the buffer from the lookup hashtable, if it was in there.
*/
if (oldFlags & BM_TAG_VALID)
- BufTableDelete(&oldTag, oldHash);
-
- /*
- * Done with mapping lock.
- */
- LWLockRelease(oldPartitionLock);
+ BufTableDelete(&oldTag);
/*
* Insert the buffer at the head of the list of free buffers.
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 2ea2216..38614a4 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -423,6 +423,29 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
return structPtr;
}
+/*
+ * ShmemAttachStruct -- Attach to an existing structure in shared memory.
+ */
+void *
+ShmemAttachStruct(const char *name)
+{
+ ShmemIndexEnt *result;
+ void *ptr;
+ bool found;
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ result = (ShmemIndexEnt *)
+ hash_search(ShmemIndex, name, HASH_FIND, &found);
+ if (!found || result == NULL)
+ elog(ERROR, "shared memory structure %s not found", name);
+ ptr = result->location;
+ Assert(ptr != NULL);
+
+ LWLockRelease(ShmemIndexLock);
+
+ return ptr;
+}
/*
* Add two Size values, checking for overflow
diff --git a/src/backend/utils/hash/Makefile b/src/backend/utils/hash/Makefile
index 05d347c..5d53382 100644
--- a/src/backend/utils/hash/Makefile
+++ b/src/backend/utils/hash/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/hash
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dynahash.o hashfn.o
+OBJS = chash.o dynahash.o hashfn.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/hash/chash.c b/src/backend/utils/hash/chash.c
new file mode 100644
index 0000000..0d4dc78
--- /dev/null
+++ b/src/backend/utils/hash/chash.c
@@ -0,0 +1,1075 @@
+/*-------------------------------------------------------------------------
+ *
+ * chash.c
+ * concurrent hash tables
+ *
+ * A concurrent hash table stores a collection of fixed-size objects.
+ * From the point of view of this module, such objects are merely an
+ * opaque array of bytes, but the caller will typically implement them as
+ * a C "struct". Some fixed-size, leading portion of each object is
+ * designated as the key, which must be distinct for all objects in the
+ * collection. Since PostgreSQL's shared memory model does not permit
+ * dynamic shared-memory allocation, we preallocate shared-memory space
+ * for the maximum number of entities which can be stored (plus a few
+ * extra, for reasons that will be further explained below). This space
+ * is allocated as a single large array called the arena, and we often
+ * refer to entities by their position in the arena rather than via an
+ * ordinary pointer. This saves a considerable amount of memory, since
+ * most modern architectures are 64-bit and therefore use 8-byte pointers,
+ * while arena offsets can be stored in a 32-bit word. In fact, we
+ * reserve one bit in each such word as a mark bit, so the maximum size
+ * of the arena is 2^31 elements, a restriction that does not currently
+ * appear to be problematic. An additional advantage of this representation
+ * is that aligned 32-bit loads and stores are atomic on all architectures
+ * we support, but 64-bit loads and stores are not.
+ *
+ * When an element is inserted, we copy the data from the backend-private
+ * object supplied by the caller into one of these shared-memory entities.
+ * When the hash table is searched, the caller passes a backend-private
+ * entity with just the key filled in; if a matching element is found,
+ * data is copied from the shared memory entity into the non-key portion
+ * of the user-supplied entity. In this way, clients of this module
+ * never use pointers into shared memory directly.
+ *
+ * As normal, we structure the hash table as an array of buckets, whose
+ * size is always a power of two, so that the low-order bits of the
+ * hash code can be used to select a bucket. If multiple entities hash
+ * to the same bucket, we use separate chaining: each entity in the
+ * arena has an 8-byte header that stores the 4-byte arena offset of the
+ * next item in the bucket and the hash value of the entity's key.
+ * Bucket chains are maintained in order by ascending hash value and
+ * then by ascending entity key (as per memcmp) so that there is
+ * precisely one legal location at which a given new item can be inserted
+ * into a bucket.
+ *
+ * Items are inserted into buckets using compare-and-swap (CAS). Thus, this
+ * module cannot be used on architectures where we do not have 4-byte
+ * compare-and-swap. When an item is deleted, we first set its mark bit,
+ * which is stored within the next-pointer, again using CAS. Once this
+ * step is completed, the node is deleted. As a cleanup operation, we then
+ * use CAS to modify the next-pointer of the previous node to point to the
+ * node following the deleted node. Note that, even once this cleanup
+ * operation has been performed, some other backend concurrently walking the
+ * chain might still be visiting the deleted node. Thus, we must be certain
+ * not to overwrite the deleted node's next-pointer until all concurrent
+ * bucket scans have completed. This means, in particular, that we cannot
+ * immediately view the deleted node as available for reuse.
+ *
+ * Instead, when a delete-marked node is removed from the bucket chain, it
+ * is added to one of many garbage lists. There is a many-to-one mapping from
+ * buckets to garbage lists, so that the choice of bucket determines the
+ * garbage list but not vice versa. Any process which wishes to scan a bucket
+ * must first advertise in shared memory the corresponding garbage list number.
+ * When a backend wishes to move entries from a garbage list to a free list,
+ * it must first wait (by spinning) for any backends scanning that garbage
+ * list to complete their scans.
+ *
+ * Ideally, it would be nice to make this completely lock-free, but because
+ * of the above-described choice of garbage collection algorithm, it currently
+ * isn't. For an algorithm to be lock-free, it must be possible to suspend
+ * the execution of any number of processes for an arbitrary period of time
+ * without impeding the overall progress of the system. For this code, that
+ * is true except when garbage collection occurs. In that case, an insert,
+ * search, or delete operation can obstruct an inserting thread attempting to
+ * perform garbage collection for an unbounded period of time. The algorithm
+ * could be adapted so as to be completely lock-free, essentially by guaranteeing
+ * that even in the worst case no combination of processes can obstruct garbage
+ * collection to a sufficient degree as to prevent an inserter from finding
+ * an available entry in a hash table containing fewer live elements than its
+ * declared maximum capacity. However, it's not clear that this is a good
+ * idea, because it would likely slow down operation in the non-contended
+ * case to solve a problem that we hope won't happen anyway.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/hash/chash.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "access/hash.h"
+#include "storage/barrier.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/chash.h"
+#include "utils/memutils.h"
+
+/*
+ * CHashPtr represents an offset into the arena, plus a mark bit that is
+ * used to implement concurrent deletion.
+ */
+typedef uint32 CHashPtr;
+#define InvalidCHashPtr ((uint32) -2)
+#define CHashPtrIsInvalid(x) ((x) >= InvalidCHashPtr)
+#define CHashPtrIsMarked(x) ((x) & 1)
+#define CHashPtrGetOffset(x) ((x) >> 1)
+#define CHashPtrMark(x) ((x) | 1)
+#define CHashPtrUnmark(x) ((x) & ~1)
+#define MakeCHashPtr(x) ((x) << 1)
+#define CHashMaxCapacity CHashPtrGetOffset(InvalidCHashPtr)
+
+/*
+ * Each object stored in the hash table is represented by a CHashNode, which
+ * stores a pointer to the next item in the same bucket, and the exact hash
+ * value of the current item. Each CHashNode is followed by space for the
+ * item itself.
+ */
+typedef struct
+{
+ CHashPtr next; /* arena offset of next element */
+ union
+ {
+ uint32 hashcode; /* hash(key) */
+ CHashPtr gcnext; /* arena offset of next garbage item */
+ } un;
+} CHashNode;
+
+#define SizeOfCHashNode MAXALIGN(sizeof(CHashNode))
+#define CHashNodeGetItem(x) (((char *) x) + SizeOfCHashNode)
+
+/*
+ * CHashTableData stores all the information that we need in order to access
+ * a concurrent hash table. We store one copy of this data in shared memory,
+ * and an additional copy in the private memory of each backend accessing the
+ * table.
+ */
+typedef struct CHashTableData
+{
+ /*
+ * These fields do not change after initialization.
+ */
+ CHashDescriptor desc; /* descriptor for this hash table */
+ uint32 bucket_mask; /* # of buckets, minus one */
+ uint8 garbage_shift; /* log2(# of buckets/# of garbage lists) */
+ uint8 freelist_shift; /* log2(# of garbage lists/# freelists) */
+ uint16 nfreelists; /* # of freelists */
+ uint32 arena_limit; /* # of arena elements */
+ uint32 arena_stride; /* bytes allocated per arena element */
+ CHashPtr *bucket; /* array with 1 entry per bucket */
+ CHashPtr *extra; /* entries for garbage and free lists */
+ char *arena; /* arena */
+
+ /*
+ * These fields will be different in each backend; the shared copy is
+ * irrelevant.
+ */
+ int gc_pid; /* PID that set gc_next */
+ uint32 gc_next; /* next garbage list to reclaim */
+ uint64 stats[CHS_NumberOfStatistics]; /* statistics */
+} CHashTableData;
+
+/* Compute # of buckets, garbage lists, or free lists. */
+#define CHashTableNBuckets(table) \
+ ((table)->bucket_mask + 1)
+#define CHashTableNGarbage(table) \
+ (CHashTableNBuckets((table)) >> (table)->garbage_shift)
+#define CHashTableNFreeLists(table) \
+ ((table)->nfreelists)
+
+/*
+ * Garbage lists and free lists are interleaved to reduce cache line
+ * contention on the free lists, so the calculation of where an individual
+ * list is located is a bit complex.
+ */
+#define CHashTableGetGarbageList(table, n) \
+ (&(table)->extra[(n) + ((n) >> (table)->freelist_shift)])
+#define CHashTableGetGarbageByBucket(table, n) \
+ (CHashTableGetGarbageList((table), (n) >> (table)->garbage_shift))
+#define CHashTableGetFreeList(table, n) \
+ (&(table)->extra[(n) + (((n) + 1) << (table)->freelist_shift)])
+
+/* Access macros for arena nodes. */
+#define CHashTableGetRaw(table, offset) \
+ (AssertMacro((offset) < (table)->arena_limit), \
+ (CHashNode *) ((table)->arena + (table)->arena_stride * (offset)))
+#define CHashTableGetNode(table, ptr) \
+ (AssertMacro(!CHashPtrIsInvalid(ptr)), \
+ CHashTableGetRaw((table), CHashPtrGetOffset((ptr))))
+
+/* Statistics macros. */
+#define CHashTableIncrementStatistic(table, stat) \
+ ((table)->stats[(stat)]++)
+
+/*
+ * Bucket scan.
+ */
+typedef struct
+{
+ CHashPtr target;
+ CHashPtr next;
+ CHashPtr *pointer_to_target;
+ CHashNode *target_node;
+ bool found;
+} CHashScanResult;
+
+/* Human-readable statistics names. */
+char *CHashStatisticsNames[] = {
+ "searches",
+ "searches failed",
+ "inserts",
+ "inserts failed",
+ "inserts retried",
+ "deletions",
+ "deletions failed",
+ "deletions retried",
+ "scan expunges",
+ "scan expunges failed",
+ "scans restarted",
+ "cleanup scans",
+ "allocations failed",
+ "garbage enqueues retried",
+ "garbage dequeues failed",
+ "garbage collections",
+ "garbage collection spins",
+ "garbage collection reclaims skipped",
+ "garbage collection fast reclaims",
+ "garbage collection reclaims retried",
+ "<end>"
+};
+
+/* Function prototypes. */
+static CHashPtr CHashAllocate(CHashTable table);
+static CHashPtr CHashAllocateViaGC(CHashTable table);
+static void CHashAddToGarbage(CHashTable table, uint32 bucket, CHashPtr c);
+static void CHashBucketScan(CHashTable table,
+ CHashPtr *start,
+ uint32 hashcode,
+ const void *key,
+ CHashScanResult *res);
+
+/*
+ * First stage of CHashTable initialization. We fill in all the constants
+ * here, but not the pointers.
+ */
+CHashTable
+CHashBootstrap(CHashDescriptor *desc)
+{
+ CHashTable table;
+ uint32 bucket_shift;
+
+ /* Sanity check. */
+ Assert(!strcmp(CHashStatisticsNames[CHS_NumberOfStatistics], "<end>"));
+
+ /* Allocate table and copy descriptor. */
+ table = MemoryContextAllocZero(TopMemoryContext, sizeof(CHashTableData));
+ memcpy(&table->desc, desc, sizeof(CHashDescriptor));
+
+ /* Sanity checks. */
+ if (desc->capacity < 1 || desc->capacity > CHashMaxCapacity)
+ elog(ERROR, "invalid capacity for concurrent hash");
+ if (desc->key_size < 1 || desc->key_size > desc->element_size)
+ elog(ERROR, "invalid key size for concurrent hash");
+
+ /*
+ * The number of buckets must be a power of two. To avoid (as much as
+ * possible) having to traverse long bucket chains, we aim for a load
+ * factor <= 1.0, so this is a pretty simple calculation: we just find the
+ * smallest power of two greater than or equal to the target capacity.
+ */
+ bucket_shift = fls(desc->capacity - 1);
+ table->bucket_mask = (1 << bucket_shift) - 1;
+
+ /*
+ * We choose to have one garbage list for every 64 hash table buckets
+ * (that is, garbage_shift = 6) unless there are fewer than 64 buckets in
+ * total, in which case we still have a minimum of one garbage list.
+ *
+ * Increasing the garbage_shift would reduce the likelihood of a backend
+ * performing garbage collection needing to wait for a backend walking a
+ * bucket to finish its scan. On the other hand, decreasing the garbage
+ * shift would allow more items to be recovered in a single garbage
+ * collection cycle. It's not clear what the optimal value is.
+ */
+ table->garbage_shift = Min(bucket_shift, 6);
+ table->gc_next = 0;
+ table->gc_pid = 0;
+
+ /*
+ * Experimentation reveals that the free list manipulation is both one of
+ * the slowest parts of this algorithm and the most vulnerable to
+ * contention. Therefore, we want to have as many free lists as possible,
+ * but there's no need to have more than the number of CPU cores, so we
+ * limit the number of freelists to 64. There might be a benefit in some
+ * larger limit on a really big system, but we'll cap it here pending some
+ * actual test results. We're also limited to having no more freelists
+ * than we do garbage lists.
+ */
+#define LOG2_MAX_FREELIST 6
+ if (bucket_shift - table->garbage_shift < LOG2_MAX_FREELIST)
+ table->freelist_shift = 0;
+ else
+ table->freelist_shift =
+ (bucket_shift - table->garbage_shift) - LOG2_MAX_FREELIST;
+ table->nfreelists =
+ 1 << (bucket_shift - (table->garbage_shift + table->freelist_shift));
+
+ /*
+ * To make garbage collection efficient, we overallocate. Normally, we
+ * overallocate by one-eighth, but if that would be less than 15 elements,
+ * then we overallocate by 15 elements instead. This extra capacity can actually
+ * be used, but for best performance, it shouldn't be. It's the caller's
+ * responsibility to avoid this.
+ */
+ table->arena_limit = desc->capacity;
+ if (desc->capacity < 120)
+ table->arena_limit += 15;
+ else
+ table->arena_limit += table->arena_limit / 8;
+
+ /* Each arena element must be MAXALIGN'd and include per-node space. */
+ table->arena_stride = SizeOfCHashNode + MAXALIGN(desc->element_size);
+
+ /* Zero out statistics. */
+ memset(table->stats, 0, sizeof(uint64) * CHS_NumberOfStatistics);
+
+ return table;
+}
+
+/*
+ * Estimate shared memory requirements.
+ */
+Size
+CHashEstimateSize(CHashTable table)
+{
+ Size total_buckets;
+ Size size;
+ Size nbuckets = CHashTableNBuckets(table);
+ Size ngarbage = CHashTableNGarbage(table);
+ Size nfreelists = CHashTableNFreeLists(table);
+
+ Assert(nbuckets > 0 && ngarbage > 0 && nfreelists > 0);
+ total_buckets = add_size(nbuckets, ngarbage);
+ total_buckets = add_size(total_buckets, nfreelists);
+
+ size = MAXALIGN(sizeof(CHashTableData));
+ size = add_size(size, mul_size(sizeof(CHashPtr), total_buckets));
+ size = add_size(size, mul_size(table->arena_stride, table->arena_limit));
+
+ return size;
+}
+
+/*
+ * Create a concurrent hash table in shared memory, or attach to an existing
+ * table.
+ */
+CHashTable
+CHashInitialize(CHashTable table, CHashDescriptor *desc)
+{
+ Size size;
+ bool found;
+ void *shmem;
+ uint32 i;
+ uint32 nbuckets;
+ uint32 nfreelists;
+ uint32 ngarbage;
+ uint32 nextra;
+
+ /*
+ * If we're under the postmaster, this must be the EXEC_BACKEND case where
+ * we need to attach to an existing shared-memory segment.
+ */
+ if (IsUnderPostmaster)
+ {
+ void *shmem;
+
+ Assert(table == NULL);
+ table = MemoryContextAlloc(TopMemoryContext, sizeof(CHashTableData));
+ shmem = ShmemAttachStruct(desc->shmem_name);
+ memcpy(table, shmem, sizeof(CHashTableData));
+ return table;
+ }
+
+ /*
+ * Otherwise, the hash table should not already exist, and we must
+ * create it. But the table should already be bootstrapped, since we
+ * must previously have computed its size when figuring out our shared
+ * memory allocation.
+ */
+ Assert(table != NULL);
+ size = CHashEstimateSize(table);
+ shmem = ShmemInitStruct(table->desc.shmem_name, size, &found);
+ Assert(!found);
+
+ /* Bucket, garbage, and freelist arrays follow table info. */
+ table->bucket = (CHashPtr *)
+ (((char *) shmem) + MAXALIGN(sizeof(CHashTableData)));
+ nbuckets = CHashTableNBuckets(table);
+ table->extra = &table->bucket[nbuckets];
+
+ /* Arena follows the various lists. */
+ ngarbage = CHashTableNGarbage(table);
+ nfreelists = CHashTableNFreeLists(table);
+ nextra = ngarbage + nfreelists;
+ table->arena = (void *) (&table->extra[nextra]);
+
+ /* Initialize all three sets of lists to empty. */
+ for (i = 0; i < nbuckets; ++i)
+ table->bucket[i] = InvalidCHashPtr;
+ for (i = 0; i < nextra; ++i)
+ table->extra[i] = InvalidCHashPtr;
+
+ /* Put all arena elements on the free lists. */
+ for (i = 0; i < table->arena_limit; ++i)
+ {
+ CHashPtr *f = CHashTableGetFreeList(table, i % nfreelists);
+ CHashNode *n = CHashTableGetRaw(table, i);
+
+ n->un.gcnext = *f;
+ *f = MakeCHashPtr(i);
+ }
+
+ /*
+ * Copy table (with pointers now filled in) to shared memory. This is
+ * arguably unnecessary when not using EXEC_BACKEND, but we do it anyway.
+ */
+ memcpy(shmem, table, sizeof(CHashTableData));
+
+ return table;
+}
+
+/*
+ * Search a concurrent hash table. entry should be a block of memory large
+ * enough to hold a complete entry, with just the key portion filled in. If
+ * a matching entry is found, this function will fill in the rest of the entry
+ * from the data in the hash table and return true. If not, it will return
+ * false.
+ */
+bool
+CHashSearch(CHashTable table, void *entry)
+{
+ uint32 hashcode = hash_any(entry, table->desc.key_size);
+ uint32 bucket = hashcode & table->bucket_mask;
+ CHashPtr *b = &table->bucket[bucket];
+ CHashScanResult scan;
+
+ /* Prevent garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] == NULL);
+ MyProc->hazard[0] = CHashTableGetGarbageByBucket(table, bucket);
+ pg_memory_barrier();
+
+ /* Scan bucket and return data from any matching entry. */
+ CHashBucketScan(table, b, hashcode, entry, &scan);
+ if (scan.found)
+ memcpy(((char *) entry) + table->desc.key_size,
+ CHashNodeGetItem(scan.target_node) + table->desc.key_size,
+ table->desc.element_size - table->desc.key_size);
+
+ /* Allow garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] != NULL);
+ pg_memory_barrier();
+ MyProc->hazard[0] = NULL;
+
+ CHashTableIncrementStatistic(table, CHS_Search);
+ if (!scan.found)
+ CHashTableIncrementStatistic(table, CHS_Search_Failed);
+ return scan.found;
+}
+
+/*
+ * Insert into a concurrent hash table. entry should be the complete entry
+ * to be inserted. If no duplicate entry is found in the table, this function
+ * will insert the entry and return true. Otherwise, the duplicate entry's
+ * value will be copied into the supplied entry and the function will return
+ * false.
+ *
+ * The caller is responsible for ensuring that no inserts are performed into
+ * a hash table which is at capacity. The behavior of such an insert is
+ * undefined (the actual behavior is that the insert may either succeed,
+ * degrading performance; or CHashAllocate may enter a tight loop until such
+ * time as an element is deleted).
+ */
+bool
+CHashInsert(CHashTable table, void *entry)
+{
+ uint32 hashcode = hash_any(entry, table->desc.key_size);
+ uint32 bucket = hashcode & table->bucket_mask;
+ CHashPtr *b = &table->bucket[bucket];
+ CHashPtr new;
+ CHashNode *nnew;
+ CHashScanResult scan;
+
+ /*
+ * Allocate and initialize a new entry, on the assumption that the insert
+ * will succeed. If it ends up failing, we must be sure to put this back
+ * on some free list, lest it be permanently leaked.
+ */
+ new = CHashAllocate(table);
+ nnew = CHashTableGetNode(table, new);
+ nnew->un.hashcode = hashcode;
+ memcpy(CHashNodeGetItem(nnew), entry, table->desc.element_size);
+
+ /* Prevent garbage collection for this bucket. */
+ MyProc->hazard[0] = CHashTableGetGarbageByBucket(table, bucket);
+ pg_memory_barrier();
+
+ /*
+ * Scan the bucket. If we don't find a match, use compare-and-swap to
+ * insert the new node at the insert position. If we do find a match,
+ * return the data to the caller.
+ */
+retry:
+ CHashBucketScan(table, b, hashcode, entry, &scan);
+ if (scan.found)
+ memcpy(((char *) entry) + table->desc.key_size,
+ CHashNodeGetItem(scan.target_node) + table->desc.key_size,
+ table->desc.element_size - table->desc.key_size);
+ else
+ {
+ /*
+ * We didn't find a matching element, so use compare-and-swap to
+ * attempt to insert the new element we've prepared. If this fails,
+ * someone concurrently inserted or deleted an element. The correct
+ * insertion point might have changed, or the key we're trying to
+ * insert might now be present when it wasn't before, so we'll have
+ * to search the bucket chain anew.
+ *
+ * There is a risk of starvation here, but we hope it will not happen
+ * in practice. Contention for inserting new elements should be
+ * spread out pretty much evenly across N+M possible insertion points,
+ * where N is the number of buckets and M is the number of elements
+ * in the table. Even for a quite modestly sized table this is likely
+ * to exceed the number of CPU cores.
+ */
+ Assert(!CHashPtrIsMarked(scan.target));
+ nnew->next = scan.target;
+ if (!__sync_bool_compare_and_swap(scan.pointer_to_target,
+ scan.target, new))
+ {
+ CHashTableIncrementStatistic(table, CHS_Insert_Retry);
+ goto retry;
+ }
+ }
+
+ /* Allow garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] != NULL);
+ pg_memory_barrier();
+ MyProc->hazard[0] = NULL;
+
+ /*
+ * If the insert failed, add the entry we found to the appropriate
+ * garbage list. We can't simply put this back on the freelist,
+ * because that leads to an ABA problem: some other backend could
+ * read the head of the freelist, go into the tank, and then use
+ * CAS to pop an element. If at that point, it finds the same
+ * element at the head of the freelist but with a different successor,
+ * we're hosed. To prevent that, we ensure that elements are added
+ * to a given freelist only after verifying that any allocations in
+ * progress at the time we popped the freelist have completed. This
+ * guarantees that any allocation still in progress at the time this
+ * element makes it back to the freelist is trying to allocate some
+ * other node.
+ */
+ CHashTableIncrementStatistic(table, CHS_Insert);
+ if (scan.found)
+ {
+ CHashTableIncrementStatistic(table, CHS_Insert_Failed);
+ CHashAddToGarbage(table, bucket, new);
+ }
+
+ /* The insert succeeded if and only if no duplicate was found. */
+ return !scan.found;
+}
+
+/*
+ * Delete from a concurrent hash table. entry need only contain the key field.
+ * Returns true if we find and delete a matching key and false otherwise.
+ */
+bool
+CHashDelete(CHashTable table, void *entry)
+{
+ uint32 hashcode = hash_any(entry, table->desc.key_size);
+ uint32 bucket = hashcode & table->bucket_mask;
+ CHashPtr *b = &table->bucket[bucket];
+ CHashScanResult scan;
+
+ /* Prevent garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] == NULL);
+ MyProc->hazard[0] = CHashTableGetGarbageByBucket(table, bucket);
+ pg_memory_barrier();
+
+ /* Scan bucket. */
+retry:
+ CHashBucketScan(table, b, hashcode, entry, &scan);
+
+ /* If we found it, try to delete it. */
+ if (scan.found)
+ {
+ Assert(!CHashPtrIsMarked(scan.next));
+
+ /* Attempt to apply delete-mark. */
+ if (!__sync_bool_compare_and_swap(&scan.target_node->next,
+ scan.next,
+ CHashPtrMark(scan.next)))
+ {
+ CHashTableIncrementStatistic(table, CHS_Delete_Retry);
+ goto retry;
+ }
+
+ /* Deletion is done; attempt to remove node from list. */
+ if (__sync_bool_compare_and_swap(scan.pointer_to_target,
+ scan.target,
+ scan.next))
+ CHashAddToGarbage(table, bucket, scan.target);
+ else
+ {
+ CHashScanResult cleanup_scan;
+
+ /*
+ * If we weren't able to remove the deleted item, rescan
+ * the bucket to make sure it's really gone. This is just
+ * like a regular bucket scan, except that we don't care
+ * about the results. We're just doing it to achieve the
+ * side-effect of removing delete-marked nodes from the
+ * bucket chain.
+ */
+ CHashTableIncrementStatistic(table, CHS_Cleanup_Scan);
+ CHashBucketScan(table, b, hashcode, entry, &cleanup_scan);
+ }
+ }
+
+ /* Allow garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] != NULL);
+ pg_memory_barrier();
+ MyProc->hazard[0] = NULL;
+
+ /* We're done. */
+ CHashTableIncrementStatistic(table, CHS_Delete);
+ if (!scan.found)
+ CHashTableIncrementStatistic(table, CHS_Delete_Failed);
+ return scan.found;
+}
+
+/*
+ * Provide backend-local statistics to caller.
+ */
+void
+CHashStatistics(CHashTable table, uint64 *stats)
+{
+ memcpy(stats, &table->stats, sizeof(uint64) * CHS_NumberOfStatistics);
+}
+
+/*
+ * Scan one bucket of a concurrent hash table, storing the results in a
+ * CHashScanResult object provided by the caller.
+ *
+ * Caller must suppress garbage collection for the target bucket before
+ * calling this function, and resume it afterwards.
+ *
+ * On return, res->found will be true if a matching item was found and false
+ * otherwise. res->target will be a CHashPtr referencing the matching item,
+ * or the first one following the position where the matching item should have
+ * been; res->pointer_to_target will be a pointer to the memory address from
+ * which res->target was read.
+ *
+ * If res->target is not invalid, then res->target_node is a pointer to its
+ * location in memory. If res->found, then res->next will be the next pointer
+ * of res->target_node; otherwise, it's undefined.
+ */
+static void
+CHashBucketScan(CHashTable table,
+ CHashPtr *start,
+ uint32 hashcode,
+ const void *key,
+ CHashScanResult *res)
+{
+ CHashPtr target;
+ CHashPtr *pointer_to_target;
+ CHashNode *target_node = NULL;
+
+retry:
+ pointer_to_target = start;
+ target = *pointer_to_target;
+ for (;;)
+ {
+ CHashPtr next;
+ uint32 h;
+ int cmp;
+
+ /*
+ * If we've reached the end of the bucket chain, stop; otherwise,
+ * figure out the actual address of the next item.
+ */
+ if (CHashPtrIsInvalid(target))
+ {
+ res->found = false;
+ break;
+ }
+ target_node = CHashTableGetNode(table, target);
+
+ /*
+ * target may have been fetched from an arena entry that could be
+ * concurrently modified, so a dependency barrier is required before
+ * dereferencing the derived pointer.
+ */
+ pg_read_barrier_depends();
+ next = target_node->next;
+
+ /*
+ * For simplicity, any bucket scan, even if it's servicing a search,
+ * will attempt to remove delete-marked items from the bucket. This
+ * ensures that delete-marked elements are removed from bucket chains
+ * as quickly as possible and reduces code duplication. See
+ * CHashDelete for further comments about why delete-marking is
+ * necessary and how it allows safe deletion.
+ */
+ if (CHashPtrIsMarked(next))
+ {
+zap:
+ if (__sync_bool_compare_and_swap(pointer_to_target,
+ target,
+ CHashPtrUnmark(next)))
+ {
+ /*
+ * No one else can possibly have modified target_node->next,
+ * because such modifications are not allowed after a
+ * delete-mark has been applied. Thus, if we just keep
+ * following the next pointers, we're guaranteed to visit
+ * all non-deleted items (and possibly some deleted items)
+ * that were present at the time we began the scan.
+ */
+ CHashTableIncrementStatistic(table, CHS_Scan_Expunge);
+ CHashAddToGarbage(table, hashcode & table->bucket_mask,
+ target);
+ target = CHashPtrUnmark(next);
+ }
+ else
+ {
+ /*
+ * If the previous node has been delete-marked, we can't
+ * further update the next-pointer, so restart the scan.
+ * Otherwise, we know that some other backend removed one
+ * or more deleted nodes from the bucket chain just as we
+ * were trying to do, and we can simply continue the scan
+ * from wherever the previous node is pointing now. It's
+ * possible that some concurrent inserts have happened, too,
+ * but that's OK; we can view those as having happened "before"
+ * whatever operation this scan is supporting.
+ *
+ * Although starvation is a theoretical possibility here if
+ * we are forced to retry repeatedly, even a single retry is
+ * vanishingly unlikely in practice. It requires some other
+ * backend to delete both the node that we're looking at and
+ * the node which precedes it before we advance to the next
+ * node. That could certainly happen occasionally, but we'd
+ * have to be pretty unlucky to have it happen even twice in
+ * a row.
+ */
+ CHashTableIncrementStatistic(table, CHS_Scan_Expunge_Fail);
+ target = *pointer_to_target;
+ if (CHashPtrIsMarked(target))
+ {
+ CHashTableIncrementStatistic(table, CHS_Scan_Restart);
+ goto retry;
+ }
+ }
+ continue;
+ }
+
+ /*
+ * Bucket chains are kept in order, so that there is exactly one legal
+ * point at which any given key can be inserted. The ordering is by
+ * hashcode first, and then by memcmp ordering of the keys involved.
+ */
+ h = target_node->un.hashcode;
+ if (h == hashcode)
+ cmp = memcmp(CHashNodeGetItem(target_node), key,
+ table->desc.key_size);
+ else if (h > hashcode)
+ cmp = 1;
+ else
+ cmp = -1;
+
+ /*
+ * If cmp < 0, then we haven't yet reached the point at which we'd
+ * expect to find the key, so we must continue the scan. If cmp == 0,
+ * we've found the key and can stop. If cmp > 0, we've either passed
+ * the point where we expect to find the key OR someone delete-marked
+ * the item and overwrote the hashcode with a gcnext pointer. In the
+ * latter case we must take care not to be fooled into stopping the
+ * scan early.
+ */
+ if (cmp >= 0)
+ {
+ if (cmp == 0)
+ {
+ res->found = true;
+ res->next = next;
+ break;
+ }
+ else
+ {
+ /*
+ * pg_read_barrier() prevents the reread of the next pointer
+ * from being speculated ahead of the read of the hash value.
+ */
+ pg_read_barrier();
+ next = target_node->next;
+ if (CHashPtrIsMarked(next))
+ goto zap;
+ res->found = false;
+ break;
+ }
+ }
+
+ /* Continue scan from next node. */
+ pointer_to_target = &target_node->next;
+ target = next;
+ }
+
+ /* Send results back to caller. */
+ res->target = target;
+ res->pointer_to_target = pointer_to_target;
+ res->target_node = target_node;
+}
+
+/*
+ * Allocate an arena slot for a new item to be inserted into a hash table.
+ *
+ * We don't want to wait until every single free-list is completely empty
+ * before beginning to garbage collect, because that could result in very
+ * fast allocation followed by a storm of garbage collection activity.
+ * It could also lead to every inserting backend ganging up on the only
+ * non-empty freelist.
+ *
+ * To avoid that, we check free lists and garbage lists in alternation.
+ * We always start with the same free list - which one is based on our
+ * backend ID - but we try to round-robin over all the available garbage
+ * lists. Whenever we successfully garbage collect, we put the recovered
+ * items on our own free list. In this way, if there's only one backend
+ * active, it will typically find a free buffer in the first place it looks:
+ * its own free list. It will also settle into a pattern of garbage
+ * collecting the garbage list which it has visited least recently, which
+ * is what we want.
+ */
+static CHashPtr
+CHashAllocate(CHashTable table)
+{
+ uint32 f_current;
+ CHashPtr new;
+
+ /* Pick a starting freelist based on our backend ID. */
+ f_current = ((uint32) MyBackendId) % CHashTableNFreeLists(table);
+
+ /* If this process hasn't initialized gc_next yet, do that now. */
+ if (table->gc_pid != MyProcPid)
+ {
+ table->gc_pid = MyProcPid;
+ table->gc_next = ((uint32) MyProcPid) % CHashTableNGarbage(table);
+ }
+
+ /* Loop until we allocate a buffer. */
+ for (;;)
+ {
+ CHashPtr *b;
+
+ /*
+ * Try to pop a buffer from a freelist using compare-and-swap.
+ *
+ * Note that this is only safe if it's impossible for the next pointer
+ * of the first element on the list to change between the time when
+ * we read it and the time we use CAS to pop it off the list. This
+ * means that we can't allow any element that is currently on this
+ * freelist to be allocated, marked as garbage, garbage collected,
+ * and returned back to this freelist before we finish the CAS.
+ *
+ * If we attempt to pop the free-list and fail, we retry immediately
+ * with the same free-list. This reduces the frequency with which
+ * we're obliged to update our hazard pointers, which is a material
+ * savings due to the associated memory barrier.
+ */
+ b = CHashTableGetFreeList(table, f_current);
+ MyProc->hazard[0] = b;
+ pg_memory_barrier();
+ new = *b;
+ while (!CHashPtrIsInvalid(new))
+ {
+ CHashNode *n = CHashTableGetNode(table, new);
+
+ /*
+ * n is computed from table->freelist[f_current], which could
+ * be modified by concurrent activity, so we need a dependency
+ * barrier here.
+ */
+ pg_read_barrier_depends();
+ if (__sync_bool_compare_and_swap(b, new, n->un.gcnext))
+ return new;
+ CHashTableIncrementStatistic(table, CHS_Allocate_Fail);
+ new = *b;
+ }
+
+ /* Attempt garbage collection. */
+ new = CHashAllocateViaGC(table);
+ if (!CHashPtrIsInvalid(new))
+ return new;
+
+ /* Advance to next freelist. */
+ f_current = (f_current + 1) % CHashTableNFreeLists(table);
+ }
+}
+
+/*
+ * Attempt to satisfy an allocation request via garbage collection.
+ */
+static CHashPtr
+CHashAllocateViaGC(CHashTable table)
+{
+ uint32 f_home;
+ CHashPtr *b;
+ CHashPtr *fh;
+ CHashPtr fhead;
+ CHashPtr garbage;
+ CHashPtr new;
+ CHashNode *n;
+ uint32 i;
+
+ /* Pick a target freelist based on our backend ID. */
+ f_home = ((uint32) MyBackendId) % CHashTableNFreeLists(table);
+ fh = CHashTableGetFreeList(table, f_home);
+
+ /* Select target garbage list. */
+ table->gc_next = (table->gc_next + 1) % CHashTableNGarbage(table);
+ b = CHashTableGetGarbageList(table, table->gc_next);
+ garbage = *b;
+
+ /* If list is empty, fail. */
+ if (CHashPtrIsInvalid(garbage))
+ return InvalidCHashPtr;
+
+ /* If we're unable to empty the list via compare-and-swap, fail. */
+ if (!__sync_bool_compare_and_swap(b, garbage, InvalidCHashPtr))
+ {
+ CHashTableIncrementStatistic(table, CHS_Garbage_Dequeue_Fail);
+ return InvalidCHashPtr;
+ }
+
+ /*
+ * Before moving elements removed from the garbage list to the
+ * freelist, we must wait until (1) all bucket scans that might
+ * still see elements on the freelist as part of the bucket chain
+ * have completed and (2) all allocations that might see an old
+ * version of the freelist containing one of the elements to be
+ * garbage collected have completed.
+ *
+ * Note: We can't begin this operation until the clearing of the
+ * garbage list has been committed to memory, but since that was
+ * done using an atomic operation no explicit barrier is needed
+ * here.
+ *
+ * Note: We could have a "soft" version of this that merely
+ * requeues the garbage if it's not immediately recyclable, but
+ * it's not clear that we need such a thing. On the flip side we
+ * might want to eventually enter a longer sleep here, or PANIC,
+ * but it's not clear exactly how to calibrate that.
+ */
+ CHashTableIncrementStatistic(table, CHS_GC);
+ MyProc->hazard[0] = NULL;
+ for (i = 0; i < ProcGlobal->allProcCount; i++)
+ {
+ volatile PGPROC *proc = &ProcGlobal->allProcs[i];
+ void *hazard;
+
+ hazard = proc->hazard[0];
+ if (hazard == b || hazard == fh)
+ {
+ CHashTableIncrementStatistic(table, CHS_GC_Spin);
+ do
+ {
+ hazard = proc->hazard[0];
+ } while (hazard == b || hazard == fh);
+ }
+ }
+
+ /* Remove one item from list to satisfy current allocation. */
+ new = garbage;
+ n = CHashTableGetNode(table, new);
+ pg_read_barrier_depends();
+ fhead = n->un.gcnext;
+
+ if (CHashPtrIsInvalid(fhead))
+ {
+ /*
+ * We have reclaimed exactly one node, so there's nothing to put
+ * back on the free list. In this case (only) we need a
+ * memory barrier, because the reads above must complete
+ * before we overwrite n->un.gcnext with a new hashcode.
+ * (This is only needed when we reclaim exactly one node,
+ * because in any other case we'll do a compare-and-swap
+ * before returning, which implies a full barrier.)
+ */
+ pg_memory_barrier();
+ CHashTableIncrementStatistic(table, CHS_GC_Reclaim_Skipped);
+ }
+ else if (__sync_bool_compare_and_swap(fh, InvalidCHashPtr, fhead))
+ {
+ /*
+ * Our free list is empty, and we've successfully pushed the
+ * reclaimed nodes onto it. So we're done.
+ */
+ CHashTableIncrementStatistic(table, CHS_GC_Reclaim_Fast);
+ }
+ else
+ {
+ CHashPtr fcurrent;
+ CHashPtr fnext;
+ CHashPtr oldhead;
+
+ /* Walk list of reclaimed elements to end. */
+ fcurrent = fhead;
+ for (;;)
+ {
+ n = CHashTableGetNode(table, fcurrent);
+ fnext = n->un.gcnext;
+ if (CHashPtrIsInvalid(fnext))
+ break;
+ fcurrent = fnext;
+ }
+
+ /* Push reclaimed elements onto home free list. */
+ for (;;)
+ {
+ oldhead = *fh;
+ n->un.gcnext = oldhead;
+ if (__sync_bool_compare_and_swap(fh, oldhead, fhead))
+ break;
+ CHashTableIncrementStatistic(table, CHS_GC_Reclaim_Retry);
+ }
+ }
+
+ /* Return the element we saved for ourselves. */
+ return new;
+}
+
+/*
+ * Add an arena slot to the appropriate garbage list.
+ *
+ * The next garbage collection cycle for the affected bucket will move it
+ * to the free list. We can't do that immediately because there might be
+ * someone traversing the list who is counting on being able to follow the
+ * next pointer. It's OK to clobber the hash value, though, since a spurious
+ * failure to match an already-deleted item shouldn't cause any problems;
+ * this is why gcnext can share space with the hash value.
+ */
+static void
+CHashAddToGarbage(CHashTable table, uint32 bucket, CHashPtr c)
+{
+ CHashPtr g;
+ CHashNode *n;
+ CHashPtr *garbage;
+
+ n = CHashTableGetNode(table, c);
+ garbage = CHashTableGetGarbageByBucket(table, bucket);
+
+ while (1)
+ {
+ g = *garbage;
+ n->un.gcnext = g;
+ if (__sync_bool_compare_and_swap(garbage, g, c))
+ break;
+ CHashTableIncrementStatistic(table, CHS_Garbage_Enqueue_Retry);
+ }
+}
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index b36705b..6ef779b 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -20,4 +20,12 @@
*/
#include "port/atomics.h"
+/*
+ * If dependency barriers are undefined, we define them as a no-op. The only
+ * known platform where anything more is required is DEC Alpha.
+ */
+#if !defined(pg_read_barrier_depends)
+#define pg_read_barrier_depends()
+#endif
+
#endif /* BARRIER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 0e69b63..4c6fac8 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -96,20 +96,6 @@ typedef struct buftag
)
/*
- * The shared buffer mapping table is partitioned to reduce contention.
- * To determine which partition lock a given tag requires, compute the tag's
- * hash code with BufTableHashCode(), then apply BufMappingPartitionLock().
- * NB: NUM_BUFFER_PARTITIONS must be a power of 2!
- */
-#define BufTableHashPartition(hashcode) \
- ((hashcode) % NUM_BUFFER_PARTITIONS)
-#define BufMappingPartitionLock(hashcode) \
- (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
- BufTableHashPartition(hashcode)].lock)
-#define BufMappingPartitionLockByIndex(i) \
- (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)
-
-/*
* BufferDesc -- shared descriptor/state data for a single shared buffer.
*
* Note: buf_hdr_lock must be held to examine or change the tag, flags,
@@ -200,9 +186,9 @@ extern void StrategyInitialize(bool init);
extern Size BufTableShmemSize(int size);
extern void InitBufTable(int size);
extern uint32 BufTableHashCode(BufferTag *tagPtr);
-extern int BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
-extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern int BufTableLookup(BufferTag *tagPtr);
+extern int BufTableInsert(BufferTag *tagPtr, int buf_id);
+extern void BufTableDelete(BufferTag *tagPtr);
/* localbuf.c */
extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 02c8f1a..f98be4d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -136,7 +136,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
*/
/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
+#define NUM_BUFFER_PARTITIONS 0
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index c23f4da..07f9761 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -58,6 +58,14 @@ struct XidCache
#define FP_LOCK_SLOTS_PER_BACKEND 16
/*
+ * Some lock-free algorithms require each backend process to be able to
+ * advertise a certain number of "hazard pointers" in shared memory.
+ * Right now one per backend is enough for our purpose, but some
+ * algorithms require more.
+ */
+#define NUM_HAZARD_POINTERS 1
+
+/*
* Each backend has a PGPROC struct in shared memory. There is also a list of
* currently-unused PGPROC structs that will be reallocated to new backends.
*
@@ -142,6 +150,12 @@ struct PGPROC
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
* lock */
+
+ /*
+ * Hazard pointers. Currently one is enough, though some algorithms
+ * require a few more.
+ */
+ void *hazard[NUM_HAZARD_POINTERS];
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index 745eb7e..4ff8415 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -40,6 +40,7 @@ extern void InitShmemIndex(void);
extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
HASHCTL *infoP, int hash_flags);
extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemAttachStruct(const char *name);
extern Size add_size(Size s1, Size s2);
extern Size mul_size(Size s1, Size s2);
diff --git a/src/include/utils/chash.h b/src/include/utils/chash.h
new file mode 100644
index 0000000..ee0573c
--- /dev/null
+++ b/src/include/utils/chash.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * chash.h
+ * Concurrent shared-memory hash table.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/chash.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHASH_H
+#define CHASH_H
+
+/* Everything caller must supply to set up a concurrent hash table. */
+typedef struct
+{
+ const char *shmem_name; /* shared memory name for this hash table */
+ uint32 capacity; /* maximum size of hash table */
+ uint16 element_size; /* size of each element */
+ uint16 key_size; /* size of each key */
+} CHashDescriptor;
+
+/* Concurrent hash table statistics. */
+typedef enum
+{
+ CHS_Search, /* search */
+ CHS_Search_Failed, /* search failed (no such key) */
+ CHS_Insert, /* insert */
+ CHS_Insert_Failed, /* insert failed (duplicate key) */
+ CHS_Insert_Retry, /* insert retried (concurrent update) */
+ CHS_Delete, /* delete */
+ CHS_Delete_Failed, /* delete failed (no such key) */
+ CHS_Delete_Retry, /* delete retried (concurrent update) */
+ CHS_Scan_Expunge, /* scan expunged deleted item */
+ CHS_Scan_Expunge_Fail, /* scan failed to expunge */
+ CHS_Scan_Restart, /* concurrent deletes forced a scan restart */
+ CHS_Cleanup_Scan, /* concurrent update forced a cleanup scan */
+ CHS_Allocate_Fail, /* allocation failed to pop freelist */
+ CHS_Garbage_Enqueue_Retry, /* enqueue on garbage list retried */
+ CHS_Garbage_Dequeue_Fail, /* dequeue of garbage failed */
+ CHS_GC, /* garbage collection cycle */
+ CHS_GC_Spin, /* GC spun waiting for concurrent process */
+ CHS_GC_Reclaim_Skipped, /* GC recovered only one item */
+ CHS_GC_Reclaim_Fast, /* GC put garbage on freelist via fast path */
+ CHS_GC_Reclaim_Retry, /* enqueue of garbage on freelist retried */
+ CHS_NumberOfStatistics /* number of statistics */
+} CHashStatisticsType;
+
+/* Human-readable names for statistics. */
+extern char *CHashStatisticsNames[];
+
+/* Opaque handle for a concurrent hash table. */
+struct CHashTableData;
+typedef struct CHashTableData *CHashTable;
+
+/* Initialization functions. */
+extern CHashTable CHashBootstrap(CHashDescriptor *desc);
+extern Size CHashEstimateSize(CHashTable table);
+extern CHashTable CHashInitialize(CHashTable table, CHashDescriptor *desc);
+
+/* Accessor functions. */
+extern bool CHashInsert(CHashTable table, void *entry);
+extern bool CHashDelete(CHashTable table, void *key);
+extern bool CHashSearch(CHashTable table, void *entry);
+extern void CHashStatistics(CHashTable table, uint64 *stats);
+
+#endif /* CHASH_H */
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit, and I do see tremendous time spent in dynahash.c. It's nearly
all cache misses in my tests though.
I took the basic infrastructure
from before and used it to replace the buffer table. Patch is
attached.
That's pretty cool. The algorithm is complex enough that I haven't fully
understood it yet, but it sounds sane at first glance.
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something more than a compiler barrier,
they're still usually an order of magnitude or so cheaper than full barriers.
The algorithm is not strictly lock-free for reasons explained in the
comments in chash.c, but it's a lot less locky than what we have now,
so in theory you might think that would be a good thing.
I don't see much reason to strive for full lock-freeness. If the locks
aren't noticeable in real world scenarios - who cares?
I haven't had time to do much performance testing yet, but it looks
like this may be slower at low client counts and faster at high client
counts. However, my results weren't real reproducible, and I haven't
done comprehensive testing yet. What was really bizarre is that I
couldn't really pin down the cause of the slowness at low client
counts; a quick perf profile showed overhead concentrated in
CHashBucketScan, basically memory access latency for walking the
bucket chain. But the table is built to have a load factor of 1, so I
can't see why that should amount to much, or why it should be
significantly worse than for dynahash.
With regard to using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
employing builds for some comparable load.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
I took the basic infrastructure from before and used it to replace the
buffer table. Patch is attached.
Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.
How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit,
There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit,
There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).
Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...
I also think we need to be a bit careful about optimizing too much for
stock pgbench with working set >> s_b. The uniform readonly access
pattern you're profiling here isn't something super realistic. Still
valuable, but we should take it with a grain of salt.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
I took the basic infrastructure from before and used it to replace the
buffer table. Patch is attached.
Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.
How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.
It doesn't look particularly dangerous to me. Famous last words.
Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it. If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up. So we need to recheck. But assuming we do
that, I don't see an issue. In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash. We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser. This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
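Concretely, the pattern I have in mind would look something like this
sketch (entirely hypothetical types and helper names, not code from the
patch):
/* Sketch of "lockless lookup, pin, then recheck"; everything here is invented. */
#include <stdbool.h>
#include <string.h>

typedef struct { unsigned rnode, forknum, blocknum; } Tag;

typedef struct
{
    Tag  tag;        /* which page this slot currently holds */
    int  refcount;   /* pin count, manipulated with atomic ops */
} Slot;

extern Slot slots[];                          /* hypothetical slot array */
extern int  lockless_lookup(const Tag *tag);  /* returns slot id, or -1 */

static Slot *
lookup_and_pin(const Tag *tag)
{
    for (;;)
    {
        int   id = lockless_lookup(tag);
        Slot *s;

        if (id < 0)
            return NULL;                       /* not in the table */

        s = &slots[id];
        __sync_fetch_and_add(&s->refcount, 1); /* take our pin (full barrier) */

        /*
         * The slot may have been evicted and renamed between the lookup
         * and the pin; if the tag no longer matches, drop the pin and
         * retry the lookup.
         */
        if (memcmp(&s->tag, tag, sizeof(Tag)) == 0)
            return s;

        __sync_fetch_and_sub(&s->refcount, 1);
    }
}
The point is just that the lookup result is treated as a hint that must
be revalidated once the pin makes eviction impossible.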
The other application I had in mind for chash was SLRU stuff. I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there. You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date. Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.
I'm not quite sure what fundamental dangers you're thinking about
here, but from what I understand from reading the literature, doing
lookups in a lock-free hash table needs to be thought about in a sort
of relativistic way. The different CPUs have slightly different views
of the world, so any answer you get may be obsolete by the time you
get it, and may according to some other CPU's view of the world have
been obsolete even at the time that it was copied. That's OK, because
it's just equivalent to some other CPU doing its hash table update
slightly sooner or later than it actually did it, within the bounds
imposed by memory barriers, which it could have done anyway. There is
no one source of truth. The result of all this is that some of the
synchronization responsibility is transferred to the caller. That's
obviously a bit more complex to reason about, but it hasn't stopped
lock-free hash tables from being wildly popular data structures.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Oct 14, 2014 at 8:38 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 20:30:45 +0530, Amit Kapila wrote:
On Tue, Oct 14, 2014 at 7:55 PM, Andres Freund <andres@2ndquadrant.com>
wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week
that
he was seeing contention inside the dynahash machinery, which
inspired
me to go back and update the patch.
Interestingly I've benchmarked similar loads, even on the same machine
as Amit,
There is one catch here, for these profiles I am using Power-8 m/c
and the load is slightly higher (5000 scale factor).
Ah, right. I don't think the scale factor changes much, but the
different architecture certainly does. As I said elsewhere, I would not
believe these profiles much until they're actually done with optimized
code...
Today, that m/c is not accessible, so it will take a day or so to get the
optimized profiles; I will post them once I am able to do so.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something but a compiler barrier,
they're still usually a magnitude or so cheaper than full barriers.
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
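To spell that out, the ordering constraint is roughly the following
(just a standalone sketch with made-up names, not the patch's actual
CHashSearch, using the same gcc builtins the patch uses):
extern void *my_hazard_pointer;            /* advertised in shared memory */
extern void *next_of(void *node);          /* hypothetical chain walker */
extern int   matches(void *node, const void *key);

static int
guarded_scan(void **bucket_head, const void *key)
{
    void *node;

    my_hazard_pointer = bucket_head;   /* STORE: advertise the bucket */

    /*
     * Full (store/load) barrier: the hazard pointer must be globally
     * visible before any of the LOADs below happen, or a concurrent
     * garbage collector can miss it and recycle a node under us.  A
     * read or write barrier alone does not give store/load ordering.
     */
    __sync_synchronize();

    for (node = *bucket_head; node != NULL; node = next_of(node))
        if (matches(node, key))
            break;

    /*
     * All LOADs of the chain must complete before the hazard pointer is
     * cleared (load/store ordering), hence the second barrier.
     */
    __sync_synchronize();
    my_hazard_pointer = NULL;          /* STORE: allow GC again */

    return node != NULL;
}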
I don't see much reason to strive for full lock-free ness. If the locks
aren't noticeable in real world scenarios - who cares?
That's my feeling too. ISTR that when I stress-tested the hash table
back in 2012 when I started this code, the concurrency was far better
than dynahash with 16 lwlock partitions. I don't remember off hand
exactly how I tested that, but it was with the code in hashtest.c. I
also seem to recall that it was possible to make the freelists get
VERY hot, but of course that was a workload where hash table updates
were the only thing happening, so I assume that on a workload where
we're actually doing work (like, copying the data in and out of kernel
buffers) that's not going to be a real problem unless you have
thousands of cores. But we'll see.
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
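The degradation I mean is the classic tombstone problem; a toy,
single-threaded sketch (table layout and markers invented):
#include <stdint.h>

#define NSLOTS    1024          /* power of two */
#define EMPTY     0             /* never used; reserved marker */
#define TOMBSTONE 1             /* deleted; probes must not stop here */

static uint32_t table[NSLOTS];  /* 0 = empty, 1 = tombstone, else a key */

static int
lookup(uint32_t key, uint32_t hash)
{
    uint32_t i = hash & (NSLOTS - 1);

    /* Assumes at least one EMPTY slot always remains in the table. */
    for (;;)
    {
        if (table[i] == key)
            return (int) i;
        if (table[i] == EMPTY)
            return -1;          /* a true hole ends the probe sequence */
        /* TOMBSTONE or some other key: keep probing. */
        i = (i + 1) & (NSLOTS - 1);
    }
}

static void
remove_key(uint32_t key, uint32_t hash)
{
    int i = lookup(key, hash);

    /*
     * We can't simply set the slot back to EMPTY, because that would
     * terminate probe sequences that pass through it and make later keys
     * unreachable.  Tombstones accumulate instead, and once most slots
     * are tombstones every miss degrades toward a linear scan unless the
     * table is rebuilt.
     */
    if (i >= 0)
        table[i] = TOMBSTONE;
}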
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?
It'd be quite interesting to see a perf stat -ddd of dynahash.c/chash
employing builds for some comparable load.
I'm hoping you can test this out on x86. All I have to work with are
the POWER machines, and the characteristics of those are somewhat
different. I can test it there, though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-14 11:08:08 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 10:47 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
I took the basic infrastructure from before and used it to replace the
buffer table. Patch is attached.
Hm. Unless I missed something you pretty much completely eradicated
locks from the buffer mapping code? That'd be pretty cool, but it's also
scary.
How confident are you that your conversion is correct? Not in the sense
that there aren't any buglets, but fundamentally.
It doesn't look particularly dangerous to me. Famous last words.
Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it.
What I'm afraid of is that there are hidden assumptions about the
consistency provided by the mapping locks.
If we don't do that, I think the only consequence is
that, by the time we get the buffer pin, the buffer might no longer be
the one we looked up. So we need to recheck. But assuming we do
that, I don't see an issue. In fact, it occurred to me while I was
cobbling this together that we might want to experiment with that
change independently of chash. We already know (from the
StrategyGetBuffer locking changes) that holding system-wide locks to
prevent people from twaddling the state of individual buffers can be a
loser. This case isn't nearly as bad, because we're only pinning one
buffer, rather than potentially all of them multiple times, and the
lock we're holding only affects 1/128th of the buffers, not all of
them.
Also IIRC at least linux has targeted wakeup/time transfer logic when
using semaphores - and doesn't for spinlocks. So if you don't happen
to sleep while the lwlock's spinlock is held, the result will be
better. Only that we'll frequently sleep within that for very frequently
taken locks...
The other application I had in mind for chash was SLRU stuff. I
haven't tried to write the code yet, but fundamentally, the same
considerations apply there. You do the lookup locklessly to find out
which buffer is thought to contain the SLRU page you want, but by the
time you lock the page the answer might be out of date. Assuming that
this doesn't happen many times in a row and that lookups are
relatively cheap, you could get rid of having any centralized lock for
SLRU, and just have per-page locks.
Hm. I have to admit I haven't really looked enough into the slru code to
judge this. My impression so far wasn't that the locking itself was the
problem in most scenarios - that's not what you've seen?
I'm not quite sure what fundamental dangers you're thinking about
here
Oh, only in the context of the bufmgr.c conversion. Not more
generally. I agree that a lockfree hashtable is something quite
worthwhile to have.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 11:31 AM, Andres Freund <andres@2ndquadrant.com> wrote:
It doesn't look particularly dangerous to me. Famous last words.
Basically, I think what we're doing right now is holding the buffer
mapping lock so that the buffer can't be renamed under us while we're
pinning it.
What I'm afraid of is that there's hidden assumptions about the
consistency provided by the mapping locks.
That's certainly worth checking for, but I think the only code that
needs to be checked is the code that would formerly have run while
holding said lock. And there isn't that much of that.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-14 11:19:16 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 10:25 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The key idea here is that lookups are done without any locks, only
memory barriers; and inserts and deletes are done using atomic ops.
Hm. I quickly looked and I see that you use two full barriers for every
lookup. That's pretty expensive. I think this should be doable using
only read/write barriers on the lookup side? Even on architectures where
read/write memory barriers have to be something but a compiler barrier,
they're still usually a magnitude or so cheaper than full barriers.
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough.
You can try to be a bit smarter than a plain open addressing
approach. But yes, it has disadvantages.
There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
Might be worthwhile to try. I know that both my hand-rolled open
addressing and my hand-rolled chaining approach have significantly fewer
cache misses than dynahash for the same amount of work.
Hm. Possibly that's at least partially because of the segment stuff in
dynahash?
Anyway, you can get the size of the entire hashtable down quite a
bit. Primarily because there's no need to store a next pointer. There's
also not really any need for the 'hashvalue' in the bufmgr case
imo.
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?
I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there's only few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.
So it's not really the arena, but that you chose not to store the first
element in a chain inline...
Greetings,
Andres Freund
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain. It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads. We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.
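For reference, the two x86-64 full-barrier idioms I mean look roughly
like this (just a sketch for benchmarking; in the tree this choice is
hidden behind pg_memory_barrier(), and which one wins is
microarchitecture-dependent):
static inline void
full_barrier_mfence(void)
{
    __asm__ __volatile__ ("mfence" ::: "memory");
}

static inline void
full_barrier_lock_addl(void)
{
    /* A locked read-modify-write of the top of the stack is also a full
     * barrier on x86; whether it beats mfence varies by CPU. */
    __asm__ __volatile__ ("lock; addl $0,(%%rsp)" ::: "memory", "cc");
}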
I tested PPC, because that's what I have. I think we're emitting two
sync instructions there, but maybe lwsync or isync would be enough in
one or both cases. The first of these links indicates that lwsync
ought to be enough for both cases; the second says that we need a sync
after setting the hazard pointer but that an lwsync is enough before
clearing it. Or that's my reading anyway.
http://www.ibm.com/developerworks/systems/articles/powerpc.html
http://peeterjoot.wordpress.com/2010/07/23/use-of-lwsync-instead-of-isync-as-a-mutex-acquire-memory-barrier/
My guess is that the additional indirection via the arena explains the
difference in cache misses? But I might be completely off here.
The arena makes the calculation of the pointer address less
predictable, which might make things more difficult for the CPU
pipeline. But aside from that, I think it's the same basic idea: you
bounce from some sort of bucket header to the first element of the
chain and then, hopefully, you're done. Am I missing something?
I haven't really read much of the code yet (wrote most of this email on
my workstation before embarking on a plane), but afaics there's one
indirection to the bucket, and then one to the first node in the linked
list. Right?
In dynahash it's first to the segment, but there's only few, so it's
likely to be cached. Then to the bucket - which, afaics, already can
contain the key.
So it's not really the arena, but that you chose not to store the first
element in a chain inline...
Hmm, you have a point. I think the reason I did it that way is
because I don't know how to make the memory management work out
otherwise. If you store the first element inline, then even an empty
bucket uses up as much memory space as one with a single element.
More seriously, it breaks the deletion algorithm. There's no good way
to atomically swap out the bucket header if it is located in a fixed
position and bigger than the size of an object the machine can manipulate
atomically.
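The layout trade-off is roughly this (field names and sizes invented for
illustration only):
#include <stdint.h>

#define ELEMENT_SIZE 24         /* arbitrary; roughly a buffer tag plus an id */

typedef struct ChainNode
{
    struct ChainNode *next;
    uint32_t          hashcode;
    char              item[ELEMENT_SIZE];
} ChainNode;

/* Chained bucket: the header is one pointer, so a delete can splice a
 * node out (or swap the head) with a single compare-and-swap. */
typedef struct
{
    ChainNode *head;
} ChainedBucket;

/* Inline-first-element bucket: the header is now much bigger than
 * anything a single compare-and-swap can replace, and an empty bucket
 * still occupies a full element's worth of space. */
typedef struct
{
    uint32_t   hashcode;
    char       item[ELEMENT_SIZE];
    ChainNode *overflow;
} InlineBucket;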
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.
Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.
The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation it sends that backend a signal to ensure
it sets the latest generation.
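As I understand the proposal, it amounts to something like the following
epoch/generation sketch (entirely hypothetical, not code from any patch;
the signalling of stragglers is left out):
#include <stdint.h>

#define NBACKENDS 64

static uint64_t global_gen = 1;
static uint64_t backend_gen[NBACKENDS]; /* 0 = not inside a read section */

static void
reader_enter(int my_id)
{
    backend_gen[my_id] = global_gen;    /* publish the generation we saw */
    __sync_synchronize();               /* before touching shared nodes */
}

static void
reader_exit(int my_id)
{
    __sync_synchronize();               /* finish reads first */
    backend_gen[my_id] = 0;
}

/*
 * Garbage retired under generation 'retired_gen' may be recycled once no
 * backend is still reading inside a generation <= retired_gen.  A stalled
 * backend blocks all reclamation here; the idea above is to signal such
 * stragglers rather than wait indefinitely.
 */
static int
safe_to_reclaim(uint64_t retired_gen)
{
    int i;

    __sync_fetch_and_add(&global_gen, 1);   /* open a new generation */

    for (i = 0; i < NBACKENDS; i++)
    {
        uint64_t g = backend_gen[i];

        if (g != 0 && g <= retired_gen)
            return 0;
    }
    return 1;
}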
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]
I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.
[1]: https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
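The read side I have in mind is the standard seqlock pattern, roughly
like this sketch (bucket layout and sizes are assumptions, not taken
from any patch):
#include <stdint.h>

#define BUCKET_SLOTS 5

typedef struct
{
    uint32_t seq;                   /* even = stable, odd = writer active */
    uint32_t keys[BUCKET_SLOTS];
    uint64_t values[BUCKET_SLOTS];
} Bucket;

/* Returns 1 and sets *value if key was found in a consistent snapshot. */
static int
bucket_lookup(volatile Bucket *b, uint32_t key, uint64_t *value)
{
    for (;;)
    {
        uint32_t start = b->seq;
        uint64_t v = 0;
        int      i, found = 0;

        if (start & 1)
            continue;               /* writer in progress; retry */

        /* Read barrier: no instruction on x86, a real barrier elsewhere. */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);

        for (i = 0; i < BUCKET_SLOTS; i++)
        {
            if (b->keys[i] == key)
            {
                v = b->values[i];
                found = 1;
                break;
            }
        }

        /* Second read barrier before rechecking the sequence number. */
        __atomic_thread_fence(__ATOMIC_ACQUIRE);

        if (b->seq == start)        /* no writer touched the bucket */
        {
            *value = v;
            return found;
        }
        /* Raced with a writer; retry the whole bucket. */
    }
}
A writer would bump seq to an odd value, modify the bucket, then bump it
again, under whatever per-bucket write protection is chosen.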
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 15/10/2014 10:32 AM, Ants Aasma wrote:
On Tue, Oct 14, 2014 at 6:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
With regard for using a hash table for the buffer mapping lock I'm
doubtful that any form of separate chaining is the right one. We
currently have a quite noticeable problem with the number of cache
misses in the buffer mapping hash (and also the heavyweight lock mgr) -
if we stay with hashes that's only going to be fundamentally lower than
dynahash if we change the type of hashing. I've had good, *preliminary*,
results using an open addressing + linear probing approach.
I'm very skeptical of open addressing + linear probing. Such hash
tables tend to degrade over time, because you have to leave tombstones
behind to ensure that future scans advance far enough. There's really
no way to recover without rebuilding the whole hash table, and
eventually it degrades to linear search. If we're spending too much
time walking hash chains, I think the solution is to increase the
number of buckets so that the chains get shorter.
What about cuckoo hashing? There was a recent paper on how to do fine
grained locking with cuckoo hashes. [1]
I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.
[1] https://www.cs.princeton.edu/~mfreed/docs/cuckoo-eurosys14.pdf
Actually, I'd go with second-chance hashing [2], same number of hash
functions but it's more stable (no infinite loops, for example). Most
probably the techniques from [1] would apply equally well.
[2]: http://www.eecs.harvard.edu/~michaelm/postscripts/infocom_hardware_submit.pdf
Ryan
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.
Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build a even more
complex solution if not needed.
The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend need to clean up and sees that there's a
straggler with a old generation it sends that backend a signal to ensure
it sets the latest generation.
It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.
Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it. This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.
That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead. Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section. Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser. It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower. Or at least, I
think.
That having been said, I don't know what to do about the fact that the
fence is too expensive. I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure. But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again. Surely a pure fence shouldn't
cost more than a spinlock cycle? Even with arguably one extra cache
line touch, that seems like it ought to be a win. But my intuitions
in this area are shaky.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-15 13:41:25 -0400, Robert Haas wrote:
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend need to clean up and sees that there's a
straggler with a old generation it sends that backend a signal to ensure
it sets the latest generation.
It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.
I think it depends on what we're tying the generation increase to. We
very well could add an implicit generation increment to, say, lwlock
acquisition - which are full barriers anyway. And I don't think we'll
normally have that high a rate of garbage collection anyway - there
should be plenty of headroom.
Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it.
FWIW, I think many of the solutions that are actually used in practice
don't rely on that heavily. I know at least some that require memory to
be reserved beforehand, in special contexts.
This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.
I don't think something like generation numbers is really that new and
unproven - it's essentially a more trivial version of RCU. Which is used
quite heavily, and under different names.
That said, I'm far from convinced that they are the solution - they just
were the first thing that came to my mind thinking about the problem.
That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead.
Hm. Couldn't a similar scheme with several lists be used with generation
numbers?
Also, a generation number implies that we update the value
periodically, rather than before and after each critical section.
Hm. Don't think it necessarily has to do that.
Anything that forces garbage collection to be postponed longer than
absolutely necessary seems to me likely to be a loser. It might be a
little faster as long as we have free elements to allocate, but as
soon as we're out and have to wait longer than we otherwise would for
garbage collection, and all system activity halts as a result, even
for a few milliseconds, it's going to be a whole lot slower. Or at
least, I think.
I think it really depends on the user of the facility. If the generation
were e.g. also tied to lwlocks I couldn't see bufmgr.c outrunning that
realistically.
That having been said, I don't know what to do about the fact that the
fence is too expensive.
I'm far from sure that it's the problem at hand here - the reason I'm
wondering about the barriers is primarily that the buffer mapping hash
table is one of the top profile entries in all concurrent
workloads - often even the top one, after removing some locking
bottleneck. Nearly all of the time is spent there due to cache
misses. While I think we can get the table smaller and more efficient
for the same NBuffers value, realistically we need to cope with much
bigger NBuffers values.
Since cache misses are a problem that we're going to have to deal with,
restricting the processor's tools for efficiently dealing with that (out
of order execution, SMT) doesn't seem like a wise choice.
I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure.
Absolutely.
But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again.
Well, we don't do that for lookups right now. Just for
insertions/removals. Or are you talking about the buffer mapping lock?
Surely a pure fence shouldn't cost more than a spinlock cycle?
It really shouldn't - the difference right now is that there's something
else the processor can do while dealing with the cache miss.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain. It probably doesn't need the second mfence,
before clearing the hazard pointer, because x86 allows loads to be
reordered before stores but not stores before loads. We could
certainly try removing the second fence on x86 for benchmarking
purposes, and we could also check whether mfence outperforms lock;
addl in that scenario.
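For reference, the two x86 idioms being compared, written as gcc/clang inline
assembly (a sketch; in a real build these would come from the barrier/atomics
headers):

    /* A full memory barrier on x86-64, two ways. */
    static inline void
    full_barrier_mfence(void)
    {
        __asm__ __volatile__ ("mfence" ::: "memory");
    }

    static inline void
    full_barrier_locked_add(void)
    {
        /* lock-prefixed no-op add on the stack; whether this beats
         * mfence is exactly what would need benchmarking. */
        __asm__ __volatile__ ("lock; addl $0,(%%rsp)" ::: "memory", "cc");
    }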
Hm. So I thought about this for a while this morning after I wasn't able
to unprogram my hotel room's alarm clock that a previous occupant set
way too early. Given that premise, take the following with a grain of
salt.
The reason for needing an mfence is that a read/write barrier doesn't
guarantee that the store is visible to other processes - just in which
order they become visible. Right? And that's essentially because the
write might sit in the process's store buffer.
So, how about *not* using a full barrier on the reader side (i.e. the
side that does the hazard pointer writes), but instead doing something like
an atomic_fetch_add(hazard_ptr, 0) (i.e. lock; xaddl) on the side that
needs to scan the hazard pointers. That, combined with the write memory
barrier, should guarantee that the store buffers are emptied. Which is
pretty nice, because it moves the overhead to the rather infrequent
scanning of the hazard pointers - which needs to do lots of other atomic
ops anyway.
Sounds sane?
That's something that should best be simulated in an exhaustive x86
simulator, but it sounds sane - and it'd be quite awesome imo :)
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 15/10/2014 11:41 AM, Robert Haas wrote:
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-14 17:53:10 -0400, Robert Haas wrote:
On Tue, Oct 14, 2014 at 4:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
The code in CHashSearch shows the problem there: you need to STORE the
hazard pointer before you begin to do the LOAD operations to scan the
bucket, and you must finish all of those LOADs before you STORE the
NULL hazard pointer. A read or write barrier won't provide store/load
or load/store ordering.
I'm not sure I understand the problem here - but a read + write barrier
should suffice? To avoid falling back to two full barriers, we'd need a
separate define pg_read_write_barrier(), but that's not a problem. IIUC
that should allow us to avoid emitting any actual code on x86.
Well, thinking about x86 specifically, it definitely needs at least
one mfence, after setting the hazard pointer and before beginning to
read the bucket chain.
Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.
The solution I'm thinking of is to essentially move away from hazard
pointers and store something like a generation counter per
backend. Which is updated less often, and in combination with other
operations. When a backend needs to clean up and sees that there's a
straggler with an old generation it sends that backend a signal to ensure
it sets the latest generation.
It's possible that might work ... but on the timescale we're talking
about here, that's asking the garbage collecting process to wait for
practically geologic time.
Back when I first wrote this code, I spent a fair amount of time
looking at existing work in the area of lock-free hash tables.
Essentially all of the work I was able to find on this topic assumes a
threaded model - or more precisely, it assumes that shared memory
allocation is cheap and easy and you'll have no problem getting as
much of it as you need whenever you need it. This assumption often
isn't even spelled out explicitly: it's just assumed that that's the
programming environment you're working in. Finding highly parallel
algorithms that don't rely on memory allocation as a primitive is
hard. Hazard pointers are one of the few techniques I found that
seems like it can work in our architecture. I'm quite reluctant to
leave aside techniques that have been proven to work well in favor of
inventing something entirely novel to PostgreSQL.
That having been said, there is some literature on generation numbers,
and I think something like that could be made to work. It does have
some significant disadvantages, though. One is that a single process
which fails to update its generation number prevents all reclamation,
system-wide. In my algorithm, a process whose execution is
suspended while a hazard pointer is set prevents recycling of only one
of many garbage lists. A process searching for a reusable element can
most likely find some other garbage list to reclaim instead. Also,
a generation number implies that we update the value periodically,
rather than before and after each critical section. Anything that
forces garbage collection to be postponed longer than absolutely
necessary seems to me likely to be a loser. It might be a little
faster as long as we have free elements to allocate, but as soon as
we're out and have to wait longer than we otherwise would for garbage
collection, and all system activity halts as a result, even for a few
milliseconds, it's going to be a whole lot slower. Or at least, I
think.
That having been said, I don't know what to do about the fact that the
fence is too expensive. I don't know that we've really established
that that's the true root of the problem rather than some other
pedestrian optimization failure. But the existing code is using an
atomic operation to acquire a spinlock, then releasing it, walking the
bucket chain, and then using another atomic operation to acquire a
spinlock and then releasing it again. Surely a pure fence shouldn't
cost more than a spinlock cycle? Even with arguably one extra cache
line touch, that seems like it ought to be a win. But my intuitions
in this area are shaky.
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems
like an ideal fit...
In brief, RCU has the following requirements:
* Read-heavy access pattern
* Writers must be able to make dead objects unreachable to new readers
(easily done for most data structures)
* Writers must be able to mark dead objects in such a way that
existing readers know to ignore their contents but can still
traverse the data structure properly (usually straightforward)
* Readers must occasionally inform the system that they are not
currently using any RCU-protected pointers (to allow resource
reclamation)
[1]: http://www.rdrop.com/~paulmck/RCU/
If the above are satisfied, then:
* Readers need no synchronization at all (not even compiler fences)
* Writers can use the synchronization mechanism of their choice to
coordinate with each other (lwlock, atomic CAS, etc.)
* Object reclamation can be performed in the background, or
synchronously (and incrementally, if desired)
I've had very good results implementing user-level RCU for my own
database projects. It slashes the complexity of reasoning about
concurrent reader accesses. Implemented properly, it requires only a bit
of shared memory, has extremely low overhead in the common case of no
stragglers, and requires only minimal communication except when
stragglers are present (and even then mostly between stragglers). It's
reasonably simple to implement in a userspace DBMS context (< 1kLoC with
comments, assertions, etc.). Just don't forget to quiesce all in-flight
back ends at regular intervals, or the system won't be able to reclaim
anything...
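As a rough illustration of the reader/reclaimer split, a quiescent-state-style
sketch in C11 (hypothetical names and a fixed-size reader table; a real
implementation needs registration, straggler handling, chained epochs, and so
on):

    #include <stdatomic.h>

    #define MAX_READERS 64

    /* Global reclamation epoch plus each reader's last-observed epoch.
     * A slot of 0 means that reader is idle (outside any read section). */
    static atomic_ulong global_epoch = 1;
    static atomic_ulong reader_epoch[MAX_READERS];

    /* Reader side: report a quiescent point.  Meant to be called from
     * places that are synchronization points anyway (around blocking
     * syscalls, lwlock acquisition), so normally just a load and a store. */
    void
    rcu_quiescent(int my_id)
    {
        unsigned long e = atomic_load_explicit(&global_epoch,
                                               memory_order_acquire);

        atomic_store_explicit(&reader_epoch[my_id], e, memory_order_release);
    }

    void
    rcu_go_idle(int my_id)
    {
        atomic_store_explicit(&reader_epoch[my_id], 0, memory_order_release);
    }

    /* Reclaimer side: advance the epoch, then wait until every reader has
     * either gone idle or observed the new epoch.  Anything unlinked
     * before the epoch was bumped is then safe to free. */
    void
    rcu_synchronize(int nreaders)
    {
        unsigned long new_epoch =
            atomic_fetch_add_explicit(&global_epoch, 1,
                                      memory_order_acq_rel) + 1;

        for (int i = 0; i < nreaders; i++)
        {
            unsigned long e;

            do
            {
                e = atomic_load_explicit(&reader_epoch[i],
                                         memory_order_acquire);
            } while (e != 0 && e < new_epoch);  /* straggler: we wait */
        }
    }

A backend that never reaches a quiescent point stalls rcu_synchronize()
indefinitely, which is the straggler concern raised elsewhere in this thread.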
Thoughts?
Ryan
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...
In brief, RCU has the following requirements:
Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)
Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
All of the many techniques that have been developed in this area are
merely minor variations on a very old theme: set some kind of flag
variable in shared memory to let people know that you are reading a
shared data structure, and clear it when you are done. Then, other
people can figure out when it's safe to recycle memory that was
previously part of that data structure. In Linux's RCU, the flag
variable is "whether the process is currently scheduled on a CPU",
which is obviously not workable from user-space. Lacking that, you
need an explicit flag variable, which means you need memory barriers,
since the protected operation is a load and the flag variable is
updated via a store. You can try to avoid some of the overhead by
updating the flag variable less often (say, when a signal arrives) or
you can make it more fine-grained (in my case, we only prevent reclaim
of a fraction of the data structure at a time, rather than all of it)
or various other variants, but unfortunately none of this is so simple
as "apply technique X and your problem just goes away".
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...
In brief, RCU has the following requirements:
Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)
Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
Well, there's the "quiescent-state-based RCU" - that's actually
something that could reasonably be used inside postgres. Put something
around socket reads (syscalls are synchronization points) and non-nested
lwlock acquires. That'd mean it's nearly no additional overhead.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...
In brief, RCU has the following requirements:
Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)
Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
Well, there's the "quiescent-state-based RCU" - that's actually
something that could reasonably be used inside postgres. Put something
around socket reads (syscalls are synchronization points) and non-nested
lwlock acquires. That'd mean it's nearly no additional overhead.
Sure, so, you reuse your existing barriers instead of adding new ones.
Making it work sounds like a lot of work for an uncertain benefit
though.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 16/10/2014 7:19 AM, Robert Haas wrote:
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...
In brief, RCU has the following requirements:
Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)
Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
All of the many techniques that have been developed in this area are
merely minor variations on a very old theme: set some kind of flag
variable in shared memory to let people know that you are reading a
shared data structure, and clear it when you are done. Then, other
people can figure out when it's safe to recycle memory that was
previously part of that data structure.
Sure, but RCU has the key benefit of decoupling its machinery (esp. that
flag update) from the actual critical section(s) it protects. In a DBMS
setting, for example, once per transaction or SQL statement would do
just fine. The notification can be much better than a simple flag---you
want to know whether the thread has ever quiesced since the last reclaim
cycle began, not whether it is currently quiesced (which it usually
isn't). In the implementation I use, a busy thread (e.g. not about to go
idle) can "chain" its RCU "transactions." In the common case, a chained
quiesce call comes when the RCU epoch is not trying to change, and the
"flag update" degenerates to a simple load. Further, the only time it's
critical to have that memory barrier is if the quiescing thread is about
to go idle. Otherwise, missing a flag just imposes a small delay on
resource reclamation (and that's assuming the flag in question even
belonged to a straggler process). How you implement epoch management,
especially the handling of stragglers, is the deciding factor in whether
RCU works well. The early URCU techniques were pretty terrible, and
maybe general-purpose URCU is doomed to stay that way, but in a DBMS
core it can be done very cleanly and efficiently because we can easily
add the quiescent points at appropriate locations in the code.
In Linux's RCU, the flag
variable is "whether the process is currently scheduled on a CPU",
which is obviously not workable from user-space. Lacking that, you
need an explicit flag variable, which means you need memory barriers,
since the protected operation is a load and the flag variable is
updated via a store. You can try to avoid some of the overhead by
updating the flag variable less often (say, when a signal arrives) or
you can make it more fine-grained (in my case, we only prevent reclaim
of a fraction of the data structure at a time, rather than all of it)
or various other variants, but none of this is unfortunately so simple
as "apply technique X and your problem just goes away".
Magic wand, no (does nothing for update contention, for example, and
requires some care to apply). But from a practical perspective RCU,
properly implemented, does make an awful lot of problems an awful lot
simpler to tackle. Especially for the readers.
Ryan
On 16/10/2014 7:59 AM, Robert Haas wrote:
On Thu, Oct 16, 2014 at 9:30 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-10-16 09:19:16 -0400, Robert Haas wrote:
On Thu, Oct 16, 2014 at 8:03 AM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
Why not use an RCU mechanism [1] and ditch the hazard pointers? Seems like
an ideal fit...
In brief, RCU has the following requirements:
Read-heavy access pattern
Writers must be able to make dead objects unreachable to new readers (easily
done for most data structures)
Writers must be able to mark dead objects in such a way that existing
readers know to ignore their contents but can still traverse the data
structure properly (usually straightforward)
Readers must occasionally inform the system that they are not currently
using any RCU-protected pointers (to allow resource reclamation)
Have a look at http://lwn.net/Articles/573424/ and specifically the
"URCU overview" section. Basically, that last requirement - that
readers inform the system that they are not currently using any
RCU-protected pointers - turns out to require either memory barriers
or signals.
Well, there's the "quiescent-state-based RCU" - that's actually
something that could reasonably be used inside postgres. Put something
around socket reads (syscalls are synchronization points) and non-nested
lwlock acquires. That'd mean it's nearly no additional overhead.
Sure, so, you reuse your existing barriers instead of adding new ones.
Making it work sounds like a lot of work for an uncertain benefit
though.
Perhaps RCU is too much work and/or too little benefit to justify
replacing the current latch-based code. I personally am not convinced,
but have no hard data to offer other than to point at the rapid (even
accelerating) uptake of RCU throughout the Linux kernel (cf. Fig. 1 of
http://www2.rdrop.com/users/paulmck/techreports/RCUUsage.2013.02.24a.pdf).
However---
If we are convinced the latch-based code needs to go and the question is
whether to replace it with RCU or hazard pointers, then RCU wins
hands-down IMO. It's comparable work to implement, easier to reason
about (RCU read protocol is significantly simpler), and a clearer
performance benefit (hazard pointers require more barriers, more atomic
ops, more validating, and more pointer chasing than RCU). The only
metric where RCU loses to hazard pointers is if you have really tight
timing constraints on resource reclamation. Hazard pointers give a tight
bound on the number of dead objects that cannot be reclaimed at any
given moment, while RCU does not. I've not heard that this is a major
concern, though, and in practice it doesn't seem to be a problem unless
you forget to annotate an important quiescent point (like a blocking
syscall).
Ryan
On Oct 15, 2014 7:32 PM, "Ants Aasma" <ants@cybertec.at> wrote:
I'm imagining a bucketized cuckoo hash with 5 item buckets (5-way
associativity). This allows us to fit the bucket onto 2 regular sized
cache lines and have 8 bytes left over. Buckets would be protected by
seqlocks stored in the extra space. On the read side we would only
need 2 read barriers (basically free on x86), and we are guaranteed to
have an answer by fetching two pairs of cache lines. We can trade
memory bandwidth for latency by issuing prefetches (once we add
primitives for this). Alternatively we can trade insert speed for
lookup speed by using asymmetrically sized tables.
I was thinking about this. It came to me that we might not even need
BufferTags to be present in the hash table. In the hash table itself
we store just a combination of a 4 byte buffer tag hash and the buffer id. This
way we can store 8 pairs in one cacheline. Lookup must account for the
fact that due to hash collisions we may find multiple matches, one of
which may be the buffer we are looking for. We can make that condition
exceedingly unlikely by taking advantage of the fact that we have two
hashes anyway: in each table we use the lookup hash of the other table
for tagging (a 32-bit hash collision within 8 items). By having a
reasonably high load factor (75% should work) and 8 bytes per key we
even have a pretty good chance of fitting the hotter parts of the
buffer lookup table in CPU caches.
We use a separate array for the locks (spinlocks should be ok here)
for fine grained locking, this may be 1:1 with the buckets or we can
map many buckets to a single lock. Otherwise the scheme should work
mostly the same. Using this scheme we can get by on the lookup side
using 64 bit atomic reads with no barriers, we only need atomic ops
for pinning the found buffer.
I haven't had the chance to check out how second-chance hashing works
and if this scheme would be applicable to it.
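A sketch of what the read side of such a seqlock-protected bucket might look
like in C11, assuming 8 packed 64-bit (tag hash, buffer id) entries per bucket
(hypothetical layout, not tested code):

    #include <stdatomic.h>
    #include <stdint.h>

    #define BUCKET_SLOTS 8

    /* The sequence counter is odd while a writer is modifying the bucket;
     * writers bump it before and after each update. */
    typedef struct
    {
        atomic_uint         seq;
        _Atomic(uint64_t)   entry[BUCKET_SLOTS];    /* tag hash << 32 | buffer id */
    } Bucket;

    /* Returns candidate buffer ids whose tag hash matched; the caller
     * still verifies the full BufferTag for each candidate. */
    int
    bucket_lookup(Bucket *b, uint32_t tag_hash, uint32_t candidates[BUCKET_SLOTS])
    {
        unsigned    seq_before, seq_after;
        int         nmatch;

        do
        {
            do
            {
                seq_before = atomic_load_explicit(&b->seq, memory_order_acquire);
            } while (seq_before & 1);       /* writer in progress, retry */

            nmatch = 0;
            for (int i = 0; i < BUCKET_SLOTS; i++)
            {
                uint64_t e = atomic_load_explicit(&b->entry[i],
                                                  memory_order_relaxed);

                if ((uint32_t) (e >> 32) == tag_hash)
                    candidates[nmatch++] = (uint32_t) e;
            }

            atomic_thread_fence(memory_order_acquire);
            seq_after = atomic_load_explicit(&b->seq, memory_order_relaxed);
        } while (seq_before != seq_after);  /* bucket changed underneath us */

        return nmatch;
    }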
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On Thu, Oct 16, 2014 at 12:26 PM, Ryan Johnson
<ryan.johnson@cs.utoronto.ca> wrote:
The only metric where RCU loses to
hazard pointers is if you have really tight timing constraints on resource
reclamation.
I think we do have that problem. It's certainly completely
unacceptable for a query to prevent buffer reclaim on any significant
number of buffers even until the end of the query, let alone the end
of the transaction.
But, hey, if somebody wants to try writing a patch different than the
one that I wrote and see whether it works better than mine, I'm
totally cool with that. This is something I came up with, and we're
here to evaluate whether it works better than any other option that we
have now or that someone wants to develop. I'm not saying this is the
best solution; it's just something I've got that seems to basically
work.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, Oct 15, 2014 at 2:03 AM, Andres Freund <andres@2ndquadrant.com> wrote:
Yes, I can see that now. I do wonder if that's not going to prove quite
expensive... I think I can see some ways around needing that. But I
think that needs some benchmarking first - no need to build an even more
complex solution if not needed.
I did a bit of testing on my MacBook Pro last night. Unfortunately, I
don't have access to a large x86 machine at the moment.[1] I tried four
configurations:
(1) master
(2) master with chash patch
(3) master with chash patch, pg_memory_barrier() changed from lock
addl to mfence
(4) same as (3), plus barrier at the end of CHashSearch changed to a
compiler barrier (which should be safe on x86)
I tested pgbench -S, scale factor 100, shared_buffers 400MB. 3 trials
per configuration; runs were interleaved. Percentages indicate the
average difference between that line and master.
At 1 client:
(1) 11689.388986 11426.653176 11297.775005
(2) 11618.306822 11331.805396 11273.272945 -0.55%
(3) 11813.664402 11300.082928 11231.265030 -0.20%
(4) 11428.097384 11266.979165 11294.422376 -1.23%
At 16 clients:
(1) 56716.465893 56992.590587 56755.298362
(2) 57126.787712 56848.527712 57351.559138 +0.51%
(3) 56779.624757 57036.610925 56878.036445 +0.13%
(4) 56818.522369 56885.544509 56977.810413 +0.13%
I think this tends to confirm that the patch is a small loss (for
unknown reasons) at 1 client, but that it picks up speed with more
contention, even on a machine with only 4 cores. But if there's an
important difference between the different fencing techniques, it
doesn't show up in this test.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
[1]: Yes, this is a hint.
Hi,
On 2014-10-14 09:30:58 -0400, Robert Haas wrote:
A few years ago I started working on a concurrent hash table for
PostgreSQL. The hash table part of it worked, but I never did
anything with it, really. Amit mentioned to me earlier this week that
he was seeing contention inside the dynahash machinery, which inspired
me to go back and update the patch. I took the basic infrastructure
from before and used it to replace the buffer table. Patch is
attached.
So. I ran some quick tests on a larger x86 machine. 4xE5-4620, 256GB RAM.
The hotel wifi is too flaky to get detailed results, but some tasty
bits.
The test I used is read-only pgbench on a scale 5000 DB - about 73GB. I'm
benchmarking chash on top of my LW_SHARED and StrategyGetBuffer()
optimizations because otherwise the bottleneck is completely elsewhere.
When using shared_buffers = 96GB there's a performance benefit, but not
huge:
master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
But with shared_buffers = 16GB:
master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
All the numbers here are -P5 output taken once it looks like it has stabilized -
it takes a couple minutes to get to that point so pgbench runs would
have to be really long to not be skewed by the startup. I don't think my
manual selection of measurements matters much at this scale of
differences ;)
I had to play with setting max_connections+1 sometimes to get halfway
comparable results for master - unaligned data otherwise causes weird
results. Without doing that the performance gap between master
96/16G was even bigger. We really need to fix that...
This is pretty awesome.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
When using shared_buffers = 96GB there's a performance benefit, but not
huge:
master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
But with shared_buffers = 16GB:
master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
Very interesting. This doesn't show that chash is the right solution,
but it definitely shows that doing nothing is the wrong solution. It
shows that, even with the recent bump to 128 lock manager partitions,
and LW_SHARED on top of that, workloads that actually update the
buffer mapping table still produce a lot of contention there. This
hasn't been obvious to me from profiling, but the numbers above make
it pretty clear.
It also seems to suggest that trying to get rid of the memory barriers
isn't a very useful optimization project. We might get a couple of
percent out of it, but it's pretty small potatoes, so unless it can be
done more easily than I suspect, it's probably not worth bothering
with. An approach I think might have more promise is to have bufmgr.c
call the CHash stuff directly instead of going through buf_table.c.
Right now, for example, BufferAlloc() creates and initializes a
BufferTag and passes a pointer to that buffer tag to BufTableLookup,
which copies it into a BufferLookupEnt. But it would be just as easy
for BufferAlloc() to put the BufferLookupEnt on its own stack, and
then you wouldn't need to copy the data an extra time. Now a 20-byte
copy isn't a lot, but it's completely unnecessary and looks easy to
get rid of.
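Roughly the shape of that change, with stand-in struct definitions and a stub
search function rather than the real buf_internals.h/buf_table.c code:

    #include <string.h>

    /* Stand-ins for the real BufferTag / BufferLookupEnt definitions. */
    typedef struct HypoBufferTag
    {
        unsigned    spcNode, dbNode, relNode;
        int         forkNum;
        unsigned    blockNum;
    } HypoBufferTag;

    typedef struct HypoBufferLookupEnt
    {
        HypoBufferTag   key;
        int             id;
    } HypoBufferLookupEnt;

    /* Stub standing in for the CHash search; fills in ent->id on success. */
    static int
    hypo_chash_search(HypoBufferLookupEnt *ent)
    {
        (void) ent;
        return 0;
    }

    /* Today's shape: build a BufferTag, then the wrapper copies it into a
     * BufferLookupEnt before searching. */
    int
    lookup_via_wrapper(const HypoBufferTag *tag)
    {
        HypoBufferLookupEnt ent;

        memcpy(&ent.key, tag, sizeof(HypoBufferTag));   /* the extra copy */
        if (!hypo_chash_search(&ent))
            return -1;
        return ent.id;
    }

    /* Proposed shape: build the BufferLookupEnt directly on the caller's
     * stack and hand it straight to the hash table - no intermediate copy. */
    int
    lookup_direct(unsigned spcNode, unsigned dbNode, unsigned relNode,
                  int forkNum, unsigned blockNum)
    {
        HypoBufferLookupEnt ent;

        ent.key.spcNode = spcNode;
        ent.key.dbNode = dbNode;
        ent.key.relNode = relNode;
        ent.key.forkNum = forkNum;
        ent.key.blockNum = blockNum;
        if (!hypo_chash_search(&ent))
            return -1;
        return ent.id;
    }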
I had to play with setting max_connections+1 sometimes to get halfway
comparable results for master - unaligned data otherwise causes weird
results. Without doing that the performance gap between master
96/16G was even bigger. We really need to fix that...
This is pretty awesome.
Thanks. I wasn't quite sure how to test this or where the workloads
that it would benefit would be found, so I appreciate you putting time
into it. And I'm really glad to hear that it delivers good results.
I think it would be useful to plumb the chash statistics into the
stats collector or at least a debugging dump of some kind for testing.
They include a number of useful contention measures, and I'd be
interested to see what those look like on this workload. (If we're
really desperate for every last ounce of performance, we could also
disable those statistics in production builds. That's probably worth
testing at least once to see if it matters much, but I kind of hope it
doesn't.)
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-10-16 20:22:24 -0400, Robert Haas wrote:
On Thu, Oct 16, 2014 at 6:53 PM, Andres Freund <andres@2ndquadrant.com> wrote:
When using shared_buffers = 96GB there's a performance benefit, but not
huge:
master (f630b0dd5ea2de52972d456f5978a012436115e): 153621.8
master + LW_SHARED + lockless StrategyGetBuffer(): 477118.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 496788.6
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 499562.7
But with shared_buffers = 16GB:
master (f630b0dd5ea2de52972d456f5978a012436115e): 177302.9
master + LW_SHARED + lockless StrategyGetBuffer(): 206172.4
master + LW_SHARED + lockless StrategyGetBuffer() + chash: 413344.1
master + LW_SHARED + lockless StrategyGetBuffer() + chash-nomb: 426405.8
Very interesting. This doesn't show that chash is the right solution,
but it definitely shows that doing nothing is the wrong solution.
Absolutely.
The thing worrying me most (but not all that much on an absolute scale)
about chash is that lots of the solutions to memory management it builds
are specific to it... And generalizing afterwards will be hard because
we'll have to prove that that general solution is as performant as the
special case code...
It shows that, even with the recent bump to 128 lock manager
partitions, and LW_SHARED on top of that, workloads that actually
update the buffer mapping table still produce a lot of contention
there.
FWIW, I couldn't see much of a benefit without LW_SHARED even though I tried
a *few* things. The bottleneck simply is completely elsewhere.
This hasn't been obvious to me from profiling, but the numbers
above make it pretty clear.
So I'm not super surprised about that. There very well might be cases
where it's a bad bottleneck before, but I've not seen them.
It also seems to suggest that trying to get rid of the memory barriers
isn't a very useful optimization project.
I'm not yet convinced of that. Yes, it's not showing up hugely in the
numbers here, but that's simply because the workload is again completely
dominated by the kernel copying data for the read()s, GetSnapshotData(),
the buffer mapping cache misses and a few other familiar friends.
We might get a couple of
percent out of it, but it's pretty small potatoes, so unless it can be
done more easily than I suspect, it's probably not worth bothering
with.
I still think that moving the barrier to the reading side would be
simple (implementation wise) and correct for x86. If you think about it,
otherwise our spinlock implementation for x86 would be broken. As an
unlock is just a compiler barrier, the full barrier on acquire had better be
a full synchronization point. Am I missing something?
An approach I think might have more promise is to have bufmgr.c
call the CHash stuff directly instead of going through buf_table.c.
I don't see all that much point in buf_table.c currently - on the other
hand it has made replacing the buffer mapping table slightly
easier... Maybe it should just go into a header...
Right now, for example, BufferAlloc() creates and initializes a
BufferTag and passes a pointer to that buffer tag to BufTableLookup,
which copies it into a BufferLookupEnt. But it would be just as easy
for BufferAlloc() to put the BufferLookupEnt on its own stack, and
then you wouldn't need to copy the data an extra time. Now a 20-byte
copy isn't a lot, but it's completely unnecessary and looks easy to
get rid of.
Worthwhile to try.
I had to play with setting max_connections+1 sometimes to get halfway
comparable results for master - unaligned data otherwise causes weird
results. Without doing that the performance gap between master
96/16G was even bigger. We really need to fix that...
This is pretty awesome.
Thanks. I wasn't quite sure how to test this or where the workloads
that it would benefit would be found, so I appreciate you putting time
into it. And I'm really glad to hear that it delivers good results.
I wasn't sure either ;). Lucky that it played out so impressively. After
the first results I was nearly ready to send out a 'Meh.' type of
message ;)
Btw, CHashTableGetNode isn't exactly cheap. It shows up noticeably in
profiles. It results in several non-pipelineable instructions in a
already penalized section of the code... Replacing arena_stride by a
constant provided measurable improvements (no imul is required anymore,
instead you get shr;lea or so). I'm not sure how to deal with that. If
it shows up even on my quite new laptop, it'll be more so noticeable on
older x86 platforms.
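To illustrate the point (hypothetical names): with a run-time stride the node
address needs a multiply, while a compile-time constant stride lets the
compiler lower it to shift/lea arithmetic:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct
    {
        char       *arena;
        size_t      arena_stride;   /* element size chosen at runtime */
    } HypoTable;

    /* Dynamic stride: the compiler has to emit an imul here. */
    static inline void *
    get_node_dynamic(HypoTable *t, uint32_t idx)
    {
        return t->arena + (size_t) idx * t->arena_stride;
    }

    /* Constant stride: the multiply can be strength-reduced to lea/shift
     * sequences at compile time. */
    #define ARENA_STRIDE 40             /* hypothetical fixed element size */

    static inline void *
    get_node_constant(char *arena, uint32_t idx)
    {
        return arena + (size_t) idx * ARENA_STRIDE;
    }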
I think it would be useful to plumb the chash statistics into the
stats collector or at least a debugging dump of some kind for testing.
I'm not sure it's solid enough at this point to be in the stats
collector. But I surely would like to access it somehow. I'm
e.g. absolutely not sure that your loadfactor is "good" and it'd be much
easier if those stats were visible. I'm not really sure how to make
them visible though.
They include a number of useful contention measures, and I'd be
interested to see what those look like on this workload. (If we're
really desperate for every last ounce of performance, we could also
disable those statistics in production builds. That's probably worth
testing at least once to see if it matters much, but I kind of hope it
doesn't.)
I can't retest on bigger HW right now, but IIRC they didn't show up in
the profile there. It's not visible on my laptop.
Profilewise it'd be helpful if BucketScan would be inlined. Right now
it's hard to see the difference in cost between search/insert/delete and
I think that'd be worth the cost of duplicated instructions...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,
git branch also available at:
http://git.postgresql.org/gitweb/?p=users/rhaas/postgres.git;a=shortlog;h=refs/heads/chash2014
A minor review of this:
* Should be rebased ontop of the atomics API
* In benchmarks it becomes apparent that the dynamic element width makes
some macros like CHashTableGetNode() and
CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
can't figure how to build an efficient expression for the
target. There's two ways to address this:
a) Actually morph chash into something that will be included into the
user sites. Since I think there'll only be a limited number of those
that's probably acceptable.
b) Simplify the access rules. I quite seriously doubt that the
interleaving of garbage and freelists is actually necessary/helpful.
* There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without an inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?
I'm afraid that I can't see us going forward unless we address the
noticeable single threaded penalties. Do you see that differently?
* There's some whitespace damage. (space followed by tab, followed by
actual contents)
* I still think it's a good idea to move the full memory barriers into
the cleanup/writing process by doing write memory barriers when
acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
in the process testing for the hazard pointer.
* Shouldn't we relax the CPU in a couple places like CHashAllocate(),
CHashBucketScan()'s loops?
* I don't understand right now (but then I'm in a hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies on the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?
* We really should find a way to sensibly print the stats.
Greetings,
Andres Freund
On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
* In benchmarks it becomes apparent that the dynamic element width makes
some macros like CHashTableGetNode() and
CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
can't figure how to build an efficient expression for the
target. There's two ways to address this:
a) Actually morph chash into something that will be included into the
user sites. Since I think there'll only be a limited number of those
that's probably acceptable.
Have you benchmarked this? How much does it help?
b) Simplify the access rules. I quite seriously doubt that the
interleaving of garbage and freelists is actually necessary/helpful.
That wasn't added for nothing. I did a bunch of performance testing
on this vs. dynahash (I think the test code is included in the patch)
and the region of memory containing the freelists practically caught
fire. Spreading them out like this greatly improved the performance
under heavy concurrency, but even with those changes the freelists
were extremely hot. Now, since this is the buffer mapping table
rather than a tight loop around hash table manipulation, the
difference may not be enough to measure. But on a microbenchmark it
*definitely* was.
* There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without an inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?
It's not obvious to me how that would work. If you just throw those
elements on the garbage lists with everything else, it will soon be
the case that one bucket header ends up using the cell stored in some
other bucket, which isn't going to be awesome. So you need some kind
of more complex scheme to preserve locality.
I'm afraid that I can't see us going forward unless we address the
noticeable single threaded penalties. Do you see that differently?
I'm not sure. We're talking about something on the order of half a
percent on my tests, and lots of changes cause things to bounce around
by that much. I'm not sure it makes sense to get too worked up about
that if the alternative is to add a big pile of new complexity.
* There's some whitespace damage. (space followed by tab, followed by
actual contents)
That's the least of our problems at this point.
* I still think it's a good idea to move the full memory barriers into
the cleanup/writing process by doing write memory barriers when
acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
in the process testing for the hazard pointer.
Can you cite any existing precedent in other system software? Does
Linux do anything like that, for example?
* Shouldn't we relax the CPU in a couple places like CHashAllocate(),
CHashBucketScan()'s loops?
You mean like, have it sleep the way SpinLockAcquire() does? Yeah,
possibly, but that adds non-trivial code complexity which may not be
worth it if - as is hoped for - there's no real contention there.
* I don't understand right now (but then I'm in a hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies on the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?
I think the worst case scenario is that we falsely conclude that
there's a hash match when there really isn't. The ensuing memcmp()
will clarify matters.
* We really should find a way to sensibly print the stats.
Yep.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2014-11-07 11:08:57 -0500, Robert Haas wrote:
On Wed, Nov 5, 2014 at 6:19 PM, Andres Freund <andres@2ndquadrant.com> wrote:
* In benchmarks it becomes apparent that the dynamic element width makes
some macros like CHashTableGetNode() and
CHashTableGetGarbageByBucket() quite expensive. At least gcc on x86
can't figure how to build an efficient expression for the
target. There's two ways to address this:
a) Actually morph chash into something that will be included into the
user sites. Since I think there'll only be a limited number of those
that's probably acceptable.
Have you benchmarked this? How much does it help?
I've done quick benchmarks, and it helped in the 2-3% range in some
tests, and was a wash in others. What I did wasn't actually including it
in buf_table.c, but hardcoding the size in chash. That's obviously not
the real solution.
b) Simplify the access rules. I quite seriously doubt that the
interleaving of garbage and freelists is actually necessary/helpful.
That wasn't added for nothing. I did a bunch of performance testing
on this vs. dynahash (I think the test code is included in the patch)
and the region of memory containing the freelists practically caught
fire. Spreading them out like this greatly improved the performance
under heavy concurrency, but even with those changes the freelists
were extremely hot. Now, since this is the buffer mapping table
rather than a tight loop around hash table manipulation, the
difference may not be enough to measure. But on a microbenchmark it
*definitely* was.
I think it'd be much less visible for the buffer manager, but it might
still be visible...
* There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without an inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?
It's not obvious to me how that would work. If you just throw those
elements on the garbage lists with everything else, it will soon be
the case that one bucket header ends up using the cell stored in some
other bucket, which isn't going to be awesome. So you need some kind
of more complex scheme to preserve locality.
Treating the element inside the bucket as a kind of one element freelist
seems quite workable to me. Test whether it's marked deleted, scan the
hazard pointer list to see whether it can be used. If yes, grand. If
not, go to the current code. The fact that the element is cacheline
local will make the test for its deletedness almost free. Yes, the
hazard pointer scan surely ain't free - but at least for cases like
bufmgr where reads are never less frequent than writes and very often
*much* more frequent I'm pretty sure it'd be a noticeable win.
I'm afraid that I can't see us going forward unless we address the
noticeable single threaded penalties. Do you see that differently?
I'm not sure. We're talking about something on the order of half a
percent on my tests, and lots of changes cause things to bounce around
by that much. I'm not sure it makes sense to get too worked up about
that if the alternative is to add a big pile of new complexity.
I saw things in the range of 3-4% on my laptop.
* There's some whitespace damage. (space followed by tab, followed by
actual contents)
That's the least of our problems at this point.
Sure, but still worth cleaning up ;)
* I still think it's a good idea to move the full memory barriers into
the cleanup/writing process by doing write memory barriers when
acquiring a hazard pointer and RMW atomic operations (e.g. atomic add)
in the process testing for the hazard pointer.
Can you cite any existing precedent in other system software? Does
Linux do anything like that, for example?
No, I can't right now. I think it follows from the cache coherency
rules, but I can understand suspicion there.
* Shouldn't we relax the CPU in a couple places like CHashAllocate(),
CHashBucketScan()'s loops?
You mean like, have it sleep the way SpinLockAcquire() does?
Not actually like in s_lock(), but rather like the SPIN_DELAY() used in
the spinlock code for retries.
Yeah, possibly, but that adds non-trivial code complexity which may
not be worth it if - as is hoped for - there's no real contention
there.
I think just adding a pg_spin_delay() before retrying should be trivial.
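Something along these lines, sketched with C11 atomics and a stand-in for
pg_spin_delay() (the x86 arm uses the usual rep; nop / PAUSE idiom):

    #include <stdatomic.h>

    /* Stand-in for pg_spin_delay()/SPIN_DELAY(); a no-op elsewhere. */
    static inline void
    spin_delay(void)
    {
    #if defined(__x86_64__) || defined(__i386__)
        __asm__ __volatile__ (" rep; nop \n");
    #endif
    }

    /* A CAS retry loop that relaxes the CPU between attempts. */
    void
    atomic_add_with_delay(atomic_uint *counter, unsigned delta)
    {
        unsigned    old = atomic_load_explicit(counter, memory_order_relaxed);

        while (!atomic_compare_exchange_weak_explicit(counter, &old,
                                                      old + delta,
                                                      memory_order_acq_rel,
                                                      memory_order_relaxed))
            spin_delay();       /* hint to the CPU that we are spinning */
    }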
* I don't understand right now (but then I'm in a hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies on the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?
I think the worst case scenario is that we falsely conclude that
there's a hash match when there really isn't. The ensuing memcmp()
will clarify matters.
Hm. Let me think about it for a while.
I was thinking that a spurious cmp < 0 could cause us to miss a
match - but that could only happen if the match just has been removed
from the list. Which is fine. Perhaps that case deserves to be mentioned
in the comment?
* Another thing I'm wondering about here is whether it's possible that
somebody is currently walking an "older version" of the bucket list
(i.e. is currently at an already unlinked element) and then visits a
newly inserted element with an 'apparently' out of order hash
value. That seems possible because the inserter might not have seen the
unlinked element. Afaics that's not a problem for searches - but I'm not
sure whether it couldn't cause a problem for concurrent inserts and
deletes.
* One thing I noticed while looking at that part of code is:
+ h = target_node->un.hashcode;
+ if (h == hashcode)
+     cmp = memcmp(CHashNodeGetItem(target_node), key,
+                  table->desc.key_size);
+ else if (h > hashcode)
+     cmp = 1;
+ else
+     cmp = -1;
Compilers can feel entirely free to reload local variables from memory
(i.e. use target_node->un.hashcode instead of h) in situations like
these. That's why I used the ACCESS_ONCE macros in my freelist
patch. The same is done in a couple of other places (like looking at
->next). I'm *very* wary of relying on the compiler to not do stuff like
that. I unfortunately think you'll need similar things here.
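For reference, the idiom in question, sketched with gcc/clang __typeof__ and a
simplified stand-in node (the real node keeps the hashcode in a union):

    /* Linux-kernel-style ACCESS_ONCE(): force exactly one load (or store)
     * through a volatile-qualified lvalue, so a value captured in a local
     * cannot silently be re-read from shared memory by the compiler. */
    #define ACCESS_ONCE(x) (*(volatile __typeof__(x) *) &(x))

    typedef struct HypoNode
    {
        unsigned    hashcode;
    } HypoNode;

    int
    compare_hash(HypoNode *target_node, unsigned hashcode)
    {
        /* One explicit load; later uses of h can't be turned back into
         * fresh loads of target_node->hashcode. */
        unsigned    h = ACCESS_ONCE(target_node->hashcode);

        if (h == hashcode)
            return 0;
        return (h > hashcode) ? 1 : -1;
    }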
* I noticed the following comment:
+ * Note: We can't begin this operation until the clearing of the
+ * garbage list has been committed to memory, but since that was
+ * done using an atomic operation no explicit barrier is needed
+ * here.
+ *
Uh. I don't think that holds true. A memory barrier on one cpu doesn't
forbid all kinds of reordering on others.
* How is the consistency of the element data guaranteed? Afaics there
very well could be two concurrent CHashInsert()'s for the same key
interleaving inside their respective memcpy()s. It'll often be fine
for small element sizes, but even there the compiler might resort to
a char-by-char memcpy.
* Should CHashInsert perhaps be renamed to CHashSetKey or CHashUpdate -
since it's not actually just inserting?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
There hasn't been much activity on this patch recently, and Andres
has provided some feedback, hence switching it to Returned with feedback.
Regards,
--
Michael
On Sun, Nov 9, 2014 at 3:58 PM, Andres Freund <andres@2ndquadrant.com> wrote:
* There's several cases where it's noticeable that chash creates more
cacheline misses - that's imo a good part of why the single threaded
performance regresses. There's good reasons for the current design
without an inline bucket, namely that it makes the concurrency handling
easier. But I think that can be countered by still storing a pointer -
just to an element inside the bucket. Afaics that'd allow this to be
introduced quite easily?
It's not obvious to me how that would work. If you just throw those
elements on the garbage lists with everything else, it will soon be
the case that one bucket header ends up using the cell stored in some
other bucket, which isn't going to be awesome. So you need some kind
of more complex scheme to preserve locality.
Treating the element inside the bucket as a kind of one element freelist
seems quite workable to me. Test whether it's marked deleted, scan the
hazard pointer list to see whether it can be used. If yes, grand. If
not, go to the current code. The fact that the element is cacheline
local will make the test for its deletedness almost free. Yes, the
hazard pointer scan surely ain't free - but at least for cases like
bufmgr where reads are never less frequent than writes and very often
*much* more frequent I'm pretty sure it'd be a noticeable win.
Maybe. I'd expect that to radically increase the time spent doing
hazard pointer scans. The performance of this system depends heavily
on garbage collection not happening too often, and ISTR that the
performance changes significantly if you vary the amount of
overallocation.
I'm not sure. We're talking about something on the order of half a
percent on my tests, and lots of changes cause things to bounce around
by that much. I'm not sure it makes sense to get too worked up about
that if the alternative is to add a big pile of new complexity.
I saw things in the range of 3-4% on my laptop.
Mrmpf. Well, that sucks.
* I don't understand right now (but then I'm in a hotel bar) why it's
safe for CHashAddToGarbage() to clobber the hash value?
CHashBucketScan() relies on the chain being ordered by hashcode. No?
Another backend might just have been past the IsMarked() bit and look
at the hashcode?
I think the worst case scenario is that we falsely conclude that
there's a hash match when there really isn't. The ensuing memcmp()
will clarify matters.
Hm. Let me think about it for a while.
I was thinking that a spurious cmp < 0 could cause us to miss a
match - but that could only happen if the match just has been removed
from the list. Which is fine. Perhaps that case deserves to be mentioned
in the comment?
Maybe. I mean, the general principle here is that there may be some
difference between the apparent order of execution on one CPU as
opposed to another, and the code that uses the hash table has to be OK
with that - at least, unless it has independent methods of assuring
that things happen in the right order, but in that case that other
thing may well become the bottleneck. This is just one example of
that. You might also fail to see an insert that's just happened on
some other node but is not committed to main memory yet, which is not
really an issue we need to fix; it's just how things are. A general
discussion of this somewhere might be worthwhile, but it's pretty much
the same as any other highly-concurrent hashtable implementation you'll
find out there.
(It's also not really different from surrounding the hash table
operation, and only the hash table operation, with a big lock. Then
things can't change while you are scanning the bucket list, but they
can change by the time you can do anything with the returned value,
which is the same problem we have to cope with here.)
* Another thing I'm wondering about here is whether it's possible that
somebody is currently walking an "older version" of the bucket list
(i.e. is currently at an already unlinked element) and then visits a
newly inserted element with an 'apparently' out of order hash
value. That seems possible because the inserter might not have seen the
unlinked element. Afaics that's not a problem for searches - but I'm not
sure whether it couldn't cause a problem for concurrent inserts and
deletes.
This is why the delete-mark bit has to be part of the same atomic word
as the next-pointer. If somebody applies a delete-mark, a subsequent
attempt to insert an entry at that position will fail because the
CAS() of the next-word won't find the right value there - it will find
a delete-marked value, which will never be the value it's expecting.
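To illustrate that interlock, a sketch (not the actual code from chash.c;
cas_u32 stands in for whatever 4-byte compare-and-swap primitive the
platform provides):

/*
 * Attempt to link new_ptr into the chain after prev.  "expected" is the
 * successor we saw while positioning ourselves; it is always unmarked.
 */
static bool
sketch_insert_after(CHashNode *prev, CHashPtr expected,
                    CHashPtr new_ptr, CHashNode *new_node)
{
    Assert(!CHashPtrIsMarked(expected));

    new_node->next = expected;      /* new node points at the old successor */

    /*
     * If a concurrent deleter has set the mark bit in prev->next, the word
     * no longer equals "expected" (a marked value never can), so the CAS
     * fails and the caller must rescan the bucket chain.
     */
    return cas_u32(&prev->next, expected, new_ptr);
}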
* One thing I noticed while looking at that part of the code is:
+ h = target_node->un.hashcode;
+ if (h == hashcode)
+ cmp = memcmp(CHashNodeGetItem(target_node), key,
+ table->desc.key_size);
+ else if (h > hashcode)
+ cmp = 1;
+ else
+ cmp = -1;
Compilers can feel entirely free to reload local variables from memory
(i.e. use target_node->un.hashcode instead of h) in situations like
these. That's why I used the ACCESS_ONCE macros in my freelist
patch. The same is done in a couple of other places (like looking at
->next). I'm *very* wary of relying on the compiler to not do stuff like
that. I unfortunately think you'll need similar things here.
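For reference, the sort of thing I mean is essentially what the Linux
kernel's ACCESS_ONCE does (this relies on GCC's typeof extension):

/* Force exactly one load of a shared variable; no compiler re-reads. */
#define ACCESS_ONCE(x) (*((volatile typeof(x) *) &(x)))

/* e.g. read the hash code once and only work with the local copy */
h = ACCESS_ONCE(target_node->un.hashcode);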
Even if the compiler decides to fetch the target node's hash code
twice, it won't create a correctness issue, because it's guaranteed to
see the same value both times. The hash code is set before adding the
item to the list (with a memory barrier to enforce ordering), and it's
never reset until the item is put back on a new list - which requires
an intervening cycle of garbage collection and hazard-pointer
scanning, so we can't still be looking at the old item at that point.
* I noticed the following comment:
+ * Note: We can't begin this operation until the clearing of the
+ * garbage list has been committed to memory, but since that was
+ * done using an atomic operation no explicit barrier is needed
+ * here.
Uh. I don't think that holds true. A memory barrier on one CPU doesn't
forbid all kinds of reordering on others.
Hmm, you might be right about that. I thought that was guaranteed,
but a quick look at http://en.wikipedia.org/wiki/Memory_ordering
suggests otherwise.
* How is the consistency of the element data guaranteed? Afaics there
very well could be two concurrent CHashInsert()'s for the same key
interleaving inside their respective memcpy()s. It'll often be fine
for small element sizes, but even there the compiler might resort to
a char-by-char memcpy.
The two operations would have to CAS() the same next-pointer, and one
of those operations will fail. Note that the item is completely
initialized *before* we insert it into the hash table, so the premise
that two memcpy() operations can target the same address is just
wrong.
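In other words, the lifecycle is initialize first, publish second.
Roughly (again just a sketch, with cas_u32 standing in for the real
4-byte CAS primitive):

CHashNode *n = CHashTableGetNode(table, new_ptr);   /* still private to us */

/* Fill in the entire element, key and all, before anyone can see it. */
memcpy(CHashNodeGetItem(n), entry, table->desc.element_size);
n->un.hashcode = hashcode;
pg_memory_barrier();            /* make those writes visible before linking */

/* Only now does the node become reachable from the bucket chain. */
n->next = successor;
if (!cas_u32(&prev->next, successor, new_ptr))
{
    /*
     * Lost the race: rescan the chain.  If a duplicate key got there
     * first, its contents are copied back to the caller instead.
     */
}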
* Should CHashInsert perhaps be renamed to CHashSetKey or CHashUpdate -
since it's not actually just inserting?
This may be the root of the confusion. It *is* just inserting. It
does not update; there is no update operation, and like most
algorithms for highly-concurrent hash tables, the algorithm can't
support such an operation. The copy goes the other way: if an insert
fails, the caller gets back the value already present in the table,
which cannot be concurrently getting modified for the same reasons the
hashcode can't be modified while a bucket scan is looking at it.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
This developed a slight merge conflict. I've rebased the attached
version, and I also took the step of getting rid of buf_table.c, as I
think I proposed somewhere upthread. This avoids the overhead of
constructing a BufferTag only to copy it into a BufferLookupEnt, plus
some function calls and so forth. A quick-and-dirty test suggests
this might not have cut down on the 1-client overhead much, but I
think it's worth doing anyway: it's certainly saving a few cycles, and
I don't think it's complicating anything measurably.
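For concreteness, the buffer-lookup path in PrefetchBuffer now boils down
to this (simplified from the attached patch):

    BufferLookupEnt ent;    /* the key *is* the BufferTag, no separate copy */

    INIT_BUFFERTAG(ent.key, reln->rd_smgr->smgr_rnode.node,
                   forkNum, blockNum);
    if (!CHashSearch(SharedBufHash, &ent))
        smgrprefetch(reln->rd_smgr, forkNum, blockNum); /* not cached */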
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
chash-buftable-v2.patch (text/x-patch; charset=US-ASCII)
diff --git a/contrib/Makefile b/contrib/Makefile
index 195d447..141ef0a 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
earthdistance \
file_fdw \
fuzzystrmatch \
+ hashtest \
hstore \
intagg \
intarray \
diff --git a/contrib/hashtest/Makefile b/contrib/hashtest/Makefile
new file mode 100644
index 0000000..3ee42f8
--- /dev/null
+++ b/contrib/hashtest/Makefile
@@ -0,0 +1,18 @@
+# contrib/hashtest/Makefile
+
+MODULE_big = hashtest
+OBJS = hashtest.o
+
+EXTENSION = hashtest
+DATA = hashtest--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/hashtest
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/hashtest/hashtest--1.0.sql b/contrib/hashtest/hashtest--1.0.sql
new file mode 100644
index 0000000..e271baf
--- /dev/null
+++ b/contrib/hashtest/hashtest--1.0.sql
@@ -0,0 +1,52 @@
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION hashtest" to load this file. \quit
+
+CREATE FUNCTION chash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION chash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'chash_collision_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_insert_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_insert_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_search_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_search_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_delete_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_delete_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_concurrent_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_concurrent_test'
+LANGUAGE C;
+
+CREATE FUNCTION dynahash_collision_test()
+RETURNS void
+AS 'MODULE_PATHNAME', 'dynahash_collision_test'
+LANGUAGE C;
diff --git a/contrib/hashtest/hashtest.c b/contrib/hashtest/hashtest.c
new file mode 100644
index 0000000..172a5bb
--- /dev/null
+++ b/contrib/hashtest/hashtest.c
@@ -0,0 +1,527 @@
+/*-------------------------------------------------------------------------
+ * hashtest.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "funcapi.h"
+#include "libpq/auth.h"
+#include "lib/stringinfo.h"
+#include "miscadmin.h"
+#include "portability/instr_time.h"
+#include "storage/ipc.h"
+#include "utils/chash.h"
+
+PG_MODULE_MAGIC;
+
+void _PG_init(void);
+Datum chash_insert_test(PG_FUNCTION_ARGS);
+Datum chash_search_test(PG_FUNCTION_ARGS);
+Datum chash_delete_test(PG_FUNCTION_ARGS);
+Datum chash_concurrent_test(PG_FUNCTION_ARGS);
+Datum chash_collision_test(PG_FUNCTION_ARGS);
+Datum dynahash_insert_test(PG_FUNCTION_ARGS);
+Datum dynahash_search_test(PG_FUNCTION_ARGS);
+Datum dynahash_delete_test(PG_FUNCTION_ARGS);
+Datum dynahash_concurrent_test(PG_FUNCTION_ARGS);
+Datum dynahash_collision_test(PG_FUNCTION_ARGS);
+static void hashtest_shmem_startup(void);
+
+PG_FUNCTION_INFO_V1(chash_insert_test);
+PG_FUNCTION_INFO_V1(chash_search_test);
+PG_FUNCTION_INFO_V1(chash_delete_test);
+PG_FUNCTION_INFO_V1(chash_concurrent_test);
+PG_FUNCTION_INFO_V1(chash_collision_test);
+PG_FUNCTION_INFO_V1(dynahash_insert_test);
+PG_FUNCTION_INFO_V1(dynahash_search_test);
+PG_FUNCTION_INFO_V1(dynahash_delete_test);
+PG_FUNCTION_INFO_V1(dynahash_concurrent_test);
+PG_FUNCTION_INFO_V1(dynahash_collision_test);
+
+typedef struct
+{
+ uint32 key;
+ uint32 val;
+} hentry;
+
+static CHashDescriptor cdesc = {
+ "hashtest-chash", /* name */
+ 1048576, /* capacity */
+ sizeof(hentry), /* element size */
+ sizeof(uint32) /* key size */
+};
+
+#define DYNAHASH_PARTITIONS 16
+
+static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
+static CHashTable chash;
+static HTAB *dynahash;
+static LWLockId dynahash_lock[DYNAHASH_PARTITIONS];
+static ClientAuthentication_hook_type original_client_auth_hook = NULL;
+
+static void hashtest_client_auth_hook(Port *port, int status);
+static void chash_write_stats_to_log(int code, Datum dummy);
+
+#define dynahash_get_lock(hashcode) \
+ (dynahash_lock[(hashcode) % DYNAHASH_PARTITIONS])
+
+void
+_PG_init(void)
+{
+ Size cs;
+ Size ds;
+
+ if (!process_shared_preload_libraries_in_progress)
+ return;
+ prev_shmem_startup_hook = shmem_startup_hook;
+ shmem_startup_hook = hashtest_shmem_startup;
+ chash = CHashBootstrap(&cdesc);
+ cs = CHashEstimateSize(chash);
+ RequestAddinShmemSpace(cs);
+ ds = hash_estimate_size(cdesc.capacity, cdesc.element_size);
+ RequestAddinShmemSpace(ds);
+ elog(LOG, "chash: %u bytes; dynahash: %u bytes", (unsigned) cs,
+ (unsigned) ds);
+ RequestAddinLWLocks(DYNAHASH_PARTITIONS);
+ original_client_auth_hook = ClientAuthentication_hook;
+ ClientAuthentication_hook = hashtest_client_auth_hook;
+
+}
+
+static void
+hashtest_client_auth_hook(Port *port, int status)
+{
+ if (original_client_auth_hook)
+ original_client_auth_hook(port, status);
+ on_proc_exit(chash_write_stats_to_log, (Datum) 0);
+}
+
+static void
+chash_write_stats_to_log(int code, Datum dummy)
+{
+ uint64 stats[CHS_NumberOfStatistics];
+ CHashStatisticsType i;
+ StringInfoData buf;
+
+ CHashStatistics(chash, stats);
+ initStringInfo(&buf);
+
+ for (i = 0; i < CHS_NumberOfStatistics; ++i)
+ {
+ if (stats[i] == 0)
+ continue;
+ appendStringInfo(&buf, UINT64_FORMAT " %s; ", stats[i],
+ CHashStatisticsNames[i]);
+ }
+
+ if (buf.len > 1)
+ {
+ buf.data[buf.len-2] = '\0';
+ elog(LOG, "chash statistics: %s", buf.data);
+ }
+}
+
+static void
+hashtest_shmem_startup(void)
+{
+ HASHCTL info;
+ uint32 i;
+
+ if (prev_shmem_startup_hook)
+ prev_shmem_startup_hook();
+
+ /* Initialize concurrent hash table. */
+ chash = CHashInitialize(chash, &cdesc);
+
+ /* Initialize shared dynahash table. */
+ info.keysize = cdesc.key_size;
+ info.entrysize = cdesc.element_size;
+ info.hash = tag_hash;
+ info.num_partitions = DYNAHASH_PARTITIONS;
+
+ dynahash = ShmemInitHash("hashtest-dynahash",
+ cdesc.capacity, cdesc.capacity,
+ &info,
+ HASH_ELEM | HASH_FUNCTION | HASH_PARTITION);
+
+ for (i = 0; i < DYNAHASH_PARTITIONS; ++i)
+ dynahash_lock[i] = LWLockAssign();
+}
+
+Datum
+chash_insert_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ e.val = i * 31;
+ ok = CHashInsert(chash, &e);
+ if (!ok)
+ elog(LOG, "insert %u: failed", i);
+ ok = CHashInsert(chash, &e);
+ if (ok)
+ elog(LOG, "insert %u: worked twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_search_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ ok = CHashSearch(chash, &e);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (e.val != e.key * 31)
+ elog(LOG, "search %u: found %u", i, e.val);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_delete_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ ok = CHashDelete(chash, &e);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ ok = CHashDelete(chash, &e);
+ if (ok)
+ elog(LOG, "delete %u: found twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_concurrent_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+ uint32 seed = MyProcPid << 16;
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = seed | i;
+ e.val = MyProcPid;
+ ok = CHashInsert(chash, &e);
+ if (!ok)
+ elog(LOG, "insert %u: found", i);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = seed | i;
+ e.val = 0;
+ ok = CHashSearch(chash, &e);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "search %u: not found", i);
+ while (!CHashSearch(chash, &e))
+ ++retry;
+ elog(LOG, "search %u: eventually found it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ if (e.val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u", i, (unsigned) MyProcPid, e.val);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = seed | i;
+ ok = CHashDelete(chash, &e);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "delete %u: not found", i);
+ while (!CHashDelete(chash, &e))
+ ++retry;
+ elog(LOG, "delete %u: eventually deleted it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+chash_collision_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ hentry e;
+
+ /* Don't stack-allocate this. */
+ static bool mine[10000];
+
+ memset(mine, 0, 10000 * sizeof(bool));
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ e.key = i;
+ e.val = MyProcPid;
+ ok = CHashInsert(chash, &e);
+ if (ok)
+ mine[i] = true;
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ if (!mine[i])
+ continue;
+ e.key = i;
+ ok = CHashSearch(chash, &e);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (e.val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u",
+ i, (unsigned) MyProcPid, e.val);
+ ok = CHashDelete(chash, &e);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+static bool
+dynahash_insert(uint32 key, uint32 val)
+{
+ bool found;
+ uint32 hashcode;
+ hentry *e;
+ LWLockId lockid;
+
+ hashcode = get_hash_value(dynahash, (void *) &key);
+ lockid = dynahash_get_lock(hashcode);
+ LWLockAcquire(lockid, LW_EXCLUSIVE);
+ e = hash_search_with_hash_value(dynahash, (void *) &key,
+ hashcode, HASH_ENTER, &found);
+ if (!found)
+ e->val = val;
+ LWLockRelease(lockid);
+
+ return !found;
+}
+
+static bool
+dynahash_search(uint32 key, uint32 *val)
+{
+ uint32 hashcode;
+ hentry *e;
+ LWLockId lockid;
+
+ hashcode = get_hash_value(dynahash, (void *) &key);
+ lockid = dynahash_get_lock(hashcode);
+ LWLockAcquire(lockid, LW_SHARED);
+ e = hash_search_with_hash_value(dynahash, (void *) &key,
+ hashcode, HASH_FIND, NULL);
+ if (e)
+ *val = e->val;
+ LWLockRelease(lockid);
+
+ return e != NULL;
+}
+
+static bool
+dynahash_delete(uint32 key)
+{
+ uint32 hashcode;
+ hentry *e;
+ LWLockId lockid;
+
+ hashcode = get_hash_value(dynahash, (void *) &key);
+ lockid = dynahash_get_lock(hashcode);
+ LWLockAcquire(lockid, LW_EXCLUSIVE);
+ e = hash_search_with_hash_value(dynahash, (void *) &key,
+ hashcode, HASH_REMOVE, NULL);
+ LWLockRelease(lockid);
+
+ return e != NULL;
+}
+
+Datum
+dynahash_insert_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_insert(i, i * 31);
+ if (!ok)
+ elog(LOG, "insert %u: failed", i);
+ ok = dynahash_insert(i, i * 31);
+ if (ok)
+ elog(LOG, "insert %u: worked twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_search_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+ uint32 val;
+
+ ok = dynahash_search(i, &val);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (val != i* 31)
+ elog(LOG, "search %u: found %u", i, val);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_delete_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+
+ for (i = 0; i < 1000000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_delete(i);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ ok = dynahash_delete(i);
+ if (ok)
+ elog(LOG, "delete %u: found twice", i);
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_concurrent_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ uint32 val;
+ uint32 seed = MyProcPid << 16;
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_insert(seed | i, MyProcPid);
+ if (!ok)
+ elog(LOG, "insert %u: found", i);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_search(seed | i, &val);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "search %u: not found", i);
+ while (!dynahash_search(seed | i, &val))
+ ++retry;
+ elog(LOG, "search %u: eventually found it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ if (val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u",
+ i, (unsigned) MyProcPid, val);
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_delete(seed | i);
+ if (!ok)
+ {
+ uint64 retry = 1;
+ elog(LOG, "delete %u: not found", i);
+ while (!dynahash_delete(seed | i))
+ ++retry;
+ elog(LOG, "delete %u: eventually deleted it after "
+ UINT64_FORMAT " retries", i, retry);
+ }
+ }
+
+ PG_RETURN_VOID();
+}
+
+Datum
+dynahash_collision_test(PG_FUNCTION_ARGS)
+{
+ uint32 i;
+ uint32 val;
+
+ /* Don't stack-allocate this. */
+ static bool mine[10000];
+
+ memset(mine, 0, 10000 * sizeof(bool));
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ ok = dynahash_insert(i, MyProcPid);
+ if (ok)
+ mine[i] = true;
+ }
+
+ for (i = 0; i < 10000; ++i)
+ {
+ bool ok;
+
+ if (!mine[i])
+ continue;
+ ok = dynahash_search(i, &val);
+ if (!ok)
+ elog(LOG, "search %u: not found", i);
+ else if (val != MyProcPid)
+ elog(LOG, "search %u: expected %u found %u",
+ i, (unsigned) MyProcPid, val);
+ ok = dynahash_delete(i);
+ if (!ok)
+ elog(LOG, "delete %u: not found", i);
+ }
+
+ PG_RETURN_VOID();
+}
diff --git a/contrib/hashtest/hashtest.control b/contrib/hashtest/hashtest.control
new file mode 100644
index 0000000..b8e0f01
--- /dev/null
+++ b/contrib/hashtest/hashtest.control
@@ -0,0 +1,4 @@
+comment = 'hash testing code'
+default_version = '1.0'
+module_pathname = '$libdir/hashtest'
+relocatable = true
diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile
index 2c10fba..b30a0da 100644
--- a/src/backend/storage/buffer/Makefile
+++ b/src/backend/storage/buffer/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/buffer
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = buf_table.o buf_init.o bufmgr.o freelist.o localbuf.o
+OBJS = buf_init.o bufmgr.o freelist.o localbuf.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/README b/src/backend/storage/buffer/README
index a4ebbcc..86697e9 100644
--- a/src/backend/storage/buffer/README
+++ b/src/backend/storage/buffer/README
@@ -100,30 +100,10 @@ Buffer Manager's Internal Locking
Before PostgreSQL 8.1, all operations of the shared buffer manager itself
were protected by a single system-wide lock, the BufMgrLock, which
-unsurprisingly proved to be a source of contention. The new locking scheme
-avoids grabbing system-wide exclusive locks in common code paths. It works
-like this:
-
-* There is a system-wide LWLock, the BufMappingLock, that notionally
-protects the mapping from buffer tags (page identifiers) to buffers.
-(Physically, it can be thought of as protecting the hash table maintained
-by buf_table.c.) To look up whether a buffer exists for a tag, it is
-sufficient to obtain share lock on the BufMappingLock. Note that one
-must pin the found buffer, if any, before releasing the BufMappingLock.
-To alter the page assignment of any buffer, one must hold exclusive lock
-on the BufMappingLock. This lock must be held across adjusting the buffer's
-header fields and changing the buf_table hash table. The only common
-operation that needs exclusive lock is reading in a page that was not
-in shared buffers already, which will require at least a kernel call
-and usually a wait for I/O, so it will be slow anyway.
-
-* As of PG 8.2, the BufMappingLock has been split into NUM_BUFFER_PARTITIONS
-separate locks, each guarding a portion of the buffer tag space. This allows
-further reduction of contention in the normal code paths. The partition
-that a particular buffer tag belongs to is determined from the low-order
-bits of the tag's hash value. The rules stated above apply to each partition
-independently. If it is necessary to lock more than one partition at a time,
-they must be locked in partition-number order to avoid risk of deadlock.
+unsurprisingly proved to be a source of contention. In subsequent releases,
+this lock was split into NUM_BUFFER_PARTITIONS locks, each guarding a portion
+of the buffer tag space. Even this proved to be too much contention, so
+now we use a highly concurrent hashtable (see chash.c and chash.h).
* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
exclusion for operations that access the buffer free list or select
diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
deleted file mode 100644
index 6ed47d5..0000000
--- a/src/backend/storage/buffer/buf_table.c
+++ /dev/null
@@ -1,163 +0,0 @@
-/*-------------------------------------------------------------------------
- *
- * buf_table.c
- * routines for mapping BufferTags to buffer indexes.
- *
- * Note: the routines in this file do no locking of their own. The caller
- * must hold a suitable lock on the appropriate BufMappingLock, as specified
- * in the comments. We can't do the locking inside these functions because
- * in most cases the caller needs to adjust the buffer header contents
- * before the lock is released (see notes in README).
- *
- *
- * Portions Copyright (c) 1996-2015, PostgreSQL Global Development Group
- * Portions Copyright (c) 1994, Regents of the University of California
- *
- *
- * IDENTIFICATION
- * src/backend/storage/buffer/buf_table.c
- *
- *-------------------------------------------------------------------------
- */
-#include "postgres.h"
-
-#include "storage/bufmgr.h"
-#include "storage/buf_internals.h"
-
-
-/* entry for buffer lookup hashtable */
-typedef struct
-{
- BufferTag key; /* Tag of a disk page */
- int id; /* Associated buffer ID */
-} BufferLookupEnt;
-
-static HTAB *SharedBufHash;
-
-
-/*
- * Estimate space needed for mapping hashtable
- * size is the desired hash table size (possibly more than NBuffers)
- */
-Size
-BufTableShmemSize(int size)
-{
- return hash_estimate_size(size, sizeof(BufferLookupEnt));
-}
-
-/*
- * Initialize shmem hash table for mapping buffers
- * size is the desired hash table size (possibly more than NBuffers)
- */
-void
-InitBufTable(int size)
-{
- HASHCTL info;
-
- /* assume no locking is needed yet */
-
- /* BufferTag maps to Buffer */
- info.keysize = sizeof(BufferTag);
- info.entrysize = sizeof(BufferLookupEnt);
- info.num_partitions = NUM_BUFFER_PARTITIONS;
-
- SharedBufHash = ShmemInitHash("Shared Buffer Lookup Table",
- size, size,
- &info,
- HASH_ELEM | HASH_BLOBS | HASH_PARTITION);
-}
-
-/*
- * BufTableHashCode
- * Compute the hash code associated with a BufferTag
- *
- * This must be passed to the lookup/insert/delete routines along with the
- * tag. We do it like this because the callers need to know the hash code
- * in order to determine which buffer partition to lock, and we don't want
- * to do the hash computation twice (hash_any is a bit slow).
- */
-uint32
-BufTableHashCode(BufferTag *tagPtr)
-{
- return get_hash_value(SharedBufHash, (void *) tagPtr);
-}
-
-/*
- * BufTableLookup
- * Lookup the given BufferTag; return buffer ID, or -1 if not found
- *
- * Caller must hold at least share lock on BufMappingLock for tag's partition
- */
-int
-BufTableLookup(BufferTag *tagPtr, uint32 hashcode)
-{
- BufferLookupEnt *result;
-
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- (void *) tagPtr,
- hashcode,
- HASH_FIND,
- NULL);
-
- if (!result)
- return -1;
-
- return result->id;
-}
-
-/*
- * BufTableInsert
- * Insert a hashtable entry for given tag and buffer ID,
- * unless an entry already exists for that tag
- *
- * Returns -1 on successful insertion. If a conflicting entry exists
- * already, returns the buffer ID in that entry.
- *
- * Caller must hold exclusive lock on BufMappingLock for tag's partition
- */
-int
-BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
-{
- BufferLookupEnt *result;
- bool found;
-
- Assert(buf_id >= 0); /* -1 is reserved for not-in-table */
- Assert(tagPtr->blockNum != P_NEW); /* invalid tag */
-
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- (void *) tagPtr,
- hashcode,
- HASH_ENTER,
- &found);
-
- if (found) /* found something already in the table */
- return result->id;
-
- result->id = buf_id;
-
- return -1;
-}
-
-/*
- * BufTableDelete
- * Delete the hashtable entry for given tag (which must exist)
- *
- * Caller must hold exclusive lock on BufMappingLock for tag's partition
- */
-void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
-{
- BufferLookupEnt *result;
-
- result = (BufferLookupEnt *)
- hash_search_with_hash_value(SharedBufHash,
- (void *) tagPtr,
- hashcode,
- HASH_REMOVE,
- NULL);
-
- if (!result) /* shouldn't happen */
- elog(ERROR, "shared buffer hash table corrupted");
-}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 7eb2d22..4435b3e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -24,9 +24,7 @@
* MarkBufferDirty() -- mark a pinned buffer's contents as "dirty".
* The disk write is delayed until buffer replacement or checkpoint.
*
- * See also these files:
- * freelist.c -- chooses victim for buffer replacement
- * buf_table.c -- manages the buffer lookup table
+ * See also freelist.c, which chooses victim for buffer replacement
*/
#include "postgres.h"
@@ -47,10 +45,25 @@
#include "storage/proc.h"
#include "storage/smgr.h"
#include "storage/standby.h"
+#include "utils/chash.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
#include "utils/timestamp.h"
+/* entry for buffer lookup hashtable */
+typedef struct
+{
+ BufferTag key; /* Tag of a disk page */
+ int id; /* Associated buffer ID */
+} BufferLookupEnt;
+
+static CHashDescriptor SharedBufDescriptor = {
+ "buffer lookup table",
+ 0,
+ sizeof(BufferLookupEnt),
+ sizeof(BufferTag)
+};
+static CHashTable SharedBufHash;
/* Note: these two macros only work on shared buffers, not local ones! */
#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
@@ -138,6 +151,38 @@ static inline int32 GetPrivateRefCount(Buffer buffer);
static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
/*
+ * Estimate space needed for mapping hashtable
+ * size is the desired hash table size (possibly more than NBuffers)
+ */
+Size
+BufMgrShmemSize(int size)
+{
+ if (SharedBufHash == NULL)
+ {
+ SharedBufDescriptor.capacity = size;
+ SharedBufHash = CHashBootstrap(&SharedBufDescriptor);
+ }
+
+ return CHashEstimateSize(SharedBufHash);
+}
+
+/*
+ * Initialize shmem hash table for mapping buffers
+ * size is the desired hash table size (possibly more than NBuffers)
+ */
+void
+BufMgrInitShmem(int size)
+{
+ if (SharedBufHash == NULL || !IsUnderPostmaster)
+ {
+ Assert(SharedBufDescriptor.capacity == 0 ||
+ SharedBufDescriptor.capacity == size);
+ SharedBufDescriptor.capacity = size;
+ SharedBufHash = CHashInitialize(SharedBufHash, &SharedBufDescriptor);
+ }
+}
+
+/*
 * Ensure that the PrivateRefCountArray has sufficient space to store one
* more entry. This has to be called before using NewPrivateRefCountEntry() to
* fill a new entry - but it's perfectly fine to not use a reserved entry.
@@ -444,26 +489,14 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
}
else
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- int buf_id;
+ BufferLookupEnt ent; /* identity of requested block */
/* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
+ INIT_BUFFERTAG(ent.key, reln->rd_smgr->smgr_rnode.node,
forkNum, blockNum);
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
-
- /* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- LWLockRelease(newPartitionLock);
-
/* If not in buffers, initiate prefetch */
- if (buf_id < 0)
+ if (!CHashSearch(SharedBufHash, &ent))
smgrprefetch(reln->rd_smgr, forkNum, blockNum);
/*
@@ -870,43 +903,39 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
BufferAccessStrategy strategy,
bool *foundPtr)
{
- BufferTag newTag; /* identity of requested block */
- uint32 newHash; /* hash value for newTag */
- LWLock *newPartitionLock; /* buffer partition lock for it */
- BufferTag oldTag; /* previous identity of selected buffer */
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
+ BufferLookupEnt newEnt; /* identity of requested block */
+ BufferLookupEnt oldEnt; /* previous identity of selected buffer */
BufFlags oldFlags;
- int buf_id;
volatile BufferDesc *buf;
bool valid;
/* create a tag so we can lookup the buffer */
- INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
-
- /* determine its hash code and partition lock ID */
- newHash = BufTableHashCode(&newTag);
- newPartitionLock = BufMappingPartitionLock(newHash);
+ INIT_BUFFERTAG(newEnt.key, smgr->smgr_rnode.node, forkNum, blockNum);
/* see if the block is in the buffer pool already */
- LWLockAcquire(newPartitionLock, LW_SHARED);
- buf_id = BufTableLookup(&newTag, newHash);
- if (buf_id >= 0)
+start:
+ if (CHashSearch(SharedBufHash, &newEnt))
{
+ BufferDesc *foundbuf;
+
/*
* Found it. Now, pin the buffer so no one can steal it from the
- * buffer pool, and check to see if the correct data has been loaded
- * into the buffer.
+ * buffer pool.
*/
- buf = &BufferDescriptors[buf_id];
+ foundbuf = &BufferDescriptors[newEnt.id];
- valid = PinBuffer(buf, strategy);
+ valid = PinBuffer(foundbuf, strategy);
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
+ /* Check whether someone recycled the buffer before we pinned it. */
+ if (!BUFFERTAGS_EQUAL(newEnt.key, foundbuf->tag))
+ {
+ UnpinBuffer(foundbuf, true);
+ goto start;
+ }
*foundPtr = TRUE;
+ /* Check to see if the correct data has been loaded into the buffer. */
if (!valid)
{
/*
@@ -916,7 +945,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* own read attempt if the page is still not BM_VALID.
* StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(foundbuf, true))
{
/*
* If we get here, previous attempts to read the buffer must
@@ -926,15 +955,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
- return buf;
+ return foundbuf;
}
- /*
- * Didn't find it in the buffer pool. We'll have to initialize a new
- * buffer. Remember to unlock the mapping lock while doing the work.
- */
- LWLockRelease(newPartitionLock);
-
/* Loop here in case we have to try another victim buffer */
for (;;)
{
@@ -1041,42 +1064,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
*/
if (oldFlags & BM_TAG_VALID)
{
- /*
- * Need to compute the old tag's hashcode and partition lock ID.
- * XXX is it worth storing the hashcode in BufferDesc so we need
- * not recompute it here? Probably not.
- */
- oldTag = buf->tag;
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
- /*
- * Must lock the lower-numbered partition first to avoid
- * deadlocks.
- */
- if (oldPartitionLock < newPartitionLock)
- {
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- else if (oldPartitionLock > newPartitionLock)
- {
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
- }
- else
- {
- /* only one partition, only one lock */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- }
- }
- else
- {
- /* if it wasn't valid, we need only the new partition */
- LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
- /* these just keep the compiler quiet about uninit variables */
- oldHash = 0;
- oldPartitionLock = 0;
+ /* Save old tag. */
+ oldEnt.key = buf->tag;
}
/*
@@ -1086,32 +1075,33 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* Note that we have not yet removed the hashtable entry for the old
* tag.
*/
- buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
- if (buf_id >= 0)
+enter:
+ newEnt.id = buf->buf_id;
+ if (!CHashInsert(SharedBufHash, &newEnt))
{
+ BufferDesc *foundbuf;
+
+ /*
+ * We've got a collision, apparently: it looks like someone else
+ * did what we were about to do. We can handle this as if we had
+ * found the buffer in the pool in the first place, but we must
+ * recheck the buffer tag after pinning it, because it could still
+ * get renamed under us.
+ */
+ foundbuf = &BufferDescriptors[newEnt.id];
+ valid = PinBuffer(foundbuf, strategy);
+ if (!BUFFERTAGS_EQUAL(newEnt.key, foundbuf->tag))
+ {
+ UnpinBuffer(foundbuf, true);
+ goto enter;
+ }
+
/*
- * Got a collision. Someone has already done what we were about to
- * do. We'll just handle this as if it were found in the buffer
- * pool in the first place. First, give up the buffer we were
- * planning to use.
+ * Collision confirmed. Give up the buffer we were planning to
+ * use.
*/
UnpinBuffer(buf, true);
- /* Can give up that buffer's mapping partition lock now */
- if ((oldFlags & BM_TAG_VALID) &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
-
- /* remaining code should match code at top of routine */
-
- buf = &BufferDescriptors[buf_id];
-
- valid = PinBuffer(buf, strategy);
-
- /* Can release the mapping lock as soon as we've pinned it */
- LWLockRelease(newPartitionLock);
-
*foundPtr = TRUE;
if (!valid)
@@ -1123,7 +1113,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* then set up our own read attempt if the page is still not
* BM_VALID. StartBufferIO does it all.
*/
- if (StartBufferIO(buf, true))
+ if (StartBufferIO(foundbuf, true))
{
/*
* If we get here, previous attempts to read the buffer
@@ -1133,7 +1123,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
}
}
- return buf;
+ return foundbuf;
}
/*
@@ -1152,11 +1142,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
break;
UnlockBufHdr(buf);
- BufTableDelete(&newTag, newHash);
- if ((oldFlags & BM_TAG_VALID) &&
- oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- LWLockRelease(newPartitionLock);
+ if (!CHashDelete(SharedBufHash, &newEnt.key))
+ elog(ERROR, "shared buffer hash table corrupted");
UnpinBuffer(buf, true);
}
@@ -1168,7 +1155,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
* the old content is no longer relevant. (The usage_count starts out at
* 1 so that the buffer can survive one clock-sweep pass.)
*/
- buf->tag = newTag;
+ buf->tag = newEnt.key;
buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
if (relpersistence == RELPERSISTENCE_PERMANENT)
buf->flags |= BM_TAG_VALID | BM_PERMANENT;
@@ -1178,14 +1165,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
UnlockBufHdr(buf);
- if (oldFlags & BM_TAG_VALID)
- {
- BufTableDelete(&oldTag, oldHash);
- if (oldPartitionLock != newPartitionLock)
- LWLockRelease(oldPartitionLock);
- }
-
- LWLockRelease(newPartitionLock);
+ if ((oldFlags & BM_TAG_VALID) != 0 &&
+ !CHashDelete(SharedBufHash, &oldEnt))
+ elog(ERROR, "shared buffer hash table corrupted");
/*
* Buffer contents are currently invalid. Try to get the io_in_progress
@@ -1220,42 +1202,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
static void
InvalidateBuffer(volatile BufferDesc *buf)
{
- BufferTag oldTag;
- uint32 oldHash; /* hash value for oldTag */
- LWLock *oldPartitionLock; /* buffer partition lock for it */
+ BufferLookupEnt oldEnt;
BufFlags oldFlags;
/* Save the original buffer tag before dropping the spinlock */
- oldTag = buf->tag;
-
- UnlockBufHdr(buf);
-
- /*
- * Need to compute the old tag's hashcode and partition lock ID. XXX is it
- * worth storing the hashcode in BufferDesc so we need not recompute it
- * here? Probably not.
- */
- oldHash = BufTableHashCode(&oldTag);
- oldPartitionLock = BufMappingPartitionLock(oldHash);
-
-retry:
-
- /*
- * Acquire exclusive mapping lock in preparation for changing the buffer's
- * association.
- */
- LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-
- /* Re-lock the buffer header */
- LockBufHdr(buf);
-
- /* If it's changed while we were waiting for lock, do nothing */
- if (!BUFFERTAGS_EQUAL(buf->tag, oldTag))
- {
- UnlockBufHdr(buf);
- LWLockRelease(oldPartitionLock);
- return;
- }
+ oldEnt.key = buf->tag;
/*
* We assume the only reason for it to be pinned is that someone else is
@@ -1266,15 +1217,21 @@ retry:
* yet done StartBufferIO, WaitIO will fall through and we'll effectively
* be busy-looping here.)
*/
- if (buf->refcount != 0)
+ while (buf->refcount != 0)
{
UnlockBufHdr(buf);
- LWLockRelease(oldPartitionLock);
/* safety check: should definitely not be our *own* pin */
if (GetPrivateRefCount(buf->buf_id) > 0)
elog(ERROR, "buffer is pinned in InvalidateBuffer");
WaitIO(buf);
- goto retry;
+ LockBufHdr(buf);
+
+ /* If it's changed while we were waiting for lock, do nothing */
+ if (!BUFFERTAGS_EQUAL(buf->tag, oldEnt.key))
+ {
+ UnlockBufHdr(buf);
+ return;
+ }
}
/*
@@ -1291,13 +1248,9 @@ retry:
/*
* Remove the buffer from the lookup hashtable, if it was in there.
*/
- if (oldFlags & BM_TAG_VALID)
- BufTableDelete(&oldTag, oldHash);
-
- /*
- * Done with mapping lock.
- */
- LWLockRelease(oldPartitionLock);
+ if ((oldFlags & BM_TAG_VALID) != 0 &&
+ !CHashDelete(SharedBufHash, &oldEnt))
+ elog(ERROR, "shared buffer hash table corrupted");
/*
* Insert the buffer at the head of the list of free buffers.
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3add619..2410dfc 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -432,7 +432,7 @@ StrategyShmemSize(void)
Size size = 0;
/* size of lookup hash table ... see comment in StrategyInitialize */
- size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+ size = add_size(size, BufMgrShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
/* size of the shared replacement strategy control block */
size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -462,7 +462,7 @@ StrategyInitialize(bool init)
* happening in each partition concurrently, so we could need as many as
* NBuffers + NUM_BUFFER_PARTITIONS entries.
*/
- InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
+ BufMgrInitShmem(NBuffers + NUM_BUFFER_PARTITIONS);
/*
* Get or create the shared strategy control block
diff --git a/src/backend/storage/ipc/shmem.c b/src/backend/storage/ipc/shmem.c
index 250e312..5f94f57 100644
--- a/src/backend/storage/ipc/shmem.c
+++ b/src/backend/storage/ipc/shmem.c
@@ -423,6 +423,29 @@ ShmemInitStruct(const char *name, Size size, bool *foundPtr)
return structPtr;
}
+/*
+ * ShmemAttachStruct -- Attach to an existing structure in shared memory.
+ */
+void *
+ShmemAttachStruct(const char *name)
+{
+ ShmemIndexEnt *result;
+ void *ptr;
+ bool found;
+
+ LWLockAcquire(ShmemIndexLock, LW_SHARED);
+
+ result = (ShmemIndexEnt *)
+ hash_search(ShmemIndex, name, HASH_FIND, &found);
+ if (!found || result == NULL)
+ elog(ERROR, "shared memory structure %s not found", name);
+ ptr = result->location;
+ Assert(ptr != NULL);
+
+ LWLockRelease(ShmemIndexLock);
+
+ return ptr;
+}
/*
* Add two Size values, checking for overflow
diff --git a/src/backend/utils/hash/Makefile b/src/backend/utils/hash/Makefile
index 05d347c..5d53382 100644
--- a/src/backend/utils/hash/Makefile
+++ b/src/backend/utils/hash/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/utils/hash
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = dynahash.o hashfn.o
+OBJS = chash.o dynahash.o hashfn.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/utils/hash/chash.c b/src/backend/utils/hash/chash.c
new file mode 100644
index 0000000..0d4dc78
--- /dev/null
+++ b/src/backend/utils/hash/chash.c
@@ -0,0 +1,1075 @@
+/*-------------------------------------------------------------------------
+ *
+ * chash.c
+ * concurrent hash tables
+ *
+ * A concurrent hash table stores a collection of fixed-size objects.
+ * From the point of view of this module, such objects are merely an
+ * opaque array of bytes, but the caller will typically implement them as
+ * a C "struct". Some fixed-size, leading portion of each object is
+ * designated as the key, which must be distinct for all objects in the
+ * collection. Since PostgreSQL's shared memory model does not permit
+ * dynamic shared-memory allocation, we preallocate shared-memory space
+ * for the maximum number of entities which can be stored (plus a few
+ * extra, for reasons that will be further explained below). This space
+ * is allocated as a single large array called the arena, and we often
+ * refer to entities by their position in the arena rather than via an
+ * ordinary pointer. This saves a considerable amount of memory, since
+ * most modern architectures are 64-bit and therefore use 8-byte pointers,
+ * while arena offsets can be stored in a 32-bit word. In fact, we
+ * reserve one bit in each such word as a mark bit, so the maximum size
+ * of the arena is 2^31 elements, a restriction that does not currently
+ * appear to be problematic. An additional advantage of this representation
+ * is that aligned 32-bit loads and stores are atomic on all architectures
+ * we support, but 64-bit loads and stores are not.
+ *
+ * When an element is inserted, we copy the data from the backend-private
+ * object supplied by the caller into one of these shared-memory entities.
+ * When the hash table is searched, the caller passes a backend-private
+ * entity with just the key filled in; if a matching element is found,
+ * data is copied from the shared memory entity into the non-key portion
+ * of the user-supplied entity. In this way, clients of this module
+ * never use pointers into shared memory directly.
+ *
+ * As normal, we structure the hash table as an array of buckets, whose
+ * size is always a power of two, so that the low-order bytes of the
+ * hash code can be used to select a bucket. If multiple entities has
+ * to the same bucket, we use separate chaining: each entity in the
+ * arena has an 8-byte header that stores the 4-byte arena offset of the
+ * next item in the bucket and the hash value of the entity's key.
+ * Bucket chains are maintained in order by ascending hash value and
+ * then by ascending entity key (as per memcmp) so that there is
+ * precisely one legal location at which a given new item can be inserted
+ * into a bucket.
+ *
+ * Items are inserted into buckets using compare-and-swap (CAS). Thus, this
+ * module cannot be used on architectures where we do not have 4-byte
+ * compare-and-swap. When an item is deleted, we first set its mark bit,
+ * which is stored within the next-pointer, again using CAS. Once this
+ * step is completed, the node is deleted. As a cleanup operation, we then
+ * use CAS to modify the next-pointer of the previous node to point to the
+ * node following the deleted node. Note that, even once this cleanup
+ * operation has been performed, some other backend concurrently walking the
+ * chain might still be visiting the deleted node. Thus, we must be certain
+ * not to overwrite the deleted node's next-pointer until all concurrent
+ * bucket scans have completed. This means, in particular, that we cannot
+ * immediately view the deleted node as available for reuse.
+ *
+ * Instead, when a delete-marked node is removed from the bucket chain, it
+ * is added to one of many garbage lists. There is a many-to-one mapping from
+ * buckets to garbage lists, so that the choice of bucket determines the
+ * garbage list but not vice versa. Any process which wishes to scan a bucket
+ * must first advertise in shared memory the corresponding garbage list number.
+ * When a backend wishes to move entries from a garbage list to a free list,
+ * it must first wait (by spinning) for any backends scanning that garbage
+ * list to complete their scans.
+ *
+ * Ideally, it would be nice to make this completely lock-free, but because
+ * of the above-described choice of garbage collection algorithm, it currently
+ * isn't. For an algorithm to be lock-free, it must be possible to suspend
+ * the execution of any number of processes for an arbitrary period of time
+ * without impeding the overall progress of the system. For this code, that
+ * is true except when garbage collection occurs. In that case, an insert,
+ * search, or delete operation can obstruct an inserting thread attempting to
+ * perform garbage collection for an unbounded period of time. The algorithm
+ * could be adapted so as to be completely lock-free, essentially by guaranteeing
+ * that even in the worst case no combination of processes can obstruct garbage
+ * collection to a sufficient degree as to prevent an inserter from finding
+ * an available entry in a hash table containing fewer live elements than its
+ * declared maximum capacity. However, it's not clear that this is a good
+ * idea, because it would likely slow down operation in the non-contended
+ * case to solve a problem that we hope won't happen anyway.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ * src/backend/utils/hash/chash.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "miscadmin.h"
+#include "access/hash.h"
+#include "storage/barrier.h"
+#include "storage/proc.h"
+#include "storage/shmem.h"
+#include "utils/chash.h"
+#include "utils/memutils.h"
+
+/*
+ * CHashPtr represents an offset into the arena, plus a mark bit that is
+ * used to implement concurrent deletion.
+ */
+typedef uint32 CHashPtr;
+#define InvalidCHashPtr ((uint32) -2)
+#define CHashPtrIsInvalid(x) ((x) >= InvalidCHashPtr)
+#define CHashPtrIsMarked(x) ((x) & 1)
+#define CHashPtrGetOffset(x) ((x) >> 1)
+#define CHashPtrMark(x) ((x) | 1)
+#define CHashPtrUnmark(x) ((x) & ~1)
+#define MakeCHashPtr(x) ((x) << 1)
+#define CHashMaxCapacity CHashPtrGetOffset(InvalidCHashPtr)
+
+/*
+ * Each object stored in the hash table is represented by a CHashNode, which
+ * stores a pointer to the next item in the same bucket, and the exact hash
+ * value of the current item. Each CHashNode is followed by space for the
+ * item itself.
+ */
+typedef struct
+{
+ CHashPtr next; /* arena offset of next element */
+ union
+ {
+ uint32 hashcode; /* hash(key) */
+ CHashPtr gcnext; /* arena offset of next garbage item */
+ } un;
+} CHashNode;
+
+#define SizeOfCHashNode MAXALIGN(sizeof(CHashNode))
+#define CHashNodeGetItem(x) (((char *) x) + SizeOfCHashNode)
+
+/*
+ * CHashTableData stores all the information that we need in order to access
+ * a concurrent hash table. We store one copy of this data in shared memory,
+ * and an additional copy in the private memory of each backend accessing the
+ * table.
+ */
+typedef struct CHashTableData
+{
+ /*
+ * These fields do not change after initialization.
+ */
+ CHashDescriptor desc; /* descriptor for this hash table */
+ uint32 bucket_mask; /* # of buckets, minus one */
+ uint8 garbage_shift; /* log2(# of buckets/# of garbage lists) */
+ uint8 freelist_shift; /* log2(# of garbage lists/# freelists) */
+ uint16 nfreelists; /* # of freelists */
+ uint32 arena_limit; /* # of arena elements */
+ uint32 arena_stride; /* bytes allocated per arena element */
+ CHashPtr *bucket; /* array with 1 entry per bucket */
+ CHashPtr *extra; /* entries for garbage and free lists */
+ char *arena; /* arena */
+
+ /*
+ * These fields will be different in each backend; the shared copy is
+ * irrelevant.
+ */
+ int gc_pid; /* PID that set gc_next */
+ uint32 gc_next; /* next garbage list to reclaim */
+ uint64 stats[CHS_NumberOfStatistics]; /* statistics */
+} CHashTableData;
+
+/* Compute # of buckets, garbage lists, or free lists. */
+#define CHashTableNBuckets(table) \
+ ((table)->bucket_mask + 1)
+#define CHashTableNGarbage(table) \
+ (CHashTableNBuckets((table)) >> (table)->garbage_shift)
+#define CHashTableNFreeLists(table) \
+ ((table)->nfreelists)
+
+/*
+ * Garbage lists and free lists are interleaved to reduce cache line
+ * contention on the free lists, so the calculation of where an individual
+ * list is located is a bit complex.
+ */
+#define CHashTableGetGarbageList(table, n) \
+ (&(table)->extra[(n) + ((n) >> (table)->freelist_shift)])
+#define CHashTableGetGarbageByBucket(table, n) \
+ (CHashTableGetGarbageList((table), (n) >> (table)->garbage_shift))
+#define CHashTableGetFreeList(table, n) \
+ (&(table)->extra[(n) + (((n) + 1) << (table)->freelist_shift)])
+
+/* Access macros for arena nodes. */
+#define CHashTableGetRaw(table, offset) \
+ (AssertMacro((offset) < (table)->arena_limit), \
+ (CHashNode *) ((table)->arena + (table)->arena_stride * (offset)))
+#define CHashTableGetNode(table, ptr) \
+ (AssertMacro(!CHashPtrIsInvalid(ptr)), \
+ CHashTableGetRaw((table), CHashPtrGetOffset((ptr))))
+
+/* Statistics macros. */
+#define CHashTableIncrementStatistic(table, stat) \
+ ((table)->stats[(stat)]++)
+
+/*
+ * Bucket scan.
+ */
+typedef struct
+{
+ CHashPtr target;
+ CHashPtr next;
+ CHashPtr *pointer_to_target;
+ CHashNode *target_node;
+ bool found;
+} CHashScanResult;
+
+/* Human-readable statistics names. */
+char *CHashStatisticsNames[] = {
+ "searches",
+ "searches failed",
+ "inserts",
+ "inserts failed",
+ "inserts retried",
+ "deletions",
+ "deletions failed",
+ "deletions retried",
+ "scan expunges",
+ "scan expunges failed",
+ "scans restarted",
+ "cleanup scans",
+ "allocations failed",
+ "garbage enqueues retried",
+ "garbage dequeues failed",
+ "garbage collections",
+ "garbage collection spins",
+ "garbage collection reclaims skipped",
+ "garbage collection fast reclaims",
+ "garbage collection reclaims retried",
+ "<end>"
+};
+
+/* Function prototypes. */
+static CHashPtr CHashAllocate(CHashTable table);
+static CHashPtr CHashAllocateViaGC(CHashTable table);
+static void CHashAddToGarbage(CHashTable table, uint32 bucket, CHashPtr c);
+static void CHashBucketScan(CHashTable table,
+ CHashPtr *start,
+ uint32 hashcode,
+ const void *key,
+ CHashScanResult *res);
+
+/*
+ * First stage of CHashTable initialization. We fill in all the constants
+ * here, but not the pointers.
+ */
+CHashTable
+CHashBootstrap(CHashDescriptor *desc)
+{
+ CHashTable table;
+ uint32 bucket_shift;
+
+ /* Sanity check. */
+ Assert(!strcmp(CHashStatisticsNames[CHS_NumberOfStatistics], "<end>"));
+
+ /* Allocate table and copy descriptor. */
+ table = MemoryContextAllocZero(TopMemoryContext, sizeof(CHashTableData));
+ memcpy(&table->desc, desc, sizeof(CHashDescriptor));
+
+ /* Sanity checks. */
+ if (desc->capacity < 1 || desc->capacity > CHashMaxCapacity)
+ elog(ERROR, "invalid capacity for concurrent hash");
+ if (desc->key_size < 1 || desc->key_size > desc->element_size)
+ elog(ERROR, "invalid key size for concurrent hash");
+
+ /*
+ * The number of buckets must be a power of two. To avoid (as much as
+ * possible) having to traverse long bucket chains, we aim for a load
+ * factor <= 1.0, so this is a pretty simple calculation: we just find the
+ * smallest power of two greater than or equal to the target capacity.
+ */
+ bucket_shift = fls(desc->capacity) - 1;
+ table->bucket_mask = (1 << bucket_shift) - 1;
+
+ /*
+ * We choose to have one garbage list for every 64 hash table buckets
+ * (that is, garbage_shift = 6) unless there are fewer than 64 buckets in
+ * total, in which case we still have a minimum of one garbage list.
+ *
+ * Increasing the garbage_shift would reduce the likelihood of a backend
+ * performing garbage collection needing to wait for a backend walking a
+ * bucket to finish its scan. On the other hand, decreasing the garbage
+ * shift would allow more items to be recovered in a single garbage
+ * collection cycle. It's not clear what the optimal value is.
+ */
+ table->garbage_shift = Min(bucket_shift, 6);
+ table->gc_next = 0;
+ table->gc_pid = 0;
+
+ /*
+ * Experimentation reveals that the free list manipulation is both one of
+ * the slowest parts of this algorithm and the most vulnerable to
+ * contention. Therefore, we want to have as many free lists as possible,
+ * but there's no need to have more than the number of CPU cores, so we
+ * limit the number of freelists to 64. There might be a benefit in some
+ * larger limit on a really big system, but we'll cap it here pending some
+ * actual test results. We're also limited to having no more freelists
+ * than we do garbage lists.
+ */
+#define LOG2_MAX_FREELIST 6
+ if (bucket_shift - table->garbage_shift < LOG2_MAX_FREELIST)
+ table->freelist_shift = 0;
+ else
+ table->freelist_shift =
+ (bucket_shift - table->garbage_shift) - LOG2_MAX_FREELIST;
+ table->nfreelists =
+ 1 << (bucket_shift - (table->garbage_shift + table->freelist_shift));
+
+ /*
+ * To make garbage collection efficient, we overallocate. Normally, we
+ * overallocate by one-eighth, but if that would be less than 15 elements,
+ * then we allocate 15 elements instead. This extra capacity can actually
+ * be used, but for best performance, it shouldn't be. It's the caller's
+ * responsibility to avoid this.
+ */
+ table->arena_limit = desc->capacity;
+ if (desc->capacity < 120)
+ table->arena_limit += 15;
+ else
+ table->arena_limit += table->arena_limit / 8;
+
+ /* Each arena element must be MAXALIGN'd and include per-node space. */
+ table->arena_stride = SizeOfCHashNode + MAXALIGN(desc->element_size);
+
+ /* Zero out statistics. */
+ memset(table->stats, 0, sizeof(uint64) * CHS_NumberOfStatistics);
+
+ return table;
+}
+
+/*
+ * Estimate shared memory requirements.
+ */
+Size
+CHashEstimateSize(CHashTable table)
+{
+ Size total_buckets;
+ Size size;
+ Size nbuckets = CHashTableNBuckets(table);
+ Size ngarbage = CHashTableNGarbage(table);
+ Size nfreelists = CHashTableNFreeLists(table);
+
+ Assert(nbuckets > 0 && ngarbage > 0 && nfreelists > 0);
+ total_buckets = add_size(nbuckets, ngarbage);
+ total_buckets = add_size(total_buckets, nfreelists);
+
+ size = MAXALIGN(sizeof(CHashTableData));
+ size = add_size(size, mul_size(sizeof(CHashPtr), total_buckets));
+ size = add_size(size, mul_size(table->arena_stride, table->arena_limit));
+
+ return size;
+}
+
+/*
+ * Create a concurrent hash table in shared memory, or attach to an existing
+ * table.
+ */
+CHashTable
+CHashInitialize(CHashTable table, CHashDescriptor *desc)
+{
+ Size size;
+ bool found;
+ void *shmem;
+ uint32 i;
+ uint32 nbuckets;
+ uint32 nfreelists;
+ uint32 ngarbage;
+ uint32 nextra;
+
+ /*
+ * If we're under the postmaster, this must be the EXEC_BACKEND case where
+ * we need to attach to an existing shared-memory segment.
+ */
+ if (IsUnderPostmaster)
+ {
+ void *shmem;
+
+ Assert(table == NULL);
+ table = MemoryContextAlloc(TopMemoryContext, sizeof(CHashTableData));
+ shmem = ShmemAttachStruct(desc->shmem_name);
+ memcpy(table, shmem, sizeof(CHashTableData));
+ return table;
+ }
+
+ /*
+ * Otherwise, the hash table should not already exist, and we must
+ * create it. But the table should already be bootstrapped, since we
+ * must previously have computed its size when figuring out our shared
+ * memory allocation.
+ */
+ Assert(table != NULL);
+ size = CHashEstimateSize(table);
+ shmem = ShmemInitStruct(table->desc.shmem_name, size, &found);
+ Assert(!found);
+
+ /* Bucket, garbage, and freelist arrays follow table info. */
+ table->bucket = (CHashPtr *)
+ (((char *) shmem) + MAXALIGN(sizeof(CHashTableData)));
+ nbuckets = CHashTableNBuckets(table);
+ table->extra = &table->bucket[nbuckets];
+
+ /* Arena follows the various lists. */
+ ngarbage = CHashTableNGarbage(table);
+ nfreelists = CHashTableNFreeLists(table);
+ nextra = ngarbage + nfreelists;
+ table->arena = (void *) (&table->extra[nextra]);
+
+ /* Initialize all three sets of lists to empty. */
+ for (i = 0; i < nbuckets; ++i)
+ table->bucket[i] = InvalidCHashPtr;
+ for (i = 0; i < nextra; ++i)
+ table->extra[i] = InvalidCHashPtr;
+
+ /* Put all arena elements on the free lists. */
+ for (i = 0; i < table->arena_limit; ++i)
+ {
+ CHashPtr *f = CHashTableGetFreeList(table, i % nfreelists);
+ CHashNode *n = CHashTableGetRaw(table, i);
+
+ n->un.gcnext = *f;
+ *f = MakeCHashPtr(i);
+ }
+
+ /*
+ * Copy table (with pointers now filled in) to shared memory. This is
+ * arguably unnecessary when not using EXEC_BACKEND, but we do it anyway.
+ */
+ memcpy(shmem, table, sizeof(CHashTableData));
+
+ return table;
+}
+
+/*
+ * Search a concurrent hash table. entry should be a block of memory large
+ * enough to hold a complete entry, with just the key portion filled in. If
+ * a matching entry is found, this function will fill in the rest of the entry
+ * from the data in the hash table and return true. If not, it will return
+ * false.
+ */
+bool
+CHashSearch(CHashTable table, void *entry)
+{
+ uint32 hashcode = hash_any(entry, table->desc.key_size);
+ uint32 bucket = hashcode & table->bucket_mask;
+ CHashPtr *b = &table->bucket[bucket];
+ CHashScanResult scan;
+
+ /* Prevent garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] == NULL);
+ MyProc->hazard[0] = CHashTableGetGarbageByBucket(table, bucket);
+ pg_memory_barrier();
+
+ /* Scan bucket and return data from any matching entry. */
+ CHashBucketScan(table, b, hashcode, entry, &scan);
+ if (scan.found)
+ memcpy(((char *) entry) + table->desc.key_size,
+ CHashNodeGetItem(scan.target_node) + table->desc.key_size,
+ table->desc.element_size - table->desc.key_size);
+
+ /* Allow garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] != NULL);
+ pg_memory_barrier();
+ MyProc->hazard[0] = NULL;
+
+ CHashTableIncrementStatistic(table, CHS_Search);
+ if (!scan.found)
+ CHashTableIncrementStatistic(table, CHS_Search_Failed);
+ return scan.found;
+}
+
+/*
+ * Insert into a concurrent hash table. entry should be the complete entry
+ * to be inserted. If no duplicate entry is found in the table, this function
+ * will insert the entry and return true. Otherwise, the duplicate entry's
+ * value will be copied into the supplied entry and the function will return
+ * false.
+ *
+ * The caller is responsible for ensuring that no inserts are performed into
+ * a hash table which is at capacity. The behavior of such an insert is
+ * undefined (the actual behavior is that the insert may either succeed,
+ * degrading performance; or CHashAllocate may enter a tight loop until such
+ * time as an element is deleted).
+ */
+bool
+CHashInsert(CHashTable table, void *entry)
+{
+ uint32 hashcode = hash_any(entry, table->desc.key_size);
+ uint32 bucket = hashcode & table->bucket_mask;
+ CHashPtr *b = &table->bucket[bucket];
+ CHashPtr new;
+ CHashNode *nnew;
+ CHashScanResult scan;
+
+ /*
+ * Allocate and initialize a new entry, on the assumption that the insert
+ * will succeed. If it ends up failing, we must be sure to put this back
+ * on some free list, lest it be permanently leaked.
+ */
+ new = CHashAllocate(table);
+ nnew = CHashTableGetNode(table, new);
+ nnew->un.hashcode = hashcode;
+ memcpy(CHashNodeGetItem(nnew), entry, table->desc.element_size);
+
+ /* Prevent garbage collection for this bucket. */
+ MyProc->hazard[0] = CHashTableGetGarbageByBucket(table, bucket);
+ pg_memory_barrier();
+
+ /*
+ * Scan the bucket. If we don't find a match, use compare-and-swap to
+ * insert the new node at the insert position. If we do find a match,
+ * return the data to the caller.
+ */
+retry:
+ CHashBucketScan(table, b, hashcode, entry, &scan);
+ if (scan.found)
+ memcpy(((char *) entry) + table->desc.key_size,
+ CHashNodeGetItem(scan.target_node) + table->desc.key_size,
+ table->desc.element_size - table->desc.key_size);
+ else
+ {
+ /*
+ * We didn't find a matching element, so use compare-and-swap to
+ * attempt to insert the new element we've prepared. If this fails,
+ * someone concurrently inserted or deleted an element. The correct
+ * insertion point might have changed, or the key we're trying to
+ * insert might now be present when it wasn't before, so we'll have
+ * to search the bucket chain anew.
+ *
+ * There is a risk of starvation here, but we hope it will not happen
+ * in practice. Contention for inserting new elements should be
+ * spread out pretty much evenly across N+M possible insertion points,
+ * where N is the number of buckets and M is the number of elements
+ * in the table. Even for a quite modestly sized table this is likely
+ * to exceed the number of CPU cores.
+ */
+ Assert(!CHashPtrIsMarked(scan.target));
+ nnew->next = scan.target;
+ if (!__sync_bool_compare_and_swap(scan.pointer_to_target,
+ scan.target, new))
+ {
+ CHashTableIncrementStatistic(table, CHS_Insert_Retry);
+ goto retry;
+ }
+ }
+
+ /* Allow garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] != NULL);
+ pg_memory_barrier();
+ MyProc->hazard[0] = NULL;
+
+ /*
+ * If the insert failed, add the entry we allocated to the appropriate
+ * garbage list. We can't simply put this back on the freelist,
+ * because that leads to an ABA problem: some other backend could
+ * read the head of the freelist, go into the tank, and then use
+ * CAS to pop an element. If at that point, it finds the same
+ * element at the head of the freelist but with a different successor,
+ * we're hosed. To prevent that, we ensure that elements are added
+ * to a given freelist only after verifying that any allocations in
+ * progress at the time we popped the freelist have completed. This
+ * guarantees that any allocation still in progress at the time this
+ * element makes it back to the freelist is trying to allocate some
+ * other node.
+ */
+ CHashTableIncrementStatistic(table, CHS_Insert);
+ if (scan.found)
+ {
+ CHashTableIncrementStatistic(table, CHS_Insert_Failed);
+ CHashAddToGarbage(table, bucket, new);
+ }
+
+ /* The insert succeeded if and only if no duplicate was found. */
+ return !scan.found;
+}
+
+/*
+ * Delete from a concurrent hash table. entry need only contain the key field.
+ * Returns true if we find and delete a matching key and false otherwise.
+ */
+bool
+CHashDelete(CHashTable table, void *entry)
+{
+ uint32 hashcode = hash_any(entry, table->desc.key_size);
+ uint32 bucket = hashcode & table->bucket_mask;
+ CHashPtr *b = &table->bucket[bucket];
+ CHashScanResult scan;
+
+ /* Prevent garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] == NULL);
+ MyProc->hazard[0] = CHashTableGetGarbageByBucket(table, bucket);
+ pg_memory_barrier();
+
+ /* Scan bucket. */
+retry:
+ CHashBucketScan(table, b, hashcode, entry, &scan);
+
+ /* If we found it, try to delete it. */
+ if (scan.found)
+ {
+ Assert(!CHashPtrIsMarked(scan.next));
+
+ /* Attempt to apply delete-mark. */
+ if (!__sync_bool_compare_and_swap(&scan.target_node->next,
+ scan.next,
+ CHashPtrMark(scan.next)))
+ {
+ CHashTableIncrementStatistic(table, CHS_Delete_Retry);
+ goto retry;
+ }
+
+ /* Deletion is done; attempt to remove node from list. */
+ if (__sync_bool_compare_and_swap(scan.pointer_to_target,
+ scan.target,
+ scan.next))
+ CHashAddToGarbage(table, bucket, scan.target);
+ else
+ {
+ CHashScanResult cleanup_scan;
+
+ /*
+ * If we weren't able to remove the deleted item, rescan
+ * the bucket to make sure it's really gone. This is just
+ * like a regular bucket scan, except that we don't care
+ * about the results. We're just doing it to achieve the
+ * side-effect of removing delete-marked nodes from the
+ * bucket chain.
+ */
+ CHashTableIncrementStatistic(table, CHS_Cleanup_Scan);
+ CHashBucketScan(table, b, hashcode, entry, &cleanup_scan);
+ }
+ }
+
+ /* Allow garbage collection for this bucket. */
+ Assert(MyProc->hazard[0] != NULL);
+ pg_memory_barrier();
+ MyProc->hazard[0] = NULL;
+
+ /* We're done. */
+ CHashTableIncrementStatistic(table, CHS_Delete);
+ if (!scan.found)
+ CHashTableIncrementStatistic(table, CHS_Delete_Failed);
+ return scan.found;
+}
+
+/*
+ * Provide backend-local statistics to caller.
+ */
+void
+CHashStatistics(CHashTable table, uint64 *stats)
+{
+ memcpy(stats, &table->stats, sizeof(uint64) * CHS_NumberOfStatistics);
+}
+
+/*
+ * Scan one bucket of a concurrent hash table, storing the results in a
+ * CHashResult object provided by the caller.
+ *
+ * Caller must suppress garbage collection for the target bucket before
+ * calling this function, and resume it afterwards.
+ *
+ * On return, res->found will be true if a matching item was found and false
+ * otherwise. res->target will be a CHashPtr referencing the matching item,
+ * or the first one following the position where the matching item should have
+ * been; res->pointer_to_target will be a pointer to the memory address from
+ * which res->target was read.
+ *
+ * If res->target is not invalid, then res->target_node is a pointer to its
+ * location in memory. If res->found, then res->next will be the next pointer
+ * of res->target_node; otherwise, it's undefined.
+ */
+static void
+CHashBucketScan(CHashTable table,
+ CHashPtr *start,
+ uint32 hashcode,
+ const void *key,
+ CHashScanResult *res)
+{
+ CHashPtr target;
+ CHashPtr *pointer_to_target;
+ CHashNode *target_node = NULL;
+
+retry:
+ pointer_to_target = start;
+ target = *pointer_to_target;
+ for (;;)
+ {
+ CHashPtr next;
+ uint32 h;
+ int cmp;
+
+ /*
+ * If we've reached the end of the bucket chain, stop; otherwise,
+ * figure out the actual address of the next item.
+ */
+ if (CHashPtrIsInvalid(target))
+ {
+ res->found = false;
+ break;
+ }
+ target_node = CHashTableGetNode(table, target);
+
+ /*
+ * target may have been fetched from an arena entry that could be
+ * concurrently modified, so a dependency barrier is required before
+ * dereferencing the derived pointer.
+ */
+ pg_read_barrier_depends();
+ next = target_node->next;
+
+ /*
+ * For simplicity, any bucket scan, even if it's servicing a search,
+ * will attempt to remove delete-marked items from the bucket. This
+ * ensures that delete-marked elements are removed from bucket chains
+ * as quickly as possible and reduces code duplication. See
+ * CHashDelete for further comments about why delete-marking is
+ * necessary and how it allows safe deletion.
+ */
+ if (CHashPtrIsMarked(next))
+ {
+zap:
+ if (__sync_bool_compare_and_swap(pointer_to_target,
+ target,
+ CHashPtrUnmark(next)))
+ {
+ /*
+ * No one else can possibly have modified target_node->next,
+ * because such modifications are not allowed after a
+ * delete-mark has been applied. Thus, if we just keep
+ * following the next pointers, we're guaranteed to visit
+ * all non-deleted items (and possibly some deleted items)
+ * that were present at the time we began the scan.
+ */
+ CHashTableIncrementStatistic(table, CHS_Scan_Expunge);
+ CHashAddToGarbage(table, hashcode & table->bucket_mask,
+ target);
+ target = CHashPtrUnmark(next);
+ }
+ else
+ {
+ /*
+ * If the previous node has been delete-marked, we can't
+ * further update the next-pointer, so restart the scan.
+ * Otherwise, we know that some other backend removed one
+ * or more deleted nodes from the bucket chain just as we
+ * were trying to do, and we can simply continue the scan
+ * from wherever the previous node is pointing now. It's
+ * possible that some concurrent inserts have happened, too,
+ * but that's OK; we can view those as having happened "before"
+ * whatever operation this scan is supporting.
+ *
+ * Although starvation is a theoretical possibility here if
+ * we are forced to retry repeatedly, even a single retry is
+ * vanishingly unlikely in practice. It requires some other
+ * backend to delete both the node that we're looking at and
+ * the node which precedes it before we advance to the next
+ * node. That could certainly happen occasionally, but we'd
+ * have to be pretty unlucky to have it happen even twice in
+ * a row.
+ */
+ CHashTableIncrementStatistic(table, CHS_Scan_Expunge_Fail);
+ target = *pointer_to_target;
+ if (CHashPtrIsMarked(target))
+ {
+ CHashTableIncrementStatistic(table, CHS_Scan_Restart);
+ goto retry;
+ }
+ }
+ continue;
+ }
+
+ /*
+ * Bucket chains are kept in order, so that there is exactly one legal
+ * point at which any given key can be inserted. The ordering is by
+ * hashcode first, and then by memcmp ordering of the keys involved.
+ */
+ h = target_node->un.hashcode;
+ if (h == hashcode)
+ cmp = memcmp(CHashNodeGetItem(target_node), key,
+ table->desc.key_size);
+ else if (h > hashcode)
+ cmp = 1;
+ else
+ cmp = -1;
+
+ /*
+ * If cmp < 0, then we haven't yet reached the point at which we'd
+ * expect to find the key, so we must continue the scan. If cmp == 0,
+ * we've found the key and can stop. If cmp > 0, we've either passed
+ * the point where we expect to find the key OR someone delete-marked
+ * the item and overwrote the hashcode with a gcnext pointer. In the
+ * latter case we must take care not to be fooled into stopping the
+ * scan early.
+ */
+ if (cmp >= 0)
+ {
+ if (cmp == 0)
+ {
+ res->found = true;
+ res->next = next;
+ break;
+ }
+ else
+ {
+ /*
+ * pg_read_barrier() prevents the reread of the next pointer
+ * from being speculated ahead of the read of the hash value.
+ */
+ pg_read_barrier();
+ next = target_node->next;
+ if (CHashPtrIsMarked(next))
+ goto zap;
+ res->found = false;
+ break;
+ }
+ }
+
+ /* Continue scan from next node. */
+ pointer_to_target = &target_node->next;
+ target = next;
+ }
+
+ /* Send results back to caller. */
+ res->target = target;
+ res->pointer_to_target = pointer_to_target;
+ res->target_node = target_node;
+}
+
+/*
+ * Allocate an arena slot for a new item to be inserted into a hash table.
+ *
+ * We don't want to wait until every single free-list is completely empty
+ * before beginning to garbage collect, because that could result in very
+ * fast allocation followed by a storm of garbage collection activity.
+ * It could also lead to every inserting backend ganging up on the only
+ * non-empty freelist.
+ *
+ * To avoid that, we check free lists and garbage lists in alternation.
+ * We always start with the same free list - which one is based on our
+ * backend ID - but we try to round-robin over all the available garbage
+ * lists. Whenever we successfully garbage collect, we put the recovered
+ * items on our own free list. In this way, if there's only one backend
+ * active, it will typically find a free buffer in the first place it looks:
+ * its own free list. It will also settle into a pattern of garbage
+ * collecting the garbage list which it has visited least recently, which
+ * is what we want.
+ */
+static CHashPtr
+CHashAllocate(CHashTable table)
+{
+ uint32 f_current;
+ CHashPtr new;
+
+ /* Pick a starting freelist based on our backend ID. */
+ f_current = ((uint32) MyBackendId) % CHashTableNFreeLists(table);
+
+ /* If this process hasn't initialized gc_next yet, do that now. */
+ if (table->gc_pid != MyProcPid)
+ {
+ table->gc_pid = MyProcPid;
+ table->gc_next = ((uint32) MyProcPid) % CHashTableNGarbage(table);
+ }
+
+ /* Loop until we allocate a buffer. */
+ for (;;)
+ {
+ CHashPtr *b;
+
+ /*
+ * Try to pop a buffer from a freelist using compare-and-swap.
+ *
+ * Note that this is only safe if it's impossible for the next pointer
+ * of the first element on the list to change between the time when
+ * we read it and the time we use CAS to pop it off the list. This
+ * means that we can't allow any element that is currently on this
+ * freelist to be allocated, marked as garbage, garbage collected,
+ * and returned back to this freelist before we finish the CAS.
+ *
+ * If we attempt to pop the free-list and fail, we retry immediately
+ * with the same free-list. This reduces the frequency with which
+ * we're obliged to update our hazard pointers, which is a material
+ * savings due to the associated memory barrier.
+ */
+ b = CHashTableGetFreeList(table, f_current);
+ MyProc->hazard[0] = b;
+ pg_memory_barrier();
+ new = *b;
+ while (!CHashPtrIsInvalid(new))
+ {
+ CHashNode *n = CHashTableGetNode(table, new);
+
+ /*
+ * n is computed from table->freelist[f_current], which could
+ * be modified by concurrent activity, so we need a dependency
+ * barrier here.
+ */
+ pg_read_barrier_depends();
+ if (__sync_bool_compare_and_swap(b, new, n->un.gcnext))
+ return new;
+ CHashTableIncrementStatistic(table, CHS_Allocate_Fail);
+ new = *b;
+ }
+
+ /* Attempt garbage collection. */
+ new = CHashAllocateViaGC(table);
+ if (!CHashPtrIsInvalid(new))
+ return new;
+
+ /* Advance to next freelist. */
+ f_current = (f_current + 1) % CHashTableNFreeLists(table);
+ }
+}
+
+/*
+ * Attempt to satisfy an allocation request via garbage collection.
+ */
+static CHashPtr
+CHashAllocateViaGC(CHashTable table)
+{
+ uint32 f_home;
+ CHashPtr *b;
+ CHashPtr *fh;
+ CHashPtr fhead;
+ CHashPtr garbage;
+ CHashPtr new;
+ CHashNode *n;
+ uint32 i;
+
+ /* Pick a target freelist based on our backend ID. */
+ f_home = ((uint32) MyBackendId) % CHashTableNFreeLists(table);
+ fh = CHashTableGetFreeList(table, f_home);
+
+ /* Select target garbage list. */
+ table->gc_next = (table->gc_next + 1) % CHashTableNGarbage(table);
+ b = CHashTableGetGarbageList(table, table->gc_next);
+ garbage = *b;
+
+ /* If list is empty, fail. */
+ if (CHashPtrIsInvalid(garbage))
+ return InvalidCHashPtr;
+
+ /* If we're unable to empty the list via compare-and-swap, fail. */
+ if (!__sync_bool_compare_and_swap(b, garbage, InvalidCHashPtr))
+ {
+ CHashTableIncrementStatistic(table, CHS_Garbage_Dequeue_Fail);
+ return InvalidCHashPtr;
+ }
+
+ /*
+ * Before moving the elements removed from the garbage list to the
+ * freelist, we must wait until (1) all bucket scans that might
+ * still see elements on the freelist as part of the bucket chain
+ * have completed and (2) all allocations that might see an old
+ * version of the freelist containing one of the elements to be
+ * garbage collected have completed.
+ *
+ * Note: We can't begin this operation until the clearing of the
+ * garbage list has been committed to memory, but since that was
+ * done using an atomic operation no explicit barrier is needed
+ * here.
+ *
+ * Note: We could have a "soft" version of this that merely
+ * requeues the garbage if it's not immediately recyclable, but
+ * it's not clear that we need such a thing. On the flip side we
+ * might want to eventually enter a longer sleep here, or PANIC,
+ * but it's not clear exactly how to calibrate that.
+ */
+ CHashTableIncrementStatistic(table, CHS_GC);
+ MyProc->hazard[0] = NULL;
+ for (i = 0; i < ProcGlobal->allProcCount; i++)
+ {
+ volatile PGPROC *proc = &ProcGlobal->allProcs[i];
+ void *hazard;
+
+ hazard = proc->hazard[0];
+ if (hazard == b || hazard == fh)
+ {
+ CHashTableIncrementStatistic(table, CHS_GC_Spin);
+ do
+ {
+ hazard = proc->hazard[0];
+ } while (hazard == b || hazard == fh);
+ }
+ }
+
+ /* Remove one item from list to satisfy current allocation. */
+ new = garbage;
+ n = CHashTableGetNode(table, new);
+ pg_read_barrier_depends();
+ fhead = n->un.gcnext;
+
+ if (CHashPtrIsInvalid(fhead))
+ {
+ /*
+ * We have reclaimed exactly one node, so there's nothing to put
+ * back on the free list. In this case (only) we need a
+ * memory barrier, because the reads above must complete
+ * before we overwrite n->un.gcnext with a new hashcode.
+ * (This is only needed when we reclaim exactly one node,
+ * because in any other case we'll do a compare-and-swap
+ * before returning, which implies a full barrier.)
+ */
+ pg_memory_barrier();
+ CHashTableIncrementStatistic(table, CHS_GC_Reclaim_Skipped);
+ }
+ else if (__sync_bool_compare_and_swap(fh, InvalidCHashPtr, fhead))
+ {
+ /*
+ * Our free list was empty, and we've successfully pushed the
+ * reclaimed nodes onto it. So we're done.
+ */
+ CHashTableIncrementStatistic(table, CHS_GC_Reclaim_Fast);
+ }
+ else
+ {
+ CHashPtr fcurrent;
+ CHashPtr fnext;
+ CHashPtr oldhead;
+
+ /* Walk list of reclaimed elements to end. */
+ fcurrent = fhead;
+ for (;;)
+ {
+ n = CHashTableGetNode(table, fcurrent);
+ fnext = n->un.gcnext;
+ if (CHashPtrIsInvalid(fnext))
+ break;
+ fcurrent = fnext;
+ }
+
+ /* Push reclaimed elements onto home free list. */
+ for (;;)
+ {
+ oldhead = *fh;
+ n->un.gcnext = oldhead;
+ if (__sync_bool_compare_and_swap(fh, oldhead, fhead))
+ break;
+ CHashTableIncrementStatistic(table, CHS_GC_Reclaim_Retry);
+ }
+ }
+
+ /* Return the element we saved for ourselves. */
+ return new;
+}
+
+/*
+ * Add an arena slot to the appropriate garbage list.
+ *
+ * The next garbage collection cycle for the affected bucket will move it
+ * to the free list. We can't do that immediately because there might be
+ * someone traversing the list who is counting on being able to follow the
+ * next pointer. It's OK to clobber the hash value, though, since a spurious
+ * failure to match an already-deleted item shouldn't cause any problems;
+ * this is why gcnext can share space with the hash value.
+ */
+static void
+CHashAddToGarbage(CHashTable table, uint32 bucket, CHashPtr c)
+{
+ CHashPtr g;
+ CHashNode *n;
+ CHashPtr *garbage;
+
+ n = CHashTableGetNode(table, c);
+ garbage = CHashTableGetGarbageByBucket(table, bucket);
+
+ while (1)
+ {
+ g = *garbage;
+ n->un.gcnext = g;
+ if (__sync_bool_compare_and_swap(garbage, g, c))
+ break;
+ CHashTableIncrementStatistic(table, CHS_Garbage_Enqueue_Retry);
+ }
+}
diff --git a/src/include/storage/barrier.h b/src/include/storage/barrier.h
index cd85633..dbcc0f8 100644
--- a/src/include/storage/barrier.h
+++ b/src/include/storage/barrier.h
@@ -20,4 +20,12 @@
*/
#include "port/atomics.h"
+/*
+ * If dependency barriers are undefined, we define them as a no-op. The only
+ * known platform where anything more is required is DEC Alpha.
+ */
+#if !defined(pg_read_barrier_depends)
+#define pg_read_barrier_depends()
+#endif
+
#endif /* BARRIER_H */
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 9b8ace5..b58af88 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -96,20 +96,6 @@ typedef struct buftag
)
/*
- * The shared buffer mapping table is partitioned to reduce contention.
- * To determine which partition lock a given tag requires, compute the tag's
- * hash code with BufTableHashCode(), then apply BufMappingPartitionLock().
- * NB: NUM_BUFFER_PARTITIONS must be a power of 2!
- */
-#define BufTableHashPartition(hashcode) \
- ((hashcode) % NUM_BUFFER_PARTITIONS)
-#define BufMappingPartitionLock(hashcode) \
- (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
- BufTableHashPartition(hashcode)].lock)
-#define BufMappingPartitionLockByIndex(i) \
- (&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)
-
-/*
* BufferDesc -- shared descriptor/state data for a single shared buffer.
*
* Note: buf_hdr_lock must be held to examine or change the tag, flags,
@@ -196,14 +182,6 @@ extern void StrategyNotifyBgWriter(int bgwprocno);
extern Size StrategyShmemSize(void);
extern void StrategyInitialize(bool init);
-/* buf_table.c */
-extern Size BufTableShmemSize(int size);
-extern void InitBufTable(int size);
-extern uint32 BufTableHashCode(BufferTag *tagPtr);
-extern int BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
-extern int BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
-
/* localbuf.c */
extern void LocalPrefetchBuffer(SMgrRelation smgr, ForkNumber forkNum,
BlockNumber blockNum);
@@ -215,4 +193,8 @@ extern void DropRelFileNodeLocalBuffers(RelFileNode rnode, ForkNumber forkNum,
extern void DropRelFileNodeAllLocalBuffers(RelFileNode rnode);
extern void AtEOXact_LocalBuffers(bool isCommit);
+/* bufmgr.c */
+extern Size BufMgrShmemSize(int size);
+extern void BufMgrInitShmem(int size);
+
#endif /* BUFMGR_INTERNALS_H */
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index e3c2efc..1b37447 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -144,7 +144,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
*/
/* Number of partitions of the shared buffer mapping hashtable */
-#define NUM_BUFFER_PARTITIONS 128
+#define NUM_BUFFER_PARTITIONS 0
/* Number of partitions the shared lock tables are divided into */
#define LOG2_NUM_LOCK_PARTITIONS 4
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index d194f38..8d9b4dd 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -59,6 +59,14 @@ struct XidCache
#define FP_LOCK_SLOTS_PER_BACKEND 16
/*
+ * Some lock-free algorithms require each backend process to be able to
+ * advertise a certain number of "hazard pointers" in shared memory.
+ * Right now one per backend is enough for our purpose, but some
+ * algorithms require more.
+ */
+#define NUM_HAZARD_POINTERS 1
+
+/*
* Each backend has a PGPROC struct in shared memory. There is also a list of
* currently-unused PGPROC structs that will be reallocated to new backends.
*
@@ -143,6 +151,12 @@ struct PGPROC
bool fpVXIDLock; /* are we holding a fast-path VXID lock? */
LocalTransactionId fpLocalTransactionId; /* lxid for fast-path VXID
* lock */
+
+ /*
+ * Hazard pointers. Currently one is enough, though some algorithms
+ * require a few more.
+ */
+ void *hazard[NUM_HAZARD_POINTERS];
};
/* NOTE: "typedef struct PGPROC PGPROC" appears in storage/lock.h. */
diff --git a/src/include/storage/shmem.h b/src/include/storage/shmem.h
index c94d620..855b65e 100644
--- a/src/include/storage/shmem.h
+++ b/src/include/storage/shmem.h
@@ -40,6 +40,7 @@ extern void InitShmemIndex(void);
extern HTAB *ShmemInitHash(const char *name, long init_size, long max_size,
HASHCTL *infoP, int hash_flags);
extern void *ShmemInitStruct(const char *name, Size size, bool *foundPtr);
+extern void *ShmemAttachStruct(const char *name);
extern Size add_size(Size s1, Size s2);
extern Size mul_size(Size s1, Size s2);
diff --git a/src/include/utils/chash.h b/src/include/utils/chash.h
new file mode 100644
index 0000000..ee0573c
--- /dev/null
+++ b/src/include/utils/chash.h
@@ -0,0 +1,69 @@
+/*-------------------------------------------------------------------------
+ *
+ * chash.h
+ * Concurrent shared-memory hash table.
+ *
+ * Portions Copyright (c) 1996-2012, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/utils/chash.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHASH_H
+#define CHASH_H
+
+/* Everything caller must supply to set up a concurrent hash table. */
+typedef struct
+{
+ const char *shmem_name; /* shared memory name for this hash table */
+ uint32 capacity; /* maximum size of hash table */
+ uint16 element_size; /* size of each element */
+ uint16 key_size; /* size of each key */
+} CHashDescriptor;
+
+/* Concurrent hash table statistics. */
+typedef enum
+{
+ CHS_Search, /* search */
+ CHS_Search_Failed, /* search failed (no such key) */
+ CHS_Insert, /* insert */
+ CHS_Insert_Failed, /* insert failed (duplicate key) */
+ CHS_Insert_Retry, /* insert retried (concurrent update) */
+ CHS_Delete, /* delete */
+ CHS_Delete_Failed, /* delete failed (no such key) */
+ CHS_Delete_Retry, /* delete retried (concurrent update) */
+ CHS_Scan_Expunge, /* scan expunged deleted item */
+ CHS_Scan_Expunge_Fail, /* scan failed to expunge */
+ CHS_Scan_Restart, /* concurrent deletes forced a scan restart */
+ CHS_Cleanup_Scan, /* concurrent update forced a cleanup scan */
+ CHS_Allocate_Fail, /* allocation failed to pop freelist */
+ CHS_Garbage_Enqueue_Retry, /* enqueue on garbage list retried */
+ CHS_Garbage_Dequeue_Fail, /* dequeue of garbage failed */
+ CHS_GC, /* garbage collection cycle */
+ CHS_GC_Spin, /* GC spun waiting for concurrent process */
+ CHS_GC_Reclaim_Skipped, /* GC recovered only one item */
+ CHS_GC_Reclaim_Fast, /* GC put garbage on freelist via fast path */
+ CHS_GC_Reclaim_Retry, /* enqueue of garbage on freelist retried */
+ CHS_NumberOfStatistics /* number of statistics */
+} CHashStatisticsType;
+
+/* Human-readable names for statistics. */
+extern char *CHashStatisticsNames[];
+
+/* Opaque handle for a concurrent hash table. */
+struct CHashTableData;
+typedef struct CHashTableData *CHashTable;
+
+/* Initialization functions. */
+extern CHashTable CHashBootstrap(CHashDescriptor *desc);
+extern Size CHashEstimateSize(CHashTable table);
+extern CHashTable CHashInitialize(CHashTable table, CHashDescriptor *desc);
+
+/* Accessor functions. */
+extern bool CHashInsert(CHashTable table, void *entry);
+extern bool CHashDelete(CHashTable table, void *key);
+extern bool CHashSearch(CHashTable table, void *entry);
+extern void CHashStatistics(CHashTable table, uint64 *stats);
+
+#endif /* CHASH_H */
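As a minimal, hypothetical usage sketch of the API declared above (not part
of the patch; the descriptor values, the MyEntry layout, and the My* function
names are invented for illustration), a caller would bootstrap the table while
sizing shared memory, create or attach it during shared-memory setup, and then
fill in just the key before calling CHashSearch:

    #include "postgres.h"
    #include "utils/chash.h"

    typedef struct
    {
        uint32      key;        /* lookup key; must come first to match key_size */
        uint32      value;      /* payload, filled in by CHashSearch on a hit */
    } MyEntry;

    static CHashDescriptor my_desc = {
        "my chash table",       /* shmem_name */
        1024,                   /* capacity */
        sizeof(MyEntry),        /* element_size */
        sizeof(uint32)          /* key_size */
    };

    static CHashTable my_table;

    /* Called while computing the size of shared memory. */
    static Size
    MyTableShmemSize(void)
    {
        my_table = CHashBootstrap(&my_desc);
        return CHashEstimateSize(my_table);
    }

    /* Called while creating (or, with EXEC_BACKEND, attaching to) shared memory. */
    static void
    MyTableShmemInit(void)
    {
        my_table = CHashInitialize(my_table, &my_desc);
    }

    /* Fill in only the key; CHashSearch copies the rest of a matching entry. */
    static bool
    MyTableLookup(uint32 key, uint32 *value)
    {
        MyEntry     e;

        e.key = key;
        if (!CHashSearch(my_table, &e))
            return false;
        *value = e.value;
        return true;
    }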
On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This developed a slight merge conflict. I've rebased the attached
version, and I also took the step of getting rid of buf_table.c, as I
think I proposed somewhere upthread. This avoids the overhead of
constructing a BufferTag only to copy it into a BufferLookupEnt, plus
some function calls and so forth. A quick-and-dirty test suggests
this might not have cut down on the 1-client overhead much, but I
think it's worth doing anyway: it's certainly saving a few cycles, and
I don't think it's complicating anything measurably.
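As a rough, hypothetical sketch of what that looks like (not taken from the
patch; the BufferLookupEnt layout, the BufLookup helper, and the SharedBufHash
name are assumed here), the lookup entry's tag can be initialized in place and
handed straight to CHashSearch, instead of building a BufferTag and then
copying it into a separate lookup struct:

    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/smgr.h"
    #include "utils/chash.h"

    typedef struct
    {
        BufferTag   key;        /* which page? (key; must come first) */
        int         id;         /* which buffer holds it? (payload) */
    } BufferLookupEnt;

    extern CHashTable SharedBufHash;    /* assumed name for the buffer mapping table */

    static int
    BufLookup(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum)
    {
        BufferLookupEnt ent;

        /* Initialize the key portion of the entry in place. */
        INIT_BUFFERTAG(ent.key, smgr->smgr_rnode.node, forkNum, blockNum);

        /* On a hit, CHashSearch fills in ent.id from the shared table. */
        if (CHashSearch(SharedBufHash, &ent))
            return ent.id;
        return -1;              /* page is not in shared buffers */
    }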
So here are median-of-three results for 5-minute read-only pgbench
runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
various client counts.
clients - master@4b2a25 - master+chash-buftable-v2
1 8319.254199 8105.479632
2 15485.561948 15138.227533
3 23690.186576 23264.702784
4 32157.346740 31536.870881
5 40879.532747 40455.794841
6 49063.279970 49625.681573
7 57518.374517 57492.275197
8 65351.415323 65622.211763
16 126166.416498 126668.793798
24 155727.916112 155587.414299
32 180329.012019 179543.424754
40 201384.186317 200109.614362
48 222352.265595 225688.574611
56 240400.659338 254398.144976
64 253699.031075 266624.224699
72 261198.336625 270652.578322
80 264569.862257 270332.738084
That's extremely unimpressive. You (Andres) showed a much bigger
benefit on a different machine with a much higher scale factor (5000)
so I don't know whether the modest benefit here is because of the
different machine, the different scale factor, or what.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2015-01-27 10:36:35 -0500, Robert Haas wrote:
On Mon, Jan 26, 2015 at 11:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
This developed a slight merge conflict. I've rebased the attached
version, and I also took the step of getting rid of buf_table.c, as I
think I proposed somewhere upthread. This avoids the overhead of
constructing a BufferTag only to copy it into a BufferLookupEnt, plus
some function calls and so forth. A quick-and-dirty test suggests
this might not have cut down on the 1-client overhead much, but I
think it's worth doing anyway: it's certainly saving a few cycles, and
I don't think it's complicating anything measurably.
So here are median-of-three results for 5-minute read-only pgbench
runs at scale factor 1000, shared_buffers = 8GB, on hydra (POWER7) at
various client counts.
That's extremely unimpressive. You (Andres) showed a much bigger
benefit on a different machine with a much higher scale factor (5000)
so I don't know whether the modest benefit here is because of the
different machine, the different scale factor, or what.
Based on my test on hydra some time back the bottleneck is nearly
entirely in GetSnapshotData() at higher client counts. So I'm not that
surprised you don't see the big benefit here. IIRC on hydra the results
for using a large enough shared_buffers setting for a fully cached run
at that scale are pretty close to your master results - so there's really
not much performance increase to be expected by making the buf table
lockless.
I guess you would see a slightly bigger difference at a different
shared_buffers/scale combination. IIRC a scale-1000 database is about 15GB,
so about half of the data set fits in shared buffers. In the scale-5k
results I posted it was about 1/5.
I had also tested on a four-socket x86 machine where GetSnapshotData()
was a big bottleneck, but not the only one.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jan 27, 2015 at 9:50 AM, Robert Haas <robertmhaas@gmail.com> wrote:
This developed a slight merge conflict. I've rebased the attached
version, and I also took the step of getting rid of buf_table.c, as I
think I proposed somewhere upthread. This avoids the overhead of
constructing a BufferTag only to copy it into a BufferLookupEnt, plus
some function calls and so forth. A quick-and-dirty test suggests
this might not have cut down on the 1-client overhead much, but I
think it's worth doing anyway: it's certainly saving a few cycles, and
I don't think it's complicating anything measurably.
Performance data at some of the configurations is given below.
Configuration and Db Details
----------------------------------------------
IBM POWER-8 24 cores, 192 hardware threads
RAM = 492GB
checkpoint_segments=300
checkpoint_timeout =25min
Client Count = number of concurrent sessions and threads (ex. -c 8 -j 8)
Duration of each individual run = 5min
Scale_factor - 5000
HEAD (commit id - 168a809d)
Below is the data (median of 3 runs) for the pgbench read-only
(-M prepared) configuration:
Shared_buffers = 8GB (tps)

Client Count       1       8      16      32      64     128     256
HEAD           17748  119106  164949  246632  216763  183177  173055
HEAD + patch   17802  119721  167422  298746  457863  422621  356756

Shared_buffers = 16GB (tps)

Client Count       1       8      16      32      64     128     256
HEAD           18139  113265  169217  270172  310936  238490  215308
HEAD + patch   17900  119960  174196  314866  448238  425916  347919
Observations from the data
--------------------------------------
a. It improves the tps by a great margin at higher client counts.
b. As shared_buffers increases, the gain shrinks (the gain at
shared_buffers = 16GB is smaller than at shared_buffers = 8GB).
I think the patch is valuable for such loads even though it shows less
benefit at higher shared_buffers values, although we might want to
verify once that it doesn't fall over at configurations such as
shared_buffers = scale_factor.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com