Improve catcache/syscache performance.

Started by Andres Freund over 8 years ago, 16 messages

#1 Andres Freund
andres@anarazel.de
3 attachment(s)

Hi,

There are plenty of workloads where SearchSysCache()/SearchCatCache()
shows up as a major runtime factor, primarily workloads with very fast
queries.

A fair while ago, before I had my commit bit, I'd posted a patch for
this [1]. Looking at the profiles/benchmarks I was easily able to
confirm that it still helps, but that there's also still a lot left on
the table.

Attached is a patch that tries to improve sys/catcache performance,
going further than the patch referenced earlier.

This primarily includes four pieces:

1) Avoidance of FunctionCallInfo-based function calls, replaced by
more efficient functions with a native C argument interface.
2) Only initializing the ScanKey when necessary, i.e. on catcache
misses, reducing unnecessary CPU cache misses.
3) Allowing the compiler to specialize the critical SearchCatCache
code for a specific number of attributes, which lets it unroll loops
and avoid other nkeys-dependent initialization.
4) Splitting the heap lookup from the hash lookup, reducing stack
allocations etc. in the common case.

There's further potential:
- replace the open-coded hash with simplehash - the list walk right now
shows up in profiles.
- as oid is the only system column supported, avoid the use of
heap_getsysattr() by adding an explicit branch for
ObjectIdAttributeNumber; this shows up in profiles.
- move cache initialization out of the search path.
- add proper functions, rather than macros, for
SearchSysCacheCopyN etc. - but right now they don't show up in profiles.

The reason the macro wrappers in syscache.c/h have to be changed,
rather than just catcache, is that doing otherwise would require
exposing the SysCache array to the outside. That might be a good idea
anyway, but it's for another day.

This patch gives me a roughly 8% speedup in a workload that consists
of a fast query that returns a lot of columns. If I apply a few
other performance patches, this patch itself starts to make a bigger
difference, of around 11%.

Greetings,

Andres Freund

[1]: /messages/by-id/20130905191323.GC490889@alap2.anarazel.de

Attachments:

0004-Add-inline-murmurhash32-int32-function.patch (text/x-diff; charset=us-ascii)
From 2b3e06380d5a339efc94e748aa57985d3bb80223 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 13 Sep 2017 18:43:46 -0700
Subject: [PATCH 4/8] Add inline murmurhash32(int32) function.

The function already existed in tidbitmap.c but more users requiring
fast hashing of 32bit ints are coming up.
---
 src/backend/nodes/tidbitmap.c | 20 ++------------------
 src/include/utils/hashutils.h | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index c4e53adb0c..01d6bc5c11 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -45,6 +45,7 @@
 #include "nodes/tidbitmap.h"
 #include "storage/lwlock.h"
 #include "utils/dsa.h"
+#include "utils/hashutils.h"
 
 /*
  * The maximum number of tuples per page is not large (typically 256 with
@@ -237,30 +238,13 @@ static int	tbm_comparator(const void *left, const void *right);
 static int tbm_shared_comparator(const void *left, const void *right,
 					  void *arg);
 
-/*
- * Simple inline murmur hash implementation for the exact width required, for
- * performance.
- */
-static inline uint32
-hash_blockno(BlockNumber b)
-{
-	uint32		h = b;
-
-	h ^= h >> 16;
-	h *= 0x85ebca6b;
-	h ^= h >> 13;
-	h *= 0xc2b2ae35;
-	h ^= h >> 16;
-	return h;
-}
-
 /* define hashtable mapping block numbers to PagetableEntry's */
 #define SH_USE_NONDEFAULT_ALLOCATOR
 #define SH_PREFIX pagetable
 #define SH_ELEMENT_TYPE PagetableEntry
 #define SH_KEY_TYPE BlockNumber
 #define SH_KEY blockno
-#define SH_HASH_KEY(tb, key) hash_blockno(key)
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
 #define SH_EQUAL(tb, a, b) a == b
 #define SH_SCOPE static inline
 #define SH_DEFINE
diff --git a/src/include/utils/hashutils.h b/src/include/utils/hashutils.h
index 56b7bfc9cb..35281689e8 100644
--- a/src/include/utils/hashutils.h
+++ b/src/include/utils/hashutils.h
@@ -20,4 +20,22 @@ hash_combine(uint32 a, uint32 b)
 	return a;
 }
 
+
+/*
+ * Simple inline murmur hash implementation hashing a 32 bit integer, for
+ * performance.
+ */
+static inline uint32
+murmurhash32(uint32 data)
+{
+	uint32		h = data;
+
+	h ^= h >> 16;
+	h *= 0x85ebca6b;
+	h ^= h >> 13;
+	h *= 0xc2b2ae35;
+	h ^= h >> 16;
+	return h;
+}
+
 #endif							/* HASHUTILS_H */
-- 
2.14.1.536.g6867272d5b.dirty

0006-Add-pg_noinline-macro-to-c.h.patch (text/x-diff; charset=us-ascii)
From 0feaa0c4a39b3e0e995cd5897cfd3ebba6a92c48 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 13 Sep 2017 19:58:43 -0700
Subject: [PATCH 6/8] Add pg_noinline macro to c.h.

Forcing a function not to be inlined can be useful if it's the
slow-path of a performance critical function, or should be visible in
profiles to allow for proper cost attribution.

Author: Andres Freund
Discussion: https://postgr.es/m/
---
 src/include/c.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/include/c.h b/src/include/c.h
index 630dfbfc41..087b1eac5b 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -644,6 +644,22 @@ typedef NameData *Name;
 #define pg_attribute_noreturn()
 #endif
 
+
+/*
+ * Forcing a function not to be inlined can be useful if it's the slow-path of
+ * a performance critical function, or should be visible in profiles to allow
+ * for proper cost attribution.
+ */
+/* GCC, Sunpro and XLC support noinline via __attribute */
+#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
+#define pg_noinline __attribute__((noinline))
+/* msvc via declspec */
+#elif defined(_MSC_VER)
+#define pg_noinline __declspec(noinline)
+#else
+#define pg_noinline
+#endif
+
 /* ----------------------------------------------------------------
  *				Section 6:	assertions
  * ----------------------------------------------------------------
-- 
2.14.1.536.g6867272d5b.dirty

0007-Improve-sys-catcache-performance.patch (text/x-diff; charset=us-ascii)
From 334fd03974d01b389c6d615345522df207c60c74 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 11 Sep 2017 18:25:39 -0700
Subject: [PATCH 7/8] Improve sys/catcache performance.

This primarily includes four pieces:

1) Avoidance of FunctionCallInfo-based function calls, replaced by
   more efficient functions with a native C argument interface.
2) Only initializing the ScanKey when necessary, i.e. on catcache
   misses, reducing unnecessary CPU cache misses.
3) Allowing the compiler to specialize the critical SearchCatCache
   code for a specific number of attributes, which lets it unroll
   loops and avoid other nkeys-dependent initialization.
4) Splitting the heap lookup from the hash lookup, reducing stack
   allocations etc in the common case.

There's further potential:
- replace open coded hash with simplehash - the list walk right now
  shows up in profiles.
- As oid is the only system column supported, avoid the use of
  heap_getsysattr(), by adding an explicit branch for
  ObjectIdAttributeNumber. This shows up in profiles.
- move cache initialization out of the search path
- add more proper functions, rather than macros for
  SearchSysCacheCopyN etc., but right now they don't show up in profiles.

The reason the macro wrappers in syscache.c/h have to be changed,
rather than just catcache, is that doing otherwise would require
exposing the SysCache array to the outside.  That might be a good idea
anyway, but it's for another day.

Author: Andres Freund
---
 src/backend/utils/cache/catcache.c | 438 ++++++++++++++++++++++++++-----------
 src/backend/utils/cache/syscache.c |  49 ++++-
 src/include/utils/catcache.h       |  18 +-
 src/include/utils/syscache.h       |  23 +-
 4 files changed, 390 insertions(+), 138 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e092801025..67c596d29b 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -31,6 +31,7 @@
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
 #include "utils/fmgroids.h"
+#include "utils/hashutils.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -72,11 +73,25 @@
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
+											   int nkeys,
+											   Datum v1, Datum v2,
+											   Datum v3, Datum v4);
+
+static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
+												int nkeys,
+												uint32 hashValue,
+												Index hashIndex,
+												Datum v1, Datum v2,
+												Datum v3, Datum v4);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
-							 ScanKey cur_skey);
-static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache,
+							 Datum v1, Datum v2, Datum v3, Datum v4);
+static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
 								  HeapTuple tuple);
+static inline bool CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
+											const HeapTuple tuple,
+											const Datum *arguments);
 
 #ifdef CATCACHE_STATS
 static void CatCachePrintStats(int code, Datum arg);
@@ -95,45 +110,127 @@ static HeapTuple build_dummy_tuple(CatCache *cache, int nkeys, ScanKey skeys);
  */
 
 /*
- * Look up the hash and equality functions for system types that are used
- * as cache key fields.
- *
- * XXX this should be replaced by catalog lookups,
- * but that seems to pose considerable risk of circularity...
+ * Hash and equality functions for system types that are used as cache key
+ * fields.  To compute hashes, and to check for hash collisions, use functions
+ * hardcoded for that purpose. This is sufficiently performance critical that
+ * the overhead of SQL style function calls is noticeable.
  */
+
+static bool
+chareqfast(Datum a, Datum b)
+{
+	return DatumGetChar(a) == DatumGetChar(b);
+}
+
+static uint32
+charhashfast(Datum datum)
+{
+	return murmurhash32((int32) DatumGetChar(datum));
+}
+
+static bool
+nameeqfast(Datum a, Datum b)
+{
+	char	   *ca = NameStr(*DatumGetName(a));
+	char	   *cb = NameStr(*DatumGetName(b));
+
+	return strncmp(ca, cb, NAMEDATALEN) == 0;
+}
+
+static uint32
+namehashfast(Datum datum)
+{
+	char	   *key = NameStr(*DatumGetName(datum));
+
+	return hash_any((unsigned char *) key, strlen(key));
+}
+
+static bool
+int2eqfast(Datum a, Datum b)
+{
+	return DatumGetInt16(a) == DatumGetInt16(b);
+}
+
+static uint32
+int2hashfast(Datum datum)
+{
+	return murmurhash32((int32) DatumGetInt16(datum));
+}
+
+static bool
+int4eqfast(Datum a, Datum b)
+{
+	return DatumGetInt32(a) == DatumGetInt32(b);
+}
+
+static uint32
+int4hashfast(Datum datum)
+{
+	return murmurhash32((int32) DatumGetInt32(datum));
+}
+
+static bool
+texteqfast(Datum a, Datum b)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetBool(DirectFunctionCall2(texteq, a, b));
+}
+
+static uint32
+texthashfast(Datum datum)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetInt32(DirectFunctionCall1(hashtext, datum));
+}
+
+static bool
+oidvectoreqfast(Datum a, Datum b)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetBool(DirectFunctionCall2(oidvectoreq, a, b));
+}
+
+static uint32
+oidvectorhashfast(Datum datum)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetInt32(DirectFunctionCall1(hashoidvector, datum));
+}
+
+/* Lookup support functions for a type. */
 static void
-GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
+GetCCHashEqFuncs(Oid keytype, CCHashFN *hashfunc, RegProcedure *eqfunc, CCFastEqualFN *fasteqfunc)
 {
 	switch (keytype)
 	{
 		case BOOLOID:
-			*hashfunc = hashchar;
-
+			*hashfunc = charhashfast;
+			*fasteqfunc = chareqfast;
 			*eqfunc = F_BOOLEQ;
 			break;
 		case CHAROID:
-			*hashfunc = hashchar;
-
+			*hashfunc = charhashfast;
+			*fasteqfunc = chareqfast;
 			*eqfunc = F_CHAREQ;
 			break;
 		case NAMEOID:
-			*hashfunc = hashname;
-
+			*hashfunc = namehashfast;
+			*fasteqfunc = nameeqfast;
 			*eqfunc = F_NAMEEQ;
 			break;
 		case INT2OID:
-			*hashfunc = hashint2;
-
+			*hashfunc = int2hashfast;
+			*fasteqfunc = int2eqfast;
 			*eqfunc = F_INT2EQ;
 			break;
 		case INT4OID:
-			*hashfunc = hashint4;
-
+			*hashfunc = int4hashfast;
+			*fasteqfunc = int4eqfast;
 			*eqfunc = F_INT4EQ;
 			break;
 		case TEXTOID:
-			*hashfunc = hashtext;
-
+			*hashfunc = texthashfast;
+			*fasteqfunc = texteqfast;
 			*eqfunc = F_TEXTEQ;
 			break;
 		case OIDOID:
@@ -147,13 +244,13 @@ GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
 		case REGDICTIONARYOID:
 		case REGROLEOID:
 		case REGNAMESPACEOID:
-			*hashfunc = hashoid;
-
+			*hashfunc = int4hashfast;
+			*fasteqfunc = int4eqfast;
 			*eqfunc = F_OIDEQ;
 			break;
 		case OIDVECTOROID:
-			*hashfunc = hashoidvector;
-
+			*hashfunc = oidvectorhashfast;
+			*fasteqfunc = oidvectoreqfast;
 			*eqfunc = F_OIDVECTOREQ;
 			break;
 		default:
@@ -171,10 +268,12 @@ GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
  * Compute the hash value associated with a given set of lookup keys
  */
 static uint32
-CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
+CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
+							 Datum v1, Datum v2, Datum v3, Datum v4)
 {
 	uint32		hashValue = 0;
 	uint32		oneHash;
+	CCHashFN   *cc_hashfunc = cache->cc_hashfunc;
 
 	CACHE4_elog(DEBUG2, "CatalogCacheComputeHashValue %s %d %p",
 				cache->cc_relname,
@@ -184,30 +283,26 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
 	switch (nkeys)
 	{
 		case 4:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[3],
-												   cur_skey[3].sk_argument));
+			oneHash = (cc_hashfunc[3])(v4);
+
 			hashValue ^= oneHash << 24;
 			hashValue ^= oneHash >> 8;
 			/* FALLTHROUGH */
 		case 3:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[2],
-												   cur_skey[2].sk_argument));
+			oneHash = (cc_hashfunc[2])(v3);
+
 			hashValue ^= oneHash << 16;
 			hashValue ^= oneHash >> 16;
 			/* FALLTHROUGH */
 		case 2:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[1],
-												   cur_skey[1].sk_argument));
+			oneHash = (cc_hashfunc[1])(v2);
+
 			hashValue ^= oneHash << 8;
 			hashValue ^= oneHash >> 24;
 			/* FALLTHROUGH */
 		case 1:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[0],
-												   cur_skey[0].sk_argument));
+			oneHash = (cc_hashfunc[0])(v1);
+
 			hashValue ^= oneHash;
 			break;
 		default:
@@ -224,63 +319,96 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
  * Compute the hash value associated with a given tuple to be cached
  */
 static uint32
-CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
+CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, HeapTuple tuple)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Datum		v1 = 0, v2 = 0, v3 = 0, v4 = 0;
 	bool		isNull = false;
-
-	/* Copy pre-initialized overhead data for scankey */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
+	int		   *cc_key = cache->cc_key;
+	TupleDesc	cc_tupdesc = cache->cc_tupdesc;
 
 	/* Now extract key fields from tuple, insert into scankey */
-	switch (cache->cc_nkeys)
+	switch (nkeys)
 	{
 		case 4:
-			cur_skey[3].sk_argument =
-				(cache->cc_key[3] == ObjectIdAttributeNumber)
+			v4 = (cc_key[3] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[3],
-							  cache->cc_tupdesc,
+							  cc_key[3],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 3:
-			cur_skey[2].sk_argument =
-				(cache->cc_key[2] == ObjectIdAttributeNumber)
+			v3 = (cc_key[2] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[2],
-							  cache->cc_tupdesc,
+							  cc_key[2],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 2:
-			cur_skey[1].sk_argument =
-				(cache->cc_key[1] == ObjectIdAttributeNumber)
+			v2 = (cc_key[1] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[1],
-							  cache->cc_tupdesc,
+							  cc_key[1],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 1:
-			cur_skey[0].sk_argument =
-				(cache->cc_key[0] == ObjectIdAttributeNumber)
+			v1 = (cc_key[0] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[0],
-							  cache->cc_tupdesc,
+							  cc_key[0],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			break;
 		default:
-			elog(FATAL, "wrong number of hash keys: %d", cache->cc_nkeys);
+			elog(FATAL, "wrong number of hash keys: %d", nkeys);
 			break;
 	}
 
-	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	return CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
+}
+
+/*
+ *		CatalogCacheCompareTuple
+ *
+ * Compare a tuple to the passed arguments.
+ */
+static inline bool
+CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
+						 const HeapTuple tuple,
+						 const Datum *arguments)
+{
+	TupleDesc	tupdesc = cache->cc_tupdesc;
+	const int  *cc_key = cache->cc_key;
+	const CCFastEqualFN *cc_fastequal = cache->cc_fastequal;
+	int i;
+
+	for (i = 0; i < nkeys; i++)
+	{
+		Datum atp;
+		bool isnull;
+
+		/*
+		 * XXX: might be worthwhile to only handle oid sysattr, to reduce
+		 * overhead - it's the most common key.
+		 */
+		atp = heap_getattr(tuple,
+						   cc_key[i],
+						   tupdesc,
+						   &isnull);
+		Assert(!isnull);
+
+		if (!(cc_fastequal[i])(atp, arguments[i]))
+		{
+			return false;
+		}
+	}
+	return true;
 }
 
 
@@ -878,7 +1006,8 @@ CatalogCacheInitializeCache(CatCache *cache)
 
 		GetCCHashEqFuncs(keytype,
 						 &cache->cc_hashfunc[i],
-						 &eqfunc);
+						 &eqfunc,
+						 &cache->cc_fastequal[i]);
 
 		cache->cc_isname[i] = (keytype == NAMEOID);
 
@@ -1020,7 +1149,7 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey)
 }
 
 /*
- *	SearchCatCache
+ *	SearchCatCacheInternal
  *
  *		This call searches a system cache for a tuple, opening the relation
  *		if necessary (on the first access to a particular cache).
@@ -1042,15 +1171,64 @@ SearchCatCache(CatCache *cache,
 			   Datum v3,
 			   Datum v4)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	return SearchCatCacheInternal(cache, cache->cc_nkeys, v1, v2, v3, v4);
+}
+
+
+/*
+ * SearchCatCacheN() are SearchCatCache() versions for a specific number of
+ * arguments. The compiler can inline the body and unroll the loop, making
+ * them a bit faster than SearchCatCache().
+ */
+
+HeapTuple
+SearchCatCache1(CatCache *cache,
+				Datum v1)
+{
+	return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
+}
+
+
+HeapTuple
+SearchCatCache2(CatCache *cache,
+				Datum v1, Datum v2)
+{
+	return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
+}
+
+
+HeapTuple
+SearchCatCache3(CatCache *cache,
+				Datum v1, Datum v2, Datum v3)
+{
+	return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
+}
+
+
+HeapTuple
+SearchCatCache4(CatCache *cache,
+				Datum v1, Datum v2, Datum v3, Datum v4)
+{
+	return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
+}
+
+/*
+ * Work-horse for SearchCatCache/SearchCatCacheN.
+ */
+static inline HeapTuple
+SearchCatCacheInternal(CatCache *cache,
+			   int nkeys,
+			   Datum v1,
+			   Datum v2,
+			   Datum v3,
+			   Datum v4)
+{
+	Datum		arguments[CATCACHE_MAXKEYS];
 	uint32		hashValue;
 	Index		hashIndex;
 	dlist_iter	iter;
 	dlist_head *bucket;
 	CatCTup    *ct;
-	Relation	relation;
-	SysScanDesc scandesc;
-	HeapTuple	ntp;
 
 	/* Make sure we're in an xact, even if this ends up being a cache hit */
 	Assert(IsTransactionState());
@@ -1058,26 +1236,23 @@ SearchCatCache(CatCache *cache,
 	/*
 	 * one-time startup overhead for each cache
 	 */
-	if (cache->cc_tupdesc == NULL)
+	if (unlikely(cache->cc_tupdesc == NULL))
 		CatalogCacheInitializeCache(cache);
 
 #ifdef CATCACHE_STATS
 	cache->cc_searches++;
 #endif
 
-	/*
-	 * initialize the search key information
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
+	/* Initialize local parameter array */
+	arguments[0] = v1;
+	arguments[1] = v2;
+	arguments[2] = v3;
+	arguments[3] = v4;
 
 	/*
 	 * find the hash bucket in which to look for the tuple
 	 */
-	hashValue = CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	hashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
 	hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 
 	/*
@@ -1089,8 +1264,6 @@ SearchCatCache(CatCache *cache,
 	bucket = &cache->cc_bucket[hashIndex];
 	dlist_foreach(iter, bucket)
 	{
-		bool		res;
-
 		ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
 		if (ct->dead)
@@ -1099,15 +1272,7 @@ SearchCatCache(CatCache *cache,
 		if (ct->hash_value != hashValue)
 			continue;			/* quickly skip entry if wrong hash val */
 
-		/*
-		 * see if the cached tuple matches our key.
-		 */
-		HeapKeyTest(&ct->tuple,
-					cache->cc_tupdesc,
-					cache->cc_nkeys,
-					cur_skey,
-					res);
-		if (!res)
+		if (!CatalogCacheCompareTuple(cache, nkeys, &ct->tuple, arguments))
 			continue;
 
 		/*
@@ -1150,6 +1315,42 @@ SearchCatCache(CatCache *cache,
 		}
 	}
 
+	return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4);
+}
+
+/*
+ * Search the actual catalogs, rather than the cache.
+ *
+ * This is kept separate from SearchCatCacheInternal() to keep the fast-path
+ * as small as possible.  To avoid that effort being undone, try to explicitly
+ * forbid inlining.
+ */
+static pg_noinline HeapTuple
+SearchCatCacheMiss(CatCache *cache,
+				   int nkeys,
+				   uint32 hashValue,
+				   Index hashIndex,
+				   Datum v1,
+				   Datum v2,
+				   Datum v3,
+				   Datum v4)
+{
+	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Relation	relation;
+	SysScanDesc scandesc;
+	HeapTuple	ntp;
+	CatCTup    *ct;
+
+	/*
+	 * Ok, need to make a lookup in the relation, copy the scankey and fill out
+	 * any per-call fields.
+	 */
+	memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys);
+	cur_skey[0].sk_argument = v1;
+	cur_skey[1].sk_argument = v2;
+	cur_skey[2].sk_argument = v3;
+	cur_skey[3].sk_argument = v4;
+
 	/*
 	 * Tuple was not found in cache, so we have to try to retrieve it directly
 	 * from the relation.  If found, we will add it to the cache; if not
@@ -1171,7 +1372,7 @@ SearchCatCache(CatCache *cache,
 								  cache->cc_indexoid,
 								  IndexScanOK(cache, cur_skey),
 								  NULL,
-								  cache->cc_nkeys,
+								  nkeys,
 								  cur_skey);
 
 	ct = NULL;
@@ -1207,7 +1408,7 @@ SearchCatCache(CatCache *cache,
 		if (IsBootstrapProcessingMode())
 			return NULL;
 
-		ntp = build_dummy_tuple(cache, cache->cc_nkeys, cur_skey);
+		ntp = build_dummy_tuple(cache, nkeys, cur_skey);
 		ct = CatalogCacheCreateEntry(cache, ntp,
 									 hashValue, hashIndex,
 									 true);
@@ -1288,27 +1489,16 @@ GetCatCacheHashValue(CatCache *cache,
 					 Datum v3,
 					 Datum v4)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
-
 	/*
 	 * one-time startup overhead for each cache
 	 */
 	if (cache->cc_tupdesc == NULL)
 		CatalogCacheInitializeCache(cache);
 
-	/*
-	 * initialize the search key information
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
-
 	/*
 	 * calculate the hash value
 	 */
-	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, v1, v2, v3, v4);
 }
 
 
@@ -1329,7 +1519,7 @@ SearchCatCacheList(CatCache *cache,
 				   Datum v3,
 				   Datum v4)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Datum		arguments[CATCACHE_MAXKEYS];
 	uint32		lHashValue;
 	dlist_iter	iter;
 	CatCList   *cl;
@@ -1354,21 +1544,18 @@ SearchCatCacheList(CatCache *cache,
 	cache->cc_lsearches++;
 #endif
 
-	/*
-	 * initialize the search key information
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
+	/* Initialize local parameter array */
+	arguments[0] = v1;
+	arguments[1] = v2;
+	arguments[2] = v3;
+	arguments[3] = v4;
 
 	/*
 	 * compute a hash value of the given keys for faster search.  We don't
 	 * presently divide the CatCList items into buckets, but this still lets
 	 * us skip non-matching items quickly most of the time.
 	 */
-	lHashValue = CatalogCacheComputeHashValue(cache, nkeys, cur_skey);
+	lHashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
 
 	/*
 	 * scan the items until we find a match or exhaust our list
@@ -1378,8 +1565,6 @@ SearchCatCacheList(CatCache *cache,
 	 */
 	dlist_foreach(iter, &cache->cc_lists)
 	{
-		bool		res;
-
 		cl = dlist_container(CatCList, cache_elem, iter.cur);
 
 		if (cl->dead)
@@ -1393,12 +1578,8 @@ SearchCatCacheList(CatCache *cache,
 		 */
 		if (cl->nkeys != nkeys)
 			continue;
-		HeapKeyTest(&cl->tuple,
-					cache->cc_tupdesc,
-					nkeys,
-					cur_skey,
-					res);
-		if (!res)
+
+		if (!CatalogCacheCompareTuple(cache, nkeys, &cl->tuple, arguments))
 			continue;
 
 		/*
@@ -1441,9 +1622,20 @@ SearchCatCacheList(CatCache *cache,
 
 	PG_TRY();
 	{
+		ScanKeyData cur_skey[CATCACHE_MAXKEYS];
 		Relation	relation;
 		SysScanDesc scandesc;
 
+		/*
+		 * Ok, need to make a lookup in the relation, copy the scankey and fill out
+		 * any per-call fields.
+		 */
+		memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * cache->cc_nkeys);
+		cur_skey[0].sk_argument = v1;
+		cur_skey[1].sk_argument = v2;
+		cur_skey[2].sk_argument = v3;
+		cur_skey[3].sk_argument = v4;
+
 		relation = heap_open(cache->cc_reloid, AccessShareLock);
 
 		scandesc = systable_beginscan(relation,
@@ -1467,7 +1659,7 @@ SearchCatCacheList(CatCache *cache,
 			 * See if there's an entry for this tuple already.
 			 */
 			ct = NULL;
-			hashValue = CatalogCacheComputeTupleHashValue(cache, ntp);
+			hashValue = CatalogCacheComputeTupleHashValue(cache, cache->cc_nkeys, ntp);
 			hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 
 			bucket = &cache->cc_bucket[hashIndex];
@@ -1820,7 +2012,7 @@ PrepareToInvalidateCacheTuple(Relation relation,
 		if (ccp->cc_tupdesc == NULL)
 			CatalogCacheInitializeCache(ccp);
 
-		hashvalue = CatalogCacheComputeTupleHashValue(ccp, tuple);
+		hashvalue = CatalogCacheComputeTupleHashValue(ccp, ccp->cc_nkeys, tuple);
 		dbid = ccp->cc_relisshared ? (Oid) 0 : MyDatabaseId;
 
 		(*function) (ccp->id, hashvalue, dbid);
@@ -1829,7 +2021,7 @@ PrepareToInvalidateCacheTuple(Relation relation,
 		{
 			uint32		newhashvalue;
 
-			newhashvalue = CatalogCacheComputeTupleHashValue(ccp, newtuple);
+			newhashvalue = CatalogCacheComputeTupleHashValue(ccp, ccp->cc_nkeys, newtuple);
 
 			if (newhashvalue != hashvalue)
 				(*function) (ccp->id, newhashvalue, dbid);
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index fcbb683a99..888edbb325 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -1102,13 +1102,56 @@ SearchSysCache(int cacheId,
 			   Datum key3,
 			   Datum key4)
 {
-	if (cacheId < 0 || cacheId >= SysCacheSize ||
-		!PointerIsValid(SysCache[cacheId]))
-		elog(ERROR, "invalid cache ID: %d", cacheId);
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
 
 	return SearchCatCache(SysCache[cacheId], key1, key2, key3, key4);
 }
 
+HeapTuple
+SearchSysCache1(int cacheId,
+				Datum key1)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 1);
+
+	return SearchCatCache1(SysCache[cacheId], key1);
+}
+
+HeapTuple
+SearchSysCache2(int cacheId,
+				Datum key1, Datum key2)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 2);
+
+	return SearchCatCache2(SysCache[cacheId], key1, key2);
+}
+
+HeapTuple
+SearchSysCache3(int cacheId,
+				Datum key1, Datum key2, Datum key3)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 3);
+
+	return SearchCatCache3(SysCache[cacheId], key1, key2, key3);
+}
+
+HeapTuple
+SearchSysCache4(int cacheId,
+				Datum key1, Datum key2, Datum key3, Datum key4)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 4);
+
+	return SearchCatCache4(SysCache[cacheId], key1, key2, key3, key4);
+}
+
 /*
  * ReleaseSysCache
  *		Release previously grabbed reference count on a tuple
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 200a3022e7..360f0c5dd5 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -34,6 +34,10 @@
 
 #define CATCACHE_MAXKEYS		4
 
+
+typedef uint32 (*CCHashFN) (Datum datum);
+typedef bool (*CCFastEqualFN) (Datum a, Datum b);
+
 typedef struct catcache
 {
 	int			id;				/* cache identifier --- see syscache.h */
@@ -47,7 +51,8 @@ typedef struct catcache
 	int			cc_nbuckets;	/* # of hash buckets in this cache */
 	int			cc_nkeys;		/* # of keys (1..CATCACHE_MAXKEYS) */
 	int			cc_key[CATCACHE_MAXKEYS];	/* AttrNumber of each key */
-	PGFunction	cc_hashfunc[CATCACHE_MAXKEYS];	/* hash function for each key */
+	CCHashFN	cc_hashfunc[CATCACHE_MAXKEYS];	/* hash function for each key */
+	CCFastEqualFN cc_fastequal[CATCACHE_MAXKEYS];	/* fast equal function for each key */
 	ScanKeyData cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
 											 * scans */
 	bool		cc_isname[CATCACHE_MAXKEYS];	/* flag "name" key columns */
@@ -174,8 +179,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
 extern void InitCatCachePhase2(CatCache *cache, bool touch_index);
 
 extern HeapTuple SearchCatCache(CatCache *cache,
-			   Datum v1, Datum v2,
-			   Datum v3, Datum v4);
+			   Datum v1, Datum v2, Datum v3, Datum v4);
+extern HeapTuple SearchCatCache1(CatCache *cache,
+			   Datum v1);
+extern HeapTuple SearchCatCache2(CatCache *cache,
+			   Datum v1, Datum v2);
+extern HeapTuple SearchCatCache3(CatCache *cache,
+			   Datum v1, Datum v2, Datum v3);
+extern HeapTuple SearchCatCache4(CatCache *cache,
+			   Datum v1, Datum v2, Datum v3, Datum v4);
 extern void ReleaseCatCache(HeapTuple tuple);
 
 extern uint32 GetCatCacheHashValue(CatCache *cache,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 8a92ea27ac..12bda02cd7 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -117,6 +117,20 @@ extern void InitCatalogCachePhase2(void);
 
 extern HeapTuple SearchSysCache(int cacheId,
 			   Datum key1, Datum key2, Datum key3, Datum key4);
+
+/*
+ * Use of the argument-specific functions is encouraged; they're faster,
+ * and they insulate the caller from changes in the maximum number of keys.
+ */
+extern HeapTuple SearchSysCache1(int cacheId,
+			   Datum key1);
+extern HeapTuple SearchSysCache2(int cacheId,
+			   Datum key1, Datum key2);
+extern HeapTuple SearchSysCache3(int cacheId,
+			   Datum key1, Datum key2, Datum key3);
+extern HeapTuple SearchSysCache4(int cacheId,
+			   Datum key1, Datum key2, Datum key3, Datum key4);
+
 extern void ReleaseSysCache(HeapTuple tuple);
 
 /* convenience routines */
@@ -156,15 +170,6 @@ extern bool RelationSupportsSysCache(Oid relid);
  * functions is encouraged, as it insulates the caller from changes in the
  * maximum number of keys.
  */
-#define SearchSysCache1(cacheId, key1) \
-	SearchSysCache(cacheId, key1, 0, 0, 0)
-#define SearchSysCache2(cacheId, key1, key2) \
-	SearchSysCache(cacheId, key1, key2, 0, 0)
-#define SearchSysCache3(cacheId, key1, key2, key3) \
-	SearchSysCache(cacheId, key1, key2, key3, 0)
-#define SearchSysCache4(cacheId, key1, key2, key3, key4) \
-	SearchSysCache(cacheId, key1, key2, key3, key4)
-
 #define SearchSysCacheCopy1(cacheId, key1) \
 	SearchSysCacheCopy(cacheId, key1, 0, 0, 0)
 #define SearchSysCacheCopy2(cacheId, key1, key2) \
-- 
2.14.1.536.g6867272d5b.dirty

#2amul sul
sulamul@gmail.com
In reply to: Andres Freund (#1)
1 attachment(s)
Re: Improve catcache/syscache performance.

Patch 0007:

1:
400 + /*
401 + * XXX: might be worthwhile to only handle oid sysattr, to reduce
402 + * overhead - it's the most common key.
403 + */

IMHO, let's fix that as well. I tested this by fixing it (see the attached
patch), but did not find much gain on my local CentOS VM (of course, the
pgbench load wasn't big enough).

2: How about wrapping the following condition in SearchCatCacheMiss() with
unlikely():

if (IsBootstrapProcessingMode())
return NULL;

3: Can we have the following assert in SearchCatCacheN() instead of in
SearchSysCacheN(), so that direct SearchCatCacheN() calls are asserted as
well?

Assert(SysCache[cacheId]->cc_nkeys == <N>);

Other than these concerns, the patch looks pretty reasonable to me.

Regards,
Amul

Attachments:

0007_ex_handle_oid.patchapplication/octet-stream; name=0007_ex_handle_oid.patchDownload
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 67c596d..aa1bc62 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -393,19 +393,21 @@ CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
 		Datum atp;
 		bool isnull;
 
-		/*
-		 * XXX: might be worthwhile to only handle oid sysattr, to reduce
-		 * overhead - it's the most common key.
-		 */
-		atp = heap_getattr(tuple,
-						   cc_key[i],
-						   tupdesc,
-						   &isnull);
-		Assert(!isnull);
-
-		if (!(cc_fastequal[i])(atp, arguments[i]))
+		if (cc_key[i] == ObjectIdAttributeNumber)
 		{
-			return false;
+			if (HeapTupleGetOid(tuple) != DatumGetObjectId(arguments[i]))
+				return false;
+		}
+		else
+		{
+			atp = heap_getattr(tuple,
+							   cc_key[i],
+							   tupdesc,
+							   &isnull);
+			Assert(!isnull);
+
+			if (!(cc_fastequal[i])(atp, arguments[i]))
+				return false;
 		}
 	}
 	return true;
#3Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
5 attachment(s)
Re: Improve catcache/syscache performance.

Hi,

On 2017-09-13 23:12:07 -0700, Andres Freund wrote:

Attached is a patch that tries to improve sys/catcache performance,
going further than the patch referenced earlier.

Here's a variant that cleans up the previous changes a bit, and adds
some further improvements:

Here's the main commit message:

Improve sys/catcache performance.

The following are the individual improvements:
1) Avoidance of FunctionCallInfo based function calls, replaced by
more efficient functions with a native C argument interface.
2) Don't extract columns from a cache entry's tuple whenever matching
entries - instead store them as a Datum array. This also allows us
to get rid of having to build dummy tuples for negative & list
entries, and of a hack for dealing with cstring vs. text weirdness.
3) Reorder members of the catcache.h structs, so important entries are
more likely to be on one cacheline.
4) Allowing the compiler to specialize critical SearchCatCache for a
specific number of attributes allows it to unroll loops and avoid
other nkeys-dependent initialization.
5) Only initializing the ScanKey when necessary, i.e. on catcache
misses, greatly reduces unnecessary CPU cache misses.
6) Split off the cache-miss case from the hash lookup, reducing stack
allocations etc. in the common case.
7) CatCTup and their corresponding heaptuple are allocated in one
piece.

This results in making cache lookups themselves roughly three times as
fast - full-system benchmarks obviously improve less than that.

I've also evaluated further techniques:
- replace open coded hash with simplehash - the list walk right now
shows up in profiles. Unfortunately it's not easy to do so safely as
an entry's memory location can change at various times, which
doesn't work well with the refcounting and cache invalidation.
- Cacheline-aligning CatCTup entries - helps some with performance,
but the win isn't big and the code for it is ugly, because the
tuples have to be freed as well.
- add more proper functions, rather than macros for
SearchSysCacheCopyN etc., but right now they don't show up in
profiles.

The reason the macro wrapper for syscache.c/h have to be changed,
rather than just catcache, is that doing otherwise would require
exposing the SysCache array to the outside. That might be a good idea
anyway, but it's for another day.

With the attached benchmark for wide tuples and simple queries I get:

pgbench -M prepared -f ~/tmp/pgbench-many-cols.sql

master:
tps = 16112.117859 (excluding connections establishing)
tps = 16192.186504 (excluding connections establishing)
tps = 16091.257399 (excluding connections establishing)

patch:
tps = 18616.116993 (excluding connections establishing)
tps = 18584.036276 (excluding connections establishing)
tps = 18843.246281 (excluding connections establishing)

~17% gain

pgbench -M prepared -f ~/tmp/pgbench-many-cols.sql -c 16 -j 16:
master:
tps = 73277.282455 (excluding connections establishing)
tps = 73078.408303 (excluding connections establishing)
tps = 73432.476550 (excluding connections establishing)

patch:
tps = 89424.043728 (excluding connections establishing)
tps = 89223.731307 (excluding connections establishing)
tps = 87830.665009 (excluding connections establishing)

~21% gain

standard pgbench readonly:
1 client:
master:
tps = 41662.984894 (excluding connections establishing)
tps = 40965.435121 (excluding connections establishing)
tps = 41438.197117 (excluding connections establishing)

patch:
tps = 42657.455818 (excluding connections establishing)
tps = 42834.812173 (excluding connections establishing)
tps = 42784.306987 (excluding connections establishing)

So roughly ~2.3%, much smaller, as expected, because the syscache is
much less of a bottleneck here.

-c 16 -j 16:
master:
tps = 204642.558752 (excluding connections establishing)
tps = 205834.493312 (excluding connections establishing)
tps = 207781.943687 (excluding connections establishing)

patch:
tps = 211459.087649 (excluding connections establishing)
tps = 214890.093976 (excluding connections establishing)
tps = 214526.773530 (excluding connections establishing)

So ~3.3%.

I personally find these numbers quite convincing for a fairly localized
microoptimization.

For the attached benchmark, here's the difference in profiles:
before:
single function overhead:
+    8.10%  postgres  postgres            [.] SearchCatCache
-    7.26%  postgres  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   - __memmove_avx_unaligned_erms
      + 59.29% SearchCatCache
      + 23.51% appendBinaryStringInfo
      + 5.56% pgstat_report_activity
      + 4.05% socket_putmessage
      + 2.86% pstrdup
      + 2.65% AllocSetRealloc
      + 0.73% hash_search_with_hash_value
      + 0.68% btrescan
        0.67% 0x55c02baea83f
+    4.97%  postgres  postgres            [.] appendBinaryStringInfo
+    2.92%  postgres  postgres            [.] ExecBuildProjectionInfo
+    2.60%  postgres  libc-2.24.so        [.] __strncpy_sse2_unaligned
+    2.27%  postgres  postgres            [.] hashoid
+    2.18%  postgres  postgres            [.] fmgr_info
+    2.02%  postgres  libc-2.24.so        [.] strlen

hierarchical / include child costs:
+ 21.35% 8.86% postgres postgres [.] SearchCatCache

after:
single function overhead:
+    6.34%  postgres  postgres            [.] appendBinaryStringInfo
+    5.12%  postgres  postgres            [.] SearchCatCache1
-    4.44%  postgres  libc-2.24.so        [.] __memmove_avx_unaligned_erms
   - __memmove_avx_unaligned_erms
      + 60.08% appendBinaryStringInfo
      + 13.88% AllocSetRealloc
      + 11.58% socket_putmessage
      + 6.54% pstrdup
      + 4.67% pgstat_report_activity
      + 1.20% pq_getbytes
      + 1.03% btrescan
        1.03% 0x560d35168dab
+    4.02%  postgres  postgres            [.] fmgr_info
+    3.18%  postgres  postgres            [.] ExecBuildProjectionInfo
+    2.43%  postgres  libc-2.24.so        [.] strlen
hierarchical / include child costs:
+    6.63%     5.12%  postgres  postgres  [.] SearchCatCache1
+    0.49%     0.49%  postgres  postgres  [.] SearchSysCache1
+    0.10%     0.10%  postgres  postgres  [.] SearchCatCache3

(Most of the other top entries here are addressed in nearby threads)

- Andres

Attachments:

0001-Add-pg_noinline-macro-to-c.hv2.patchtext/x-diff; charset=us-asciiDownload
From 9c0e02483a71f15cbff7fcc0573afc8856f1d49b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 13 Sep 2017 19:58:43 -0700
Subject: [PATCH 1/3] Add pg_noinline macro to c.h.

Forcing a function not to be inlined can be useful if it's the
slow-path of a performance critical function, or should be visible in
profiles to allow for proper cost attribution.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de
---
 src/include/c.h | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/src/include/c.h b/src/include/c.h
index fd53010e24..f44610c0ef 100644
--- a/src/include/c.h
+++ b/src/include/c.h
@@ -642,6 +642,22 @@ typedef NameData *Name;
 #define pg_attribute_noreturn()
 #endif
 
+
+/*
+ * Forcing a function not to be inlined can be useful if it's the slow-path of
+ * a performance critical function, or should be visible in profiles to allow
+ * for proper cost attribution.
+ */
+/* GCC, Sunpro and XLC support noinline via __attribute */
+#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
+#define pg_noinline __attribute__((noinline))
+/* msvc via declspec */
+#elif defined(_MSC_VER)
+#define pg_noinline __declspec(noinline)
+#else
+#define pg_noinline
+#endif
+
 /* ----------------------------------------------------------------
  *				Section 6:	assertions
  * ----------------------------------------------------------------
-- 
2.14.1.536.g6867272d5b.dirty

0002-Add-inline-murmurhash32-int32-functionv2.patchtext/x-diff; charset=us-asciiDownload
From 44315ef335ad8d49f1b3bdbbce5d86d22db21f8f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 13 Sep 2017 18:43:46 -0700
Subject: [PATCH 2/3] Add inline murmurhash32(int32) function.

The function already existed in tidbitmap.c, but more users requiring
fast hashing of 32-bit ints are coming up.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de
---
 src/backend/nodes/tidbitmap.c | 20 ++------------------
 src/include/utils/hashutils.h | 18 ++++++++++++++++++
 2 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/src/backend/nodes/tidbitmap.c b/src/backend/nodes/tidbitmap.c
index c4e53adb0c..01d6bc5c11 100644
--- a/src/backend/nodes/tidbitmap.c
+++ b/src/backend/nodes/tidbitmap.c
@@ -45,6 +45,7 @@
 #include "nodes/tidbitmap.h"
 #include "storage/lwlock.h"
 #include "utils/dsa.h"
+#include "utils/hashutils.h"
 
 /*
  * The maximum number of tuples per page is not large (typically 256 with
@@ -237,30 +238,13 @@ static int	tbm_comparator(const void *left, const void *right);
 static int tbm_shared_comparator(const void *left, const void *right,
 					  void *arg);
 
-/*
- * Simple inline murmur hash implementation for the exact width required, for
- * performance.
- */
-static inline uint32
-hash_blockno(BlockNumber b)
-{
-	uint32		h = b;
-
-	h ^= h >> 16;
-	h *= 0x85ebca6b;
-	h ^= h >> 13;
-	h *= 0xc2b2ae35;
-	h ^= h >> 16;
-	return h;
-}
-
 /* define hashtable mapping block numbers to PagetableEntry's */
 #define SH_USE_NONDEFAULT_ALLOCATOR
 #define SH_PREFIX pagetable
 #define SH_ELEMENT_TYPE PagetableEntry
 #define SH_KEY_TYPE BlockNumber
 #define SH_KEY blockno
-#define SH_HASH_KEY(tb, key) hash_blockno(key)
+#define SH_HASH_KEY(tb, key) murmurhash32(key)
 #define SH_EQUAL(tb, a, b) a == b
 #define SH_SCOPE static inline
 #define SH_DEFINE
diff --git a/src/include/utils/hashutils.h b/src/include/utils/hashutils.h
index 56b7bfc9cb..35281689e8 100644
--- a/src/include/utils/hashutils.h
+++ b/src/include/utils/hashutils.h
@@ -20,4 +20,22 @@ hash_combine(uint32 a, uint32 b)
 	return a;
 }
 
+
+/*
+ * Simple inline murmur hash implementation hashing a 32-bit integer, for
+ * performance.
+ */
+static inline uint32
+murmurhash32(uint32 data)
+{
+	uint32		h = data;
+
+	h ^= h >> 16;
+	h *= 0x85ebca6b;
+	h ^= h >> 13;
+	h *= 0xc2b2ae35;
+	h ^= h >> 16;
+	return h;
+}
+
 #endif							/* HASHUTILS_H */
-- 
2.14.1.536.g6867272d5b.dirty

0003-Improve-sys-catcache-performancev2.patchtext/x-diff; charset=us-asciiDownload
From 6f17f9c46f47a3f447966ab8c341057bf1bf6718 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 11 Sep 2017 18:25:39 -0700
Subject: [PATCH 3/3] Improve sys/catcache performance.

The following are the individual improvements:
1) Avoidance of FunctionCallInfo based function calls, replaced by
   more efficient functions with a native C argument interface.
2) Don't extract columns from a cache entry's tuple whenever matching
   entries - instead store them as a Datum array. This also allows us
   to get rid of having to build dummy tuples for negative & list
   entries, and of a hack for dealing with cstring vs. text weirdness.
3) Reorder members of the catcache.h structs, so important entries are
   more likely to be on one cacheline.
4) Allowing the compiler to specialize critical SearchCatCache for a
   specific number of attributes allows it to unroll loops and avoid
   other nkeys-dependent initialization.
5) Only initializing the ScanKey when necessary, i.e. on catcache
   misses, greatly reduces unnecessary CPU cache misses.
6) Split off the cache-miss case from the hash lookup, reducing stack
   allocations etc. in the common case.
7) CatCTup and their corresponding heaptuple are allocated in one
   piece.

This results in making cache lookups themselves roughly three times as
fast - full-system benchmarks obviously improve less than that.

I've also evaluated further techniques:
- replace open coded hash with simplehash - the list walk right now
  shows up in profiles. Unfortunately it's not easy to do so safely as
  an entry's memory location can change at various times, which
  doesn't work well with the refcounting and cache invalidation.
- Cacheline-aligning CatCTup entries - helps some with performance,
  but the win isn't big and the code for it is ugly, because the
  tuples have to be freed as well.
- add more proper functions, rather than macros for
  SearchSysCacheCopyN etc., but right now they don't show up in
  profiles.

The reason the macro wrapper for syscache.c/h have to be changed,
rather than just catcache, is that doing otherwise would require
exposing the SysCache array to the outside.  That might be a good idea
anyway, but it's for another day.

Author: Andres Freund
Discussion: https://postgr.es/m/20170914061207.zxotvyopetm7lrrp@alap3.anarazel.de
---
 src/backend/utils/cache/catcache.c | 678 +++++++++++++++++++++++++------------
 src/backend/utils/cache/syscache.c |  49 ++-
 src/include/utils/catcache.h       | 122 ++++---
 src/include/utils/syscache.h       |  23 +-
 4 files changed, 592 insertions(+), 280 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e092801025..4babcbfcaf 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -30,7 +30,9 @@
 #endif
 #include "storage/lmgr.h"
 #include "utils/builtins.h"
+#include "utils/datum.h"
 #include "utils/fmgroids.h"
+#include "utils/hashutils.h"
 #include "utils/inval.h"
 #include "utils/memutils.h"
 #include "utils/rel.h"
@@ -72,11 +74,25 @@
 /* Cache management header --- pointer is NULL until created */
 static CatCacheHeader *CacheHdr = NULL;
 
+static inline HeapTuple SearchCatCacheInternal(CatCache *cache,
+											   int nkeys,
+											   Datum v1, Datum v2,
+											   Datum v3, Datum v4);
+
+static pg_noinline HeapTuple SearchCatCacheMiss(CatCache *cache,
+												int nkeys,
+												uint32 hashValue,
+												Index hashIndex,
+												Datum v1, Datum v2,
+												Datum v3, Datum v4);
 
 static uint32 CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
-							 ScanKey cur_skey);
-static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache,
+							 Datum v1, Datum v2, Datum v3, Datum v4);
+static uint32 CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys,
 								  HeapTuple tuple);
+static inline bool CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
+											const Datum *cachekeys,
+											const Datum *searchkeys);
 
 #ifdef CATCACHE_STATS
 static void CatCachePrintStats(int code, Datum arg);
@@ -85,9 +101,14 @@ static void CatCacheRemoveCTup(CatCache *cache, CatCTup *ct);
 static void CatCacheRemoveCList(CatCache *cache, CatCList *cl);
 static void CatalogCacheInitializeCache(CatCache *cache);
 static CatCTup *CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
+						Datum *arguments,
 						uint32 hashValue, Index hashIndex,
 						bool negative);
-static HeapTuple build_dummy_tuple(CatCache *cache, int nkeys, ScanKey skeys);
+
+static void CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos,
+							 Datum *keys);
+static void CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
+							 Datum *srckeys, Datum *dstkeys);
 
 
 /*
@@ -95,45 +116,127 @@ static HeapTuple build_dummy_tuple(CatCache *cache, int nkeys, ScanKey skeys);
  */
 
 /*
- * Look up the hash and equality functions for system types that are used
- * as cache key fields.
- *
- * XXX this should be replaced by catalog lookups,
- * but that seems to pose considerable risk of circularity...
+ * Hash and equality functions for system types that are used as cache key
+ * fields.  To compute hashes, and to check for hash collisions, use functions
+ * hardcoded for that purpose. This is sufficiently performance critical that
+ * the overhead of SQL style function calls is noticeable.
  */
+
+static bool
+chareqfast(Datum a, Datum b)
+{
+	return DatumGetChar(a) == DatumGetChar(b);
+}
+
+static uint32
+charhashfast(Datum datum)
+{
+	return murmurhash32((int32) DatumGetChar(datum));
+}
+
+static bool
+nameeqfast(Datum a, Datum b)
+{
+	char	   *ca = NameStr(*DatumGetName(a));
+	char	   *cb = NameStr(*DatumGetName(b));
+
+	return strncmp(ca, cb, NAMEDATALEN) == 0;
+}
+
+static uint32
+namehashfast(Datum datum)
+{
+	char	   *key = NameStr(*DatumGetName(datum));
+
+	return hash_any((unsigned char *) key, strlen(key));
+}
+
+static bool
+int2eqfast(Datum a, Datum b)
+{
+	return DatumGetInt16(a) == DatumGetInt16(b);
+}
+
+static uint32
+int2hashfast(Datum datum)
+{
+	return murmurhash32((int32) DatumGetInt16(datum));
+}
+
+static bool
+int4eqfast(Datum a, Datum b)
+{
+	return DatumGetInt32(a) == DatumGetInt32(b);
+}
+
+static uint32
+int4hashfast(Datum datum)
+{
+	return murmurhash32((int32) DatumGetInt32(datum));
+}
+
+static bool
+texteqfast(Datum a, Datum b)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetBool(DirectFunctionCall2(texteq, a, b));
+}
+
+static uint32
+texthashfast(Datum datum)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetInt32(DirectFunctionCall1(hashtext, datum));
+}
+
+static bool
+oidvectoreqfast(Datum a, Datum b)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetBool(DirectFunctionCall2(oidvectoreq, a, b));
+}
+
+static uint32
+oidvectorhashfast(Datum datum)
+{
+	/* not as performance critical & "complicated" */
+	return DatumGetInt32(DirectFunctionCall1(hashoidvector, datum));
+}
+
+/* Lookup support functions for a type. */
 static void
-GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
+GetCCHashEqFuncs(Oid keytype, CCHashFN *hashfunc, RegProcedure *eqfunc, CCFastEqualFN *fasteqfunc)
 {
 	switch (keytype)
 	{
 		case BOOLOID:
-			*hashfunc = hashchar;
-
+			*hashfunc = charhashfast;
+			*fasteqfunc = chareqfast;
 			*eqfunc = F_BOOLEQ;
 			break;
 		case CHAROID:
-			*hashfunc = hashchar;
-
+			*hashfunc = charhashfast;
+			*fasteqfunc = chareqfast;
 			*eqfunc = F_CHAREQ;
 			break;
 		case NAMEOID:
-			*hashfunc = hashname;
-
+			*hashfunc = namehashfast;
+			*fasteqfunc = nameeqfast;
 			*eqfunc = F_NAMEEQ;
 			break;
 		case INT2OID:
-			*hashfunc = hashint2;
-
+			*hashfunc = int2hashfast;
+			*fasteqfunc = int2eqfast;
 			*eqfunc = F_INT2EQ;
 			break;
 		case INT4OID:
-			*hashfunc = hashint4;
-
+			*hashfunc = int4hashfast;
+			*fasteqfunc = int4eqfast;
 			*eqfunc = F_INT4EQ;
 			break;
 		case TEXTOID:
-			*hashfunc = hashtext;
-
+			*hashfunc = texthashfast;
+			*fasteqfunc = texteqfast;
 			*eqfunc = F_TEXTEQ;
 			break;
 		case OIDOID:
@@ -147,13 +250,13 @@ GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
 		case REGDICTIONARYOID:
 		case REGROLEOID:
 		case REGNAMESPACEOID:
-			*hashfunc = hashoid;
-
+			*hashfunc = int4hashfast;
+			*fasteqfunc = int4eqfast;
 			*eqfunc = F_OIDEQ;
 			break;
 		case OIDVECTOROID:
-			*hashfunc = hashoidvector;
-
+			*hashfunc = oidvectorhashfast;
+			*fasteqfunc = oidvectoreqfast;
 			*eqfunc = F_OIDVECTOREQ;
 			break;
 		default:
@@ -171,10 +274,12 @@ GetCCHashEqFuncs(Oid keytype, PGFunction *hashfunc, RegProcedure *eqfunc)
  * Compute the hash value associated with a given set of lookup keys
  */
 static uint32
-CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
+CatalogCacheComputeHashValue(CatCache *cache, int nkeys,
+							 Datum v1, Datum v2, Datum v3, Datum v4)
 {
 	uint32		hashValue = 0;
 	uint32		oneHash;
+	CCHashFN   *cc_hashfunc = cache->cc_hashfunc;
 
 	CACHE4_elog(DEBUG2, "CatalogCacheComputeHashValue %s %d %p",
 				cache->cc_relname,
@@ -184,30 +289,26 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
 	switch (nkeys)
 	{
 		case 4:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[3],
-												   cur_skey[3].sk_argument));
+			oneHash = (cc_hashfunc[3])(v4);
+
 			hashValue ^= oneHash << 24;
 			hashValue ^= oneHash >> 8;
 			/* FALLTHROUGH */
 		case 3:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[2],
-												   cur_skey[2].sk_argument));
+			oneHash = (cc_hashfunc[2])(v3);
+
 			hashValue ^= oneHash << 16;
 			hashValue ^= oneHash >> 16;
 			/* FALLTHROUGH */
 		case 2:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[1],
-												   cur_skey[1].sk_argument));
+			oneHash = (cc_hashfunc[1])(v2);
+
 			hashValue ^= oneHash << 8;
 			hashValue ^= oneHash >> 24;
 			/* FALLTHROUGH */
 		case 1:
-			oneHash =
-				DatumGetUInt32(DirectFunctionCall1(cache->cc_hashfunc[0],
-												   cur_skey[0].sk_argument));
+			oneHash = (cc_hashfunc[0])(v1);
+
 			hashValue ^= oneHash;
 			break;
 		default:
@@ -224,63 +325,81 @@ CatalogCacheComputeHashValue(CatCache *cache, int nkeys, ScanKey cur_skey)
  * Compute the hash value associated with a given tuple to be cached
  */
 static uint32
-CatalogCacheComputeTupleHashValue(CatCache *cache, HeapTuple tuple)
+CatalogCacheComputeTupleHashValue(CatCache *cache, int nkeys, HeapTuple tuple)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Datum		v1 = 0, v2 = 0, v3 = 0, v4 = 0;
 	bool		isNull = false;
-
-	/* Copy pre-initialized overhead data for scankey */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
+	int		   *cc_keyno = cache->cc_keyno;
+	TupleDesc	cc_tupdesc = cache->cc_tupdesc;
 
 	/* Now extract key fields from tuple, insert into scankey */
-	switch (cache->cc_nkeys)
+	switch (nkeys)
 	{
 		case 4:
-			cur_skey[3].sk_argument =
-				(cache->cc_key[3] == ObjectIdAttributeNumber)
+			v4 = (cc_keyno[3] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[3],
-							  cache->cc_tupdesc,
+							  cc_keyno[3],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 3:
-			cur_skey[2].sk_argument =
-				(cache->cc_key[2] == ObjectIdAttributeNumber)
+			v3 = (cc_keyno[2] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[2],
-							  cache->cc_tupdesc,
+							  cc_keyno[2],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 2:
-			cur_skey[1].sk_argument =
-				(cache->cc_key[1] == ObjectIdAttributeNumber)
+			v2 = (cc_keyno[1] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[1],
-							  cache->cc_tupdesc,
+							  cc_keyno[1],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			/* FALLTHROUGH */
 		case 1:
-			cur_skey[0].sk_argument =
-				(cache->cc_key[0] == ObjectIdAttributeNumber)
+			v1 = (cc_keyno[0] == ObjectIdAttributeNumber)
 				? ObjectIdGetDatum(HeapTupleGetOid(tuple))
 				: fastgetattr(tuple,
-							  cache->cc_key[0],
-							  cache->cc_tupdesc,
+							  cc_keyno[0],
+							  cc_tupdesc,
 							  &isNull);
 			Assert(!isNull);
 			break;
 		default:
-			elog(FATAL, "wrong number of hash keys: %d", cache->cc_nkeys);
+			elog(FATAL, "wrong number of hash keys: %d", nkeys);
 			break;
 	}
 
-	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	return CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
+}
+
+/*
+ *		CatalogCacheCompareTuple
+ *
+ * Compare a tuple to the passed arguments.
+ */
+static inline bool
+CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
+						 const Datum *cachekeys,
+						 const Datum *searchkeys)
+{
+	const CCFastEqualFN *cc_fastequal = cache->cc_fastequal;
+	int i;
+
+	for (i = 0; i < nkeys; i++)
+	{
+		if (!(cc_fastequal[i])(cachekeys[i], searchkeys[i]))
+		{
+			return false;
+		}
+	}
+	return true;
 }
 
 
@@ -371,9 +490,14 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	/* delink from linked list */
 	dlist_delete(&ct->cache_elem);
 
-	/* free associated tuple data */
-	if (ct->tuple.t_data != NULL)
-		pfree(ct->tuple.t_data);
+	/*
+	 * Free keys when we're dealing with a negative entry, normal entries just
+	 * point into tuple, allocated together with the CatCTup.
+	 */
+	if (ct->negative)
+		CatCacheFreeKeys(cache->cc_tupdesc, cache->cc_nkeys,
+						 cache->cc_keyno, ct->keys);
+
 	pfree(ct);
 
 	--cache->cc_ntup;
@@ -414,9 +538,10 @@ CatCacheRemoveCList(CatCache *cache, CatCList *cl)
 	/* delink from linked list */
 	dlist_delete(&cl->cache_elem);
 
-	/* free associated tuple data */
-	if (cl->tuple.t_data != NULL)
-		pfree(cl->tuple.t_data);
+	/* free associated column data */
+	CatCacheFreeKeys(cache->cc_tupdesc, cl->nkeys,
+					 cache->cc_keyno, cl->keys);
+
 	pfree(cl);
 }
 
@@ -660,6 +785,7 @@ InitCatCache(int id,
 {
 	CatCache   *cp;
 	MemoryContext oldcxt;
+	size_t		sz;
 	int			i;
 
 	/*
@@ -699,11 +825,12 @@ InitCatCache(int id,
 	}
 
 	/*
-	 * allocate a new cache structure
+	 * Allocate a new cache structure, aligning to a cacheline boundary
 	 *
 	 * Note: we rely on zeroing to initialize all the dlist headers correctly
 	 */
-	cp = (CatCache *) palloc0(sizeof(CatCache));
+	sz =  sizeof(CatCache) + PG_CACHE_LINE_SIZE;
+	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
 	cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
 
 	/*
@@ -721,7 +848,7 @@ InitCatCache(int id,
 	cp->cc_nbuckets = nbuckets;
 	cp->cc_nkeys = nkeys;
 	for (i = 0; i < nkeys; ++i)
-		cp->cc_key[i] = key[i];
+		cp->cc_keyno[i] = key[i];
 
 	/*
 	 * new cache is initialized as far as we can go for now. print some
@@ -794,13 +921,13 @@ RehashCatCache(CatCache *cp)
 
 #define CatalogCacheInitializeCache_DEBUG2 \
 do { \
-		if (cache->cc_key[i] > 0) { \
+		if (cache->cc_keyno[i] > 0) { \
 			elog(DEBUG2, "CatalogCacheInitializeCache: load %d/%d w/%d, %u", \
-				i+1, cache->cc_nkeys, cache->cc_key[i], \
-				 TupleDescAttr(tupdesc, cache->cc_key[i] - 1)->atttypid); \
+				i+1, cache->cc_nkeys, cache->cc_keyno[i], \
+				 TupleDescAttr(tupdesc, cache->cc_keyno[i] - 1)->atttypid); \
 		} else { \
 			elog(DEBUG2, "CatalogCacheInitializeCache: load %d/%d w/%d", \
-				i+1, cache->cc_nkeys, cache->cc_key[i]); \
+				i+1, cache->cc_nkeys, cache->cc_keyno[i]); \
 		} \
 } while(0)
 #else
@@ -860,10 +987,10 @@ CatalogCacheInitializeCache(CatCache *cache)
 
 		CatalogCacheInitializeCache_DEBUG2;
 
-		if (cache->cc_key[i] > 0)
+		if (cache->cc_keyno[i] > 0)
 		{
 			Form_pg_attribute attr = TupleDescAttr(tupdesc,
-												   cache->cc_key[i] - 1);
+												   cache->cc_keyno[i] - 1);
 
 			keytype = attr->atttypid;
 			/* cache key columns should always be NOT NULL */
@@ -871,16 +998,15 @@ CatalogCacheInitializeCache(CatCache *cache)
 		}
 		else
 		{
-			if (cache->cc_key[i] != ObjectIdAttributeNumber)
+			if (cache->cc_keyno[i] != ObjectIdAttributeNumber)
 				elog(FATAL, "only sys attr supported in caches is OID");
 			keytype = OIDOID;
 		}
 
 		GetCCHashEqFuncs(keytype,
 						 &cache->cc_hashfunc[i],
-						 &eqfunc);
-
-		cache->cc_isname[i] = (keytype == NAMEOID);
+						 &eqfunc,
+						 &cache->cc_fastequal[i]);
 
 		/*
 		 * Do equality-function lookup (we assume this won't need a catalog
@@ -891,7 +1017,7 @@ CatalogCacheInitializeCache(CatCache *cache)
 					  CacheMemoryContext);
 
 		/* Initialize sk_attno suitably for HeapKeyTest() and heap scans */
-		cache->cc_skey[i].sk_attno = cache->cc_key[i];
+		cache->cc_skey[i].sk_attno = cache->cc_keyno[i];
 
 		/* Fill in sk_strategy as well --- always standard equality */
 		cache->cc_skey[i].sk_strategy = BTEqualStrategyNumber;
@@ -1020,7 +1146,7 @@ IndexScanOK(CatCache *cache, ScanKey cur_skey)
 }
 
 /*
- *	SearchCatCache
+ *	SearchCatCacheInternal
  *
  *		This call searches a system cache for a tuple, opening the relation
  *		if necessary (on the first access to a particular cache).
@@ -1042,42 +1168,90 @@ SearchCatCache(CatCache *cache,
 			   Datum v3,
 			   Datum v4)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	return SearchCatCacheInternal(cache, cache->cc_nkeys, v1, v2, v3, v4);
+}
+
+
+/*
+ * SearchCatCacheN() are SearchCatCache() versions for a specific number of
+ * arguments. The compiler can inline the body and unroll the loop, making
+ * them a bit faster than SearchCatCache().
+ */
+
+HeapTuple
+SearchCatCache1(CatCache *cache,
+				Datum v1)
+{
+	return SearchCatCacheInternal(cache, 1, v1, 0, 0, 0);
+}
+
+
+HeapTuple
+SearchCatCache2(CatCache *cache,
+				Datum v1, Datum v2)
+{
+	return SearchCatCacheInternal(cache, 2, v1, v2, 0, 0);
+}
+
+
+HeapTuple
+SearchCatCache3(CatCache *cache,
+				Datum v1, Datum v2, Datum v3)
+{
+	return SearchCatCacheInternal(cache, 3, v1, v2, v3, 0);
+}
+
+
+HeapTuple
+SearchCatCache4(CatCache *cache,
+				Datum v1, Datum v2, Datum v3, Datum v4)
+{
+	return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
+}
+
+/*
+ * Work-horse for SearchCatCache/SearchCatCacheN.
+ */
+static inline HeapTuple
+SearchCatCacheInternal(CatCache *cache,
+			   int nkeys,
+			   Datum v1,
+			   Datum v2,
+			   Datum v3,
+			   Datum v4)
+{
+	Datum		arguments[CATCACHE_MAXKEYS];
 	uint32		hashValue;
 	Index		hashIndex;
 	dlist_iter	iter;
 	dlist_head *bucket;
 	CatCTup    *ct;
-	Relation	relation;
-	SysScanDesc scandesc;
-	HeapTuple	ntp;
 
 	/* Make sure we're in an xact, even if this ends up being a cache hit */
 	Assert(IsTransactionState());
 
+	Assert(cache->cc_nkeys == nkeys);
+
 	/*
 	 * one-time startup overhead for each cache
 	 */
-	if (cache->cc_tupdesc == NULL)
+	if (unlikely(cache->cc_tupdesc == NULL))
 		CatalogCacheInitializeCache(cache);
 
 #ifdef CATCACHE_STATS
 	cache->cc_searches++;
 #endif
 
-	/*
-	 * initialize the search key information
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
+	/* Initialize local parameter array */
+	arguments[0] = v1;
+	arguments[1] = v2;
+	arguments[2] = v3;
+	arguments[3] = v4;
 
 	/*
 	 * find the hash bucket in which to look for the tuple
 	 */
-	hashValue = CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	hashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
 	hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 
 	/*
@@ -1089,8 +1263,6 @@ SearchCatCache(CatCache *cache,
 	bucket = &cache->cc_bucket[hashIndex];
 	dlist_foreach(iter, bucket)
 	{
-		bool		res;
-
 		ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
 		if (ct->dead)
@@ -1099,15 +1271,7 @@ SearchCatCache(CatCache *cache,
 		if (ct->hash_value != hashValue)
 			continue;			/* quickly skip entry if wrong hash val */
 
-		/*
-		 * see if the cached tuple matches our key.
-		 */
-		HeapKeyTest(&ct->tuple,
-					cache->cc_tupdesc,
-					cache->cc_nkeys,
-					cur_skey,
-					res);
-		if (!res)
+		if (!CatalogCacheCompareTuple(cache, nkeys, ct->keys, arguments))
 			continue;
 
 		/*
@@ -1150,6 +1314,49 @@ SearchCatCache(CatCache *cache,
 		}
 	}
 
+	return SearchCatCacheMiss(cache, nkeys, hashValue, hashIndex, v1, v2, v3, v4);
+}
+
+/*
+ * Search the actual catalogs, rather than the cache.
+ *
+ * This is kept separate from SearchCatCacheInternal() to keep the fast-path
+ * as small as possible.  To avoid that effort being undone by a helpful
+ * compiler, try to explicitly forbid inlining.
+ */
+static pg_noinline HeapTuple
+SearchCatCacheMiss(CatCache *cache,
+				   int nkeys,
+				   uint32 hashValue,
+				   Index hashIndex,
+				   Datum v1,
+				   Datum v2,
+				   Datum v3,
+				   Datum v4)
+{
+	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Relation	relation;
+	SysScanDesc scandesc;
+	HeapTuple	ntp;
+	CatCTup    *ct;
+	Datum		arguments[CATCACHE_MAXKEYS];
+
+	/* Initialize local parameter array */
+	arguments[0] = v1;
+	arguments[1] = v2;
+	arguments[2] = v3;
+	arguments[3] = v4;
+
+	/*
+	 * Ok, need to make a lookup in the relation, copy the scankey and fill out
+	 * any per-call fields.
+	 */
+	memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * nkeys);
+	cur_skey[0].sk_argument = v1;
+	cur_skey[1].sk_argument = v2;
+	cur_skey[2].sk_argument = v3;
+	cur_skey[3].sk_argument = v4;
+
 	/*
 	 * Tuple was not found in cache, so we have to try to retrieve it directly
 	 * from the relation.  If found, we will add it to the cache; if not
@@ -1171,14 +1378,14 @@ SearchCatCache(CatCache *cache,
 								  cache->cc_indexoid,
 								  IndexScanOK(cache, cur_skey),
 								  NULL,
-								  cache->cc_nkeys,
+								  nkeys,
 								  cur_skey);
 
 	ct = NULL;
 
 	while (HeapTupleIsValid(ntp = systable_getnext(scandesc)))
 	{
-		ct = CatalogCacheCreateEntry(cache, ntp,
+		ct = CatalogCacheCreateEntry(cache, ntp, arguments,
 									 hashValue, hashIndex,
 									 false);
 		/* immediately set the refcount to 1 */
@@ -1207,11 +1414,9 @@ SearchCatCache(CatCache *cache,
 		if (IsBootstrapProcessingMode())
 			return NULL;
 
-		ntp = build_dummy_tuple(cache, cache->cc_nkeys, cur_skey);
-		ct = CatalogCacheCreateEntry(cache, ntp,
+		ct = CatalogCacheCreateEntry(cache, NULL, arguments,
 									 hashValue, hashIndex,
 									 true);
-		heap_freetuple(ntp);
 
 		CACHE4_elog(DEBUG2, "SearchCatCache(%s): Contains %d/%d tuples",
 					cache->cc_relname, cache->cc_ntup, CacheHdr->ch_ntup);
@@ -1288,27 +1493,16 @@ GetCatCacheHashValue(CatCache *cache,
 					 Datum v3,
 					 Datum v4)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
-
 	/*
 	 * one-time startup overhead for each cache
 	 */
 	if (cache->cc_tupdesc == NULL)
 		CatalogCacheInitializeCache(cache);
 
-	/*
-	 * initialize the search key information
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
-
 	/*
 	 * calculate the hash value
 	 */
-	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, cur_skey);
+	return CatalogCacheComputeHashValue(cache, cache->cc_nkeys, v1, v2, v3, v4);
 }
 
 
@@ -1329,7 +1523,7 @@ SearchCatCacheList(CatCache *cache,
 				   Datum v3,
 				   Datum v4)
 {
-	ScanKeyData cur_skey[CATCACHE_MAXKEYS];
+	Datum		arguments[CATCACHE_MAXKEYS];
 	uint32		lHashValue;
 	dlist_iter	iter;
 	CatCList   *cl;
@@ -1354,21 +1548,18 @@ SearchCatCacheList(CatCache *cache,
 	cache->cc_lsearches++;
 #endif
 
-	/*
-	 * initialize the search key information
-	 */
-	memcpy(cur_skey, cache->cc_skey, sizeof(cur_skey));
-	cur_skey[0].sk_argument = v1;
-	cur_skey[1].sk_argument = v2;
-	cur_skey[2].sk_argument = v3;
-	cur_skey[3].sk_argument = v4;
+	/* Initialize local parameter array */
+	arguments[0] = v1;
+	arguments[1] = v2;
+	arguments[2] = v3;
+	arguments[3] = v4;
 
 	/*
 	 * compute a hash value of the given keys for faster search.  We don't
 	 * presently divide the CatCList items into buckets, but this still lets
 	 * us skip non-matching items quickly most of the time.
 	 */
-	lHashValue = CatalogCacheComputeHashValue(cache, nkeys, cur_skey);
+	lHashValue = CatalogCacheComputeHashValue(cache, nkeys, v1, v2, v3, v4);
 
 	/*
 	 * scan the items until we find a match or exhaust our list
@@ -1378,8 +1569,6 @@ SearchCatCacheList(CatCache *cache,
 	 */
 	dlist_foreach(iter, &cache->cc_lists)
 	{
-		bool		res;
-
 		cl = dlist_container(CatCList, cache_elem, iter.cur);
 
 		if (cl->dead)
@@ -1393,12 +1582,8 @@ SearchCatCacheList(CatCache *cache,
 		 */
 		if (cl->nkeys != nkeys)
 			continue;
-		HeapKeyTest(&cl->tuple,
-					cache->cc_tupdesc,
-					nkeys,
-					cur_skey,
-					res);
-		if (!res)
+
+		if (!CatalogCacheCompareTuple(cache, nkeys, cl->keys, arguments))
 			continue;
 
 		/*
@@ -1441,9 +1626,20 @@ SearchCatCacheList(CatCache *cache,
 
 	PG_TRY();
 	{
+		ScanKeyData cur_skey[CATCACHE_MAXKEYS];
 		Relation	relation;
 		SysScanDesc scandesc;
 
+		/*
+		 * Ok, need to make a lookup in the relation, copy the scankey and fill out
+		 * any per-call fields.
+		 */
+		memcpy(cur_skey, cache->cc_skey, sizeof(ScanKeyData) * cache->cc_nkeys);
+		cur_skey[0].sk_argument = v1;
+		cur_skey[1].sk_argument = v2;
+		cur_skey[2].sk_argument = v3;
+		cur_skey[3].sk_argument = v4;
+
 		relation = heap_open(cache->cc_reloid, AccessShareLock);
 
 		scandesc = systable_beginscan(relation,
@@ -1467,7 +1663,7 @@ SearchCatCacheList(CatCache *cache,
 			 * See if there's an entry for this tuple already.
 			 */
 			ct = NULL;
-			hashValue = CatalogCacheComputeTupleHashValue(cache, ntp);
+			hashValue = CatalogCacheComputeTupleHashValue(cache, cache->cc_nkeys, ntp);
 			hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 
 			bucket = &cache->cc_bucket[hashIndex];
@@ -1498,7 +1694,7 @@ SearchCatCacheList(CatCache *cache,
 			if (!found)
 			{
 				/* We didn't find a usable entry, so make a new one */
-				ct = CatalogCacheCreateEntry(cache, ntp,
+				ct = CatalogCacheCreateEntry(cache, ntp, arguments,
 											 hashValue, hashIndex,
 											 false);
 			}
@@ -1513,18 +1709,16 @@ SearchCatCacheList(CatCache *cache,
 
 		heap_close(relation, AccessShareLock);
 
-		/*
-		 * Now we can build the CatCList entry.  First we need a dummy tuple
-		 * containing the key values...
-		 */
-		ntp = build_dummy_tuple(cache, nkeys, cur_skey);
+		/* Now we can build the CatCList entry. */
 		oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
 		nmembers = list_length(ctlist);
 		cl = (CatCList *)
 			palloc(offsetof(CatCList, members) + nmembers * sizeof(CatCTup *));
-		heap_copytuple_with_tuple(ntp, &cl->tuple);
+
+		/* Extract key values */
+		CatCacheCopyKeys(cache->cc_tupdesc, nkeys, cache->cc_keyno,
+						 arguments, cl->keys);
 		MemoryContextSwitchTo(oldcxt);
-		heap_freetuple(ntp);
 
 		/*
 		 * We are now past the last thing that could trigger an elog before we
@@ -1621,35 +1815,82 @@ ReleaseCatCacheList(CatCList *list)
  *		supplied data into it.  The new entry initially has refcount 0.
  */
 static CatCTup *
-CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
-						uint32 hashValue, Index hashIndex, bool negative)
+CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
+						uint32 hashValue, Index hashIndex,
+						bool negative)
 {
 	CatCTup    *ct;
 	HeapTuple	dtp;
 	MemoryContext oldcxt;
+	int i;
 
 	/*
-	 * If there are any out-of-line toasted fields in the tuple, expand them
-	 * in-line.  This saves cycles during later use of the catcache entry, and
-	 * also protects us against the possibility of the toast tuples being
-	 * freed before we attempt to fetch them, in case of something using a
-	 * slightly stale catcache entry.
 	 */
-	if (HeapTupleHasExternal(ntp))
-		dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
+
+	/* negative entries have no tuple associated */
+	if (ntp)
+	{
+		Assert(!negative);
+
+		/*
+		 * If there are any out-of-line toasted fields in the tuple, expand them
+		 * in-line.  This saves cycles during later use of the catcache entry, and
+		 * also protects us against the possibility of the toast tuples being
+		 * freed before we attempt to fetch them, in case of something using a
+		 * slightly stale catcache entry.
+		 */
+		if (HeapTupleHasExternal(ntp))
+			dtp = toast_flatten_tuple(ntp, cache->cc_tupdesc);
+		else
+			dtp = ntp;
+
+		/* Allocate memory for CatCTup and the cached tuple in one go */
+		oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+
+		ct = (CatCTup *) palloc(sizeof(CatCTup) +
+								MAXIMUM_ALIGNOF + dtp->t_len);
+		ct->tuple.t_len = dtp->t_len;
+		ct->tuple.t_self = dtp->t_self;
+		ct->tuple.t_tableOid = dtp->t_tableOid;
+		ct->tuple.t_data = (HeapTupleHeader)
+			MAXALIGN(((char *) ct) + sizeof(CatCTup));
+		/* copy tuple contents */
+		memcpy((char *) ct->tuple.t_data,
+			   (const char *) dtp->t_data,
+			   dtp->t_len);
+		MemoryContextSwitchTo(oldcxt);
+
+		if (dtp != ntp)
+			heap_freetuple(dtp);
+
+		/* extract keys - they'll point into the tuple if not by-value */
+		for (i = 0; i < cache->cc_nkeys; i++)
+		{
+			Datum atp;
+			bool isnull;
+
+			atp = heap_getattr(&ct->tuple,
+							   cache->cc_keyno[i],
+							   cache->cc_tupdesc,
+							   &isnull);
+			Assert(!isnull);
+			ct->keys[i] = atp;
+		}
+	}
 	else
-		dtp = ntp;
+	{
+		Assert(negative);
+		oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
+		ct = (CatCTup *) palloc(sizeof(CatCTup));
 
-	/*
-	 * Allocate CatCTup header in cache memory, and copy the tuple there too.
-	 */
-	oldcxt = MemoryContextSwitchTo(CacheMemoryContext);
-	ct = (CatCTup *) palloc(sizeof(CatCTup));
-	heap_copytuple_with_tuple(dtp, &ct->tuple);
-	MemoryContextSwitchTo(oldcxt);
-
-	if (dtp != ntp)
-		heap_freetuple(dtp);
+		/*
+		 * Store keys - they'll point into separately allocated memory if not
+		 * by-value.
+		 */
+		CatCacheCopyKeys(cache->cc_tupdesc, cache->cc_nkeys, cache->cc_keyno,
+						 arguments, ct->keys);
+		MemoryContextSwitchTo(oldcxt);
+	}
 
 	/*
 	 * Finish initializing the CatCTup header, and add it to the cache's
@@ -1679,71 +1920,66 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp,
 }
 
 /*
- * build_dummy_tuple
- *		Generate a palloc'd HeapTuple that contains the specified key
- *		columns, and NULLs for other columns.
- *
- * This is used to store the keys for negative cache entries and CatCList
- * entries, which don't have real tuples associated with them.
+ * Helper routine that frees keys stored in the keys array.
  */
-static HeapTuple
-build_dummy_tuple(CatCache *cache, int nkeys, ScanKey skeys)
+static void
+CatCacheFreeKeys(TupleDesc tupdesc, int nkeys, int *attnos, Datum *keys)
 {
-	HeapTuple	ntp;
-	TupleDesc	tupDesc = cache->cc_tupdesc;
-	Datum	   *values;
-	bool	   *nulls;
-	Oid			tupOid = InvalidOid;
-	NameData	tempNames[4];
-	int			i;
+	int i;
 
-	values = (Datum *) palloc(tupDesc->natts * sizeof(Datum));
-	nulls = (bool *) palloc(tupDesc->natts * sizeof(bool));
+	for (i = 0; i < nkeys; i++)
+	{
+		int attnum = attnos[i];
+		Form_pg_attribute att;
 
-	memset(values, 0, tupDesc->natts * sizeof(Datum));
-	memset(nulls, true, tupDesc->natts * sizeof(bool));
+		/* only valid system attribute is the oid, which is by value */
+		if (attnum == ObjectIdAttributeNumber)
+			continue;
+		Assert(attnum > 0);
+
+		att = TupleDescAttr(tupdesc, attnum - 1);
+
+		if (!att->attbyval)
+			pfree(DatumGetPointer(keys[i]));
+	}
+}
+
+/*
+ * Helper routine that copies the keys in the srckeys array into the dstkeys
+ * one, guaranteeing that the datums are fully allocated in the current memory
+ * context.
+ */
+static void
+CatCacheCopyKeys(TupleDesc tupdesc, int nkeys, int *attnos,
+				 Datum *srckeys, Datum *dstkeys)
+{
+	int i;
+
+	/*
+	 * XXX: memory and lookup performance could possibly be improved by
+	 * storing all keys in one allocation.
+	 */
 
 	for (i = 0; i < nkeys; i++)
 	{
-		int			attindex = cache->cc_key[i];
-		Datum		keyval = skeys[i].sk_argument;
+		int attnum = attnos[i];
 
-		if (attindex > 0)
+		if (attnum != ObjectIdAttributeNumber)
 		{
-			/*
-			 * Here we must be careful in case the caller passed a C string
-			 * where a NAME is wanted: convert the given argument to a
-			 * correctly padded NAME.  Otherwise the memcpy() done in
-			 * heap_form_tuple could fall off the end of memory.
-			 */
-			if (cache->cc_isname[i])
-			{
-				Name		newval = &tempNames[i];
+			Form_pg_attribute att = TupleDescAttr(tupdesc, attnum - 1);
 
-				namestrcpy(newval, DatumGetCString(keyval));
-				keyval = NameGetDatum(newval);
-			}
-			values[attindex - 1] = keyval;
-			nulls[attindex - 1] = false;
+			dstkeys[i] = datumCopy(srckeys[i],
+								   att->attbyval,
+								   att->attlen);
 		}
 		else
 		{
-			Assert(attindex == ObjectIdAttributeNumber);
-			tupOid = DatumGetObjectId(keyval);
+			dstkeys[i] = srckeys[i];
 		}
 	}
 
-	ntp = heap_form_tuple(tupDesc, values, nulls);
-	if (tupOid != InvalidOid)
-		HeapTupleSetOid(ntp, tupOid);
-
-	pfree(values);
-	pfree(nulls);
-
-	return ntp;
 }
 
-
 /*
  *	PrepareToInvalidateCacheTuple()
  *
@@ -1820,7 +2056,7 @@ PrepareToInvalidateCacheTuple(Relation relation,
 		if (ccp->cc_tupdesc == NULL)
 			CatalogCacheInitializeCache(ccp);
 
-		hashvalue = CatalogCacheComputeTupleHashValue(ccp, tuple);
+		hashvalue = CatalogCacheComputeTupleHashValue(ccp, ccp->cc_nkeys, tuple);
 		dbid = ccp->cc_relisshared ? (Oid) 0 : MyDatabaseId;
 
 		(*function) (ccp->id, hashvalue, dbid);
@@ -1829,7 +2065,7 @@ PrepareToInvalidateCacheTuple(Relation relation,
 		{
 			uint32		newhashvalue;
 
-			newhashvalue = CatalogCacheComputeTupleHashValue(ccp, newtuple);
+			newhashvalue = CatalogCacheComputeTupleHashValue(ccp, ccp->cc_nkeys, newtuple);
 
 			if (newhashvalue != hashvalue)
 				(*function) (ccp->id, newhashvalue, dbid);
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index fcbb683a99..888edbb325 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -1102,13 +1102,56 @@ SearchSysCache(int cacheId,
 			   Datum key3,
 			   Datum key4)
 {
-	if (cacheId < 0 || cacheId >= SysCacheSize ||
-		!PointerIsValid(SysCache[cacheId]))
-		elog(ERROR, "invalid cache ID: %d", cacheId);
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
 
 	return SearchCatCache(SysCache[cacheId], key1, key2, key3, key4);
 }
 
+HeapTuple
+SearchSysCache1(int cacheId,
+				Datum key1)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 1);
+
+	return SearchCatCache1(SysCache[cacheId], key1);
+}
+
+HeapTuple
+SearchSysCache2(int cacheId,
+				Datum key1, Datum key2)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 2);
+
+	return SearchCatCache2(SysCache[cacheId], key1, key2);
+}
+
+HeapTuple
+SearchSysCache3(int cacheId,
+				Datum key1, Datum key2, Datum key3)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 3);
+
+	return SearchCatCache3(SysCache[cacheId], key1, key2, key3);
+}
+
+HeapTuple
+SearchSysCache4(int cacheId,
+				Datum key1, Datum key2, Datum key3, Datum key4)
+{
+	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
+		   PointerIsValid(SysCache[cacheId]));
+	Assert(SysCache[cacheId]->cc_nkeys == 4);
+
+	return SearchCatCache4(SysCache[cacheId], key1, key2, key3, key4);
+}
+
 /*
  * ReleaseSysCache
  *		Release previously grabbed reference count on a tuple
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index 200a3022e7..dc8375f829 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -34,26 +34,31 @@
 
 #define CATCACHE_MAXKEYS		4
 
+
+/* function computing a datum's hash */
+typedef uint32 (*CCHashFN) (Datum datum);
+/* function computing equality of two datums */
+typedef bool (*CCFastEqualFN) (Datum a, Datum b);
+
 typedef struct catcache
 {
 	int			id;				/* cache identifier --- see syscache.h */
-	slist_node	cc_next;		/* list link */
+	int			cc_nbuckets;	/* # of hash buckets in this cache */
+	TupleDesc	cc_tupdesc;		/* tuple descriptor (copied from reldesc) */
+	dlist_head *cc_bucket;		/* hash buckets */
+	CCHashFN	cc_hashfunc[CATCACHE_MAXKEYS];	/* hash function for each key */
+	CCFastEqualFN cc_fastequal[CATCACHE_MAXKEYS];	/* fast equal function for each key */
+	int			cc_keyno[CATCACHE_MAXKEYS];	/* AttrNumber of each key */
+	dlist_head	cc_lists;		/* list of CatCList structs */
+	int			cc_ntup;		/* # of tuples currently in this cache */
+	int			cc_nkeys;		/* # of keys (1..CATCACHE_MAXKEYS) */
 	const char *cc_relname;		/* name of relation the tuples come from */
 	Oid			cc_reloid;		/* OID of relation the tuples come from */
 	Oid			cc_indexoid;	/* OID of index matching cache keys */
 	bool		cc_relisshared; /* is relation shared across databases? */
-	TupleDesc	cc_tupdesc;		/* tuple descriptor (copied from reldesc) */
-	int			cc_ntup;		/* # of tuples currently in this cache */
-	int			cc_nbuckets;	/* # of hash buckets in this cache */
-	int			cc_nkeys;		/* # of keys (1..CATCACHE_MAXKEYS) */
-	int			cc_key[CATCACHE_MAXKEYS];	/* AttrNumber of each key */
-	PGFunction	cc_hashfunc[CATCACHE_MAXKEYS];	/* hash function for each key */
+	slist_node	cc_next;		/* list link */
 	ScanKeyData cc_skey[CATCACHE_MAXKEYS];	/* precomputed key info for heap
 											 * scans */
-	bool		cc_isname[CATCACHE_MAXKEYS];	/* flag "name" key columns */
-	dlist_head	cc_lists;		/* list of CatCList structs */
-	dlist_head *cc_bucket;		/* hash buckets */
-
 	/*
 	 * Keep these at the end, so that compiling catcache.c with CATCACHE_STATS
 	 * doesn't break ABI for other modules
@@ -79,7 +84,14 @@ typedef struct catctup
 {
 	int			ct_magic;		/* for identifying CatCTup entries */
 #define CT_MAGIC   0x57261502
-	CatCache   *my_cache;		/* link to owning catcache */
+
+	uint32		hash_value;		/* hash value for this tuple's keys */
+
+	/*
+	 * Lookup keys for the entry. By-reference datums point into the tuple for
+	 * positive cache entries, and are separately allocated for negative ones.
+	 */
+	Datum		keys[CATCACHE_MAXKEYS];
 
 	/*
 	 * Each tuple in a cache is a member of a dlist that stores the elements
@@ -88,15 +100,6 @@ typedef struct catctup
 	 */
 	dlist_node	cache_elem;		/* list member of per-bucket list */
 
-	/*
-	 * The tuple may also be a member of at most one CatCList.  (If a single
-	 * catcache is list-searched with varying numbers of keys, we may have to
-	 * make multiple entries for the same tuple because of this restriction.
-	 * Currently, that's not expected to be common, so we accept the potential
-	 * inefficiency.)
-	 */
-	struct catclist *c_list;	/* containing CatCList, or NULL if none */
-
 	/*
 	 * A tuple marked "dead" must not be returned by subsequent searches.
 	 * However, it won't be physically deleted from the cache until its
@@ -112,46 +115,64 @@ typedef struct catctup
 	int			refcount;		/* number of active references */
 	bool		dead;			/* dead but not yet removed? */
 	bool		negative;		/* negative cache entry? */
-	uint32		hash_value;		/* hash value for this tuple's keys */
 	HeapTupleData tuple;		/* tuple management header */
+
+	/*
+	 * The tuple may also be a member of at most one CatCList.  (If a single
+	 * catcache is list-searched with varying numbers of keys, we may have to
+	 * make multiple entries for the same tuple because of this restriction.
+	 * Currently, that's not expected to be common, so we accept the potential
+	 * inefficiency.)
+	 */
+	struct catclist *c_list;	/* containing CatCList, or NULL if none */
+
+	CatCache   *my_cache;		/* link to owning catcache */
+	/* properly aligned tuple data follows, unless a negative entry */
 } CatCTup;
 
 
+/*
+ * A CatCList describes the result of a partial search, ie, a search using
+ * only the first K key columns of an N-key cache.  We store the keys used
+ * into the keys attribute (with other attributes NULL) to represent the
+ * stored key set.  The CatCList object contains links to cache entries
+ * for all the table rows satisfying the partial key.  (Note: none of
+ * these will be negative cache entries.)
+ *
+ * A CatCList is only a member of a per-cache list; we do not currently
+ * divide them into hash buckets.
+ *
+ * A list marked "dead" must not be returned by subsequent searches.
+ * However, it won't be physically deleted from the cache until its
+ * refcount goes to zero.  (A list should be marked dead if any of its
+ * member entries are dead.)
+ *
+ * If "ordered" is true then the member tuples appear in the order of the
+ * cache's underlying index.  This will be true in normal operation, but
+ * might not be true during bootstrap or recovery operations. (namespace.c
+ * is able to save some cycles when it is true.)
+ */
 typedef struct catclist
 {
 	int			cl_magic;		/* for identifying CatCList entries */
 #define CL_MAGIC   0x52765103
-	CatCache   *my_cache;		/* link to owning catcache */
+
+	uint32		hash_value;		/* hash value for lookup keys */
+
+	dlist_node	cache_elem;		/* list member of per-catcache list */
 
 	/*
-	 * A CatCList describes the result of a partial search, ie, a search using
-	 * only the first K key columns of an N-key cache.  We form the keys used
-	 * into a tuple (with other attributes NULL) to represent the stored key
-	 * set.  The CatCList object contains links to cache entries for all the
-	 * table rows satisfying the partial key.  (Note: none of these will be
-	 * negative cache entries.)
-	 *
-	 * A CatCList is only a member of a per-cache list; we do not currently
-	 * divide them into hash buckets.
-	 *
-	 * A list marked "dead" must not be returned by subsequent searches.
-	 * However, it won't be physically deleted from the cache until its
-	 * refcount goes to zero.  (A list should be marked dead if any of its
-	 * member entries are dead.)
-	 *
-	 * If "ordered" is true then the member tuples appear in the order of the
-	 * cache's underlying index.  This will be true in normal operation, but
-	 * might not be true during bootstrap or recovery operations. (namespace.c
-	 * is able to save some cycles when it is true.)
+	 * Lookup keys for the entry, with the first nkeys elements being
+	 * valid. All by-reference datums are separately allocated.
 	 */
-	dlist_node	cache_elem;		/* list member of per-catcache list */
+	Datum		keys[CATCACHE_MAXKEYS];
+
 	int			refcount;		/* number of active references */
 	bool		dead;			/* dead but not yet removed? */
 	bool		ordered;		/* members listed in index order? */
 	short		nkeys;			/* number of lookup keys specified */
-	uint32		hash_value;		/* hash value for lookup keys */
-	HeapTupleData tuple;		/* header for tuple holding keys */
 	int			n_members;		/* number of member tuples */
+	CatCache   *my_cache;		/* link to owning catcache */
 	CatCTup    *members[FLEXIBLE_ARRAY_MEMBER]; /* members */
 } CatCList;
 
@@ -174,8 +195,15 @@ extern CatCache *InitCatCache(int id, Oid reloid, Oid indexoid,
 extern void InitCatCachePhase2(CatCache *cache, bool touch_index);
 
 extern HeapTuple SearchCatCache(CatCache *cache,
-			   Datum v1, Datum v2,
-			   Datum v3, Datum v4);
+			   Datum v1, Datum v2, Datum v3, Datum v4);
+extern HeapTuple SearchCatCache1(CatCache *cache,
+			   Datum v1);
+extern HeapTuple SearchCatCache2(CatCache *cache,
+			   Datum v1, Datum v2);
+extern HeapTuple SearchCatCache3(CatCache *cache,
+			   Datum v1, Datum v2, Datum v3);
+extern HeapTuple SearchCatCache4(CatCache *cache,
+			   Datum v1, Datum v2, Datum v3, Datum v4);
 extern void ReleaseCatCache(HeapTuple tuple);
 
 extern uint32 GetCatCacheHashValue(CatCache *cache,
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index 8a92ea27ac..12bda02cd7 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -117,6 +117,20 @@ extern void InitCatalogCachePhase2(void);
 
 extern HeapTuple SearchSysCache(int cacheId,
 			   Datum key1, Datum key2, Datum key3, Datum key4);
+
+/*
+ * The use of the argument-count-specific functions is encouraged, as they're
+ * faster and insulate the caller from changes in the maximum number of keys.
+ */
+extern HeapTuple SearchSysCache1(int cacheId,
+			   Datum key1);
+extern HeapTuple SearchSysCache2(int cacheId,
+			   Datum key1, Datum key2);
+extern HeapTuple SearchSysCache3(int cacheId,
+			   Datum key1, Datum key2, Datum key3);
+extern HeapTuple SearchSysCache4(int cacheId,
+			   Datum key1, Datum key2, Datum key3, Datum key4);
+
 extern void ReleaseSysCache(HeapTuple tuple);
 
 /* convenience routines */
@@ -156,15 +170,6 @@ extern bool RelationSupportsSysCache(Oid relid);
  * functions is encouraged, as it insulates the caller from changes in the
  * maximum number of keys.
  */
-#define SearchSysCache1(cacheId, key1) \
-	SearchSysCache(cacheId, key1, 0, 0, 0)
-#define SearchSysCache2(cacheId, key1, key2) \
-	SearchSysCache(cacheId, key1, key2, 0, 0)
-#define SearchSysCache3(cacheId, key1, key2, key3) \
-	SearchSysCache(cacheId, key1, key2, key3, 0)
-#define SearchSysCache4(cacheId, key1, key2, key3, key4) \
-	SearchSysCache(cacheId, key1, key2, key3, key4)
-
 #define SearchSysCacheCopy1(cacheId, key1) \
 	SearchSysCacheCopy(cacheId, key1, 0, 0, 0)
 #define SearchSysCacheCopy2(cacheId, key1, key2) \
-- 
2.14.1.536.g6867272d5b.dirty

Attachments: pgbench-many-cols.sql, create_many_cols.sql
#4Andres Freund
andres@anarazel.de
In reply to: amul sul (#2)
Re: Improve catcache/syscache performance.

Hi,

On 2017-09-20 18:26:50 +0530, amul sul wrote:

> Patch 0007:

Thanks for looking!

> 1:
> 400 +       /*
> 401 +        * XXX: might be worthwhile to only handle oid sysattr, to reduce
> 402 +        * overhead - it's the most common key.
> 403 +        */
>
> IMHO, let's fix that as well. I tested this by fixing it (see the attached
> patch) but did not find much gain on my local centos vm (of course, the
> pgbench load wasn't big enough).

I ended up with a bigger patch, that removes all extractions from
tuples, by storing the extracted column in an array.

> 2: How about wrapping the following condition in SearchCatCacheMiss() in
> unlikely():
>
>     if (IsBootstrapProcessingMode())
>         return NULL;

Given this is the cache miss case, I can't get excited about it -
there's several 100ks of cycles to access the heap via an indexscan...

> 3: Can we have the following assert in SearchCatCacheN() instead of
> SearchSysCacheN(), so that we'll assert direct SearchCatCacheN() calls
> as well?
>
>     Assert(SysCache[cacheId]->cc_nkeys == <N>);

Done, although I kept the others too.

> Other than these concerns, the patch looks pretty reasonable to me.

I'd appreciate if you could have a look at the new version as well.

Regards,

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: Improve catcache/syscache performance.

On Fri, Sep 22, 2017 at 2:15 AM, Andres Freund <andres@anarazel.de> wrote:

> This results in making cache lookups themselves roughly three times as
> fast - full-system benchmarks obviously improve less than that.

Wow. That's really good.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#6tushar
tushar.ahuja@enterprisedb.com
In reply to: Andres Freund (#3)
Re: Improve catcache/syscache performance.

On 09/22/2017 11:45 AM, Andres Freund wrote:

> Here's a variant that cleans up the previous changes a bit, and adds
> some further improvements:

I tested with different pgbench options comparing master vs. the patch and found an improvement. I have applied the 001 and 003 patches on PG head; patch 0002 was already committed.

Virtual Machine configuration - Centos 6.5 x64 / 16 GB RAM / 8 VCPU core processor

Scaling factor=30

pgbench -M prepared -T 200 postgres

PG Head - tps = 902.225954 (excluding connections establishing).
PG HEAD+patch - tps = 1001.896381 (10.97+% vs. head)

pgbench -M prepared -T 300 postgres

PG Head - tps = 920.108333 (excluding connections establishing).
PG HEAD+patch - tps = 1023.89542 (11.19+% vs. head)

pgbench -M prepared -T 500 postgres

PG Head - tps = 995.178227 (excluding connections establishing)
PG HEAD+patch - tps = 1078.32222 (+8.34% vs. head)

Later I modified the create_many_cols.sql file (previously attached) and
instead of only using int, I mixed it with varchar/int4/numeric/float and ran
pgbench with different time durations.

pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 300 postgres

PG Head - tps = 5540.143877 (excluding connections establishing).
PG HEAD+patch - tps = 5679.713493 (2.50+% vs. head)

pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 500 postgres

PG Head - tps = 5519.212709 (excluding connections establishing).
PG HEAD+patch - tps = 5967.059155 (8.11+% vs. head)

pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 700 postgres

PG Head - tps = 5640.314495(excluding connections establishing).
PG HEAD+patch - tps = 6012.223147 (6.59+% vs. head)

--
regards, tushar
EnterpriseDB: https://www.enterprisedb.com/
The Enterprise PostgreSQL Company


#7amul sul
sulamul@gmail.com
In reply to: Andres Freund (#4)
Re: Improve catcache/syscache performance.

On Fri, Sep 22, 2017 at 11:47 AM, Andres Freund <andres@anarazel.de> wrote:

> Hi,
>
> On 2017-09-20 18:26:50 +0530, amul sul wrote:
>> Patch 0007:
>
>> Other than these concerns, the patch looks pretty reasonable to me.
>
> I'd appreciate if you could have a look at the new version as well.

I have reviewed the newer version; looks good to me.

Here are a few cosmetic comments for the 0003 patch:

1.
982 +		ct->tuple.t_data = (HeapTupleHeader)
983 +			MAXALIGN(((char *) ct) + sizeof(CatCTup));

It would be nice if you added a comment for this address alignment.

2.
947 	/*
948 -	 * If there are any out-of-line toasted fields in the tuple, expand them
949 -	 * in-line.  This saves cycles during later use of the catcache entry, and
950 -	 * also protects us against the possibility of the toast tuples being
951 -	 * freed before we attempt to fetch them, in case of something using a
952 -	 * slightly stale catcache entry.
953 	 */

An empty comment block is left in CatalogCacheCreateEntry().

3.

 411 +/*
 412 + * CatalogCacheCompareTuple
 413 + *
 414 + * Compare a tuple to the passed arguments.
 415 + */
 416 +static inline bool
 417 +CatalogCacheCompareTuple(const CatCache *cache, int nkeys,
 418 +                         const Datum *cachekeys,
 419 +                         const Datum *searchkeys)

The prolog comment needs an adjustment.

Regards,
Amul

#8Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: tushar (#6)
Re: Improve catcache/syscache performance.

On 09/26/2017 06:41 AM, tushar wrote:

On 09/22/2017 11:45 AM, Andres Freund wrote:

Here's a variant that cleans up the previous changes a bit, and adds
some further improvements:

I tested with different pgbench options with master v/s patch and found
an improvement. I have applied 001 and 003 patch on PG Head; patch 0002
was already committed.

Virtual Machine configuration - CentOS 6.5 x64 / 16 GB RAM / 8 VCPU core
processor

Scaling factor=30

pgbench -M prepared -T 200 postgres

PG Head       - tps = 902.225954 (excluding connections establishing).
PG HEAD+patch - tps = 1001.896381 (+10.97% vs. head)

pgbench -M prepared -T 300 postgres

PG Head       - tps = 920.108333 (excluding connections establishing).
PG HEAD+patch - tps = 1023.89542 (+11.19% vs. head)

pgbench -M prepared -T 500 postgres

PG Head       - tps = 995.178227 (excluding connections establishing)
PG HEAD+patch - tps = 1078.32222 (+8.34% vs. head)

Later I modified the create_many_cols.sql file (previously attached) and
instead of only using int, I mixed it with varchar/int4/numeric/float and
ran pgbench with different time durations.

pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 300 postgres

PG Head       - tps = 5540.143877 (excluding connections establishing).
PG HEAD+patch - tps = 5679.713493 (+2.50% vs. head)

pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 500 postgres

PG Head       - tps = 5519.212709 (excluding connections establishing).
PG HEAD+patch - tps = 5967.059155 (+8.11% vs. head)

pgbench -M prepared -f /tmp/pgbench-many-cols.sql -T 700 postgres

PG Head       - tps = 5640.314495 (excluding connections establishing).
PG HEAD+patch - tps = 6012.223147 (+6.59% vs. head)

I'm also seeing a speedup on a 2S/28C/56T/256Gb + 2 x RAID10 SSD machine
using -M prepared, -M prepared -N and -M prepared -S scenarios with
various scale factors, and custom queries.

Small typo in 0002- / commit 791961:

diff --git a/src/include/utils/hashutils.h b/src/include/utils/hashutils.h
index 35281689e8..366bd0e78b 100644
--- a/src/include/utils/hashutils.h
+++ b/src/include/utils/hashutils.h
@@ -22,7 +22,7 @@ hash_combine(uint32 a, uint32 b)
  /*
- * Simple inline murmur hash implementation hashing a 32 bit ingeger, for
+ * Simple inline murmur hash implementation hashing a 32 bit integer, for
   * performance.
   */
  static inline uint32

Best regards,
Jesper


#9Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: Improve catcache/syscache performance.

On Thu, Sep 14, 2017 at 2:12 AM, Andres Freund <andres@anarazel.de> wrote:

This patch gives me roughly 8% speedup in a workload that consists out
of a fast query that returns a lot of columns. If I apply a few
other performance patches, this patch itself starts to make a bigger
difference, of around 11%.

I did a read-through of this patch today. I don't see anything really
serious to complain about here. Somebody might question the use of
the no-inline stuff, but it seems sensible to me in this context.

+ /* not as performance critical & "complicated" */

This comment kinda sucks. I don't think it will be clear to somebody
in 3 years what this means. It's clear enough in context but later I
think it won't be. I suggest dumping this altogether and expanding
the comment up above to encompass this:

Hash and equality functions for system types that are used as cache
key fields. In some cases, we just call the regular SQL-callable
functions for the appropriate data type, but that tends to be a little
slow, and the speed of these functions is performance-critical.
Therefore, for data types that frequently occur as catcache keys, we
hard-code the logic here. Avoiding the overhead of
DirectFunctionCallN(...) is a substantial win, and in certain cases
(like int4) we can adopt a faster hash algorithm as well.

+        {
+            return false;
+        }

Excess braces.

+ * The use of argument specific numbers is encouraged, they're faster, and
+ * insulates the caller from changes in the maximum number of keys.

s/, they're faster/. They're faster/

-    if (cache->cc_tupdesc == NULL)
+    if (unlikely(cache->cc_tupdesc == NULL))
         CatalogCacheInitializeCache(cache);

I don't think it's this patch's job to do it, but it seems like we
ought to just invent some early-initialization step where things like
this can happen, so that we don't have to branch here at all.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#9)
Re: Improve catcache/syscache performance.

Hi,

On 2017-10-13 13:06:41 -0400, Robert Haas wrote:

On Thu, Sep 14, 2017 at 2:12 AM, Andres Freund <andres@anarazel.de> wrote:

This patch gives me roughly 8% speedup in a workload that consists out
of a fast query that returns a lot of columns. If I apply a few
other performance patches, this patch itself starts to make a bigger
difference, of around 11%.

I did a read-through of this patch today.

Thanks!

+ /* not as performance critical & "complicated" */

This comment kinda sucks. I don't think it will be clear to somebody
in 3 years what this means. It's clear enough in context but later I
think it won't be. I suggest dumping this altogether and expanding
the comment up above to encompass this:

Hash and equality functions for system types that are used as cache
key fields. In some cases, we just call the regular SQL-callable
functions for the appropriate data type, but that tends to be a little
slow, and the speed of these functions is performance-critical.
Therefore, for data types that frequently occur as catcache keys, we
hard-code the logic here. Avoiding the overhead of
DirectFunctionCallN(...) is a substantial win, and in certain cases
(like int4) we can adopt a faster hash algorithm as well.

K.

-    if (cache->cc_tupdesc == NULL)
+    if (unlikely(cache->cc_tupdesc == NULL))
CatalogCacheInitializeCache(cache);

I don't think it's this patch's job to do it, but it seems like we
ought to just invent some early-initialization step where things like
this can happen, so that we don't have to branch here at all.

Yea, that'd be nice - it does show up in profiles. I'd tried to do
that, but it's not exactly trivial, so I decided to delay it. The
ordering when syscache stuff gets initialized is, uh, fragile. During
InitCatalogCache() the catalog is not ready for access. After that, we
access catcaches while not all catalogs are quite ready to be accessed.

Greetings,

Andres Freund


#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: Improve catcache/syscache performance.

Andres Freund <andres@anarazel.de> writes:

On 2017-10-13 13:06:41 -0400, Robert Haas wrote:

I don't think it's this patch's job to do it, but it seems like we
ought to just invent some early-initialization step where things like
this can happen, so that we don't have to branch here at all.

Yea, that'd be nice - it does show up in profiles. I'd tried to do
that, but it's not exactly trivial, so I decided to delay it. The
ordering when syscache stuff gets initialized is, uh, fragile. During
InitCatalogCache() the catalog is not ready for access. After that, we
access catcaches while not all catalogs are quite ready to be accessed.

Yeah, I think the fragility vs. benefit tradeoff here is not very
promising.

One idea might be to see if we can precalculate all the control data
needed for the caches and set it up as compile-time constants,
a la Gen_fmgrtab.pl, rather than reading it from the catalogs during
startup. That would make the code less dependent on initialization
order rather than more so.

regards, tom lane


#12Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: Improve catcache/syscache performance.

On 2017-10-13 14:07:54 -0400, Tom Lane wrote:

One idea might be to see if we can precalculate all the control data
needed for the caches and set it up as compile-time constants,
a la Gen_fmgrtab.pl, rather than reading it from the catalogs during
startup. That would make the code less dependent on initialization
order rather than more so.

Hm. That sounds somewhat enticing. You're thinking of doing so for
catcaches alone, or something grander, including the relcaches? I'd
assume the former?

For catcaches the hardest part probably is that we need a TupleDesc. Per
type function lookups, oids, should be fairly easy in contrast.

Greetings,

Andres Freund


#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#12)
Re: Improve catcache/syscache performance.

Andres Freund <andres@anarazel.de> writes:

On 2017-10-13 14:07:54 -0400, Tom Lane wrote:

One idea might be to see if we can precalculate all the control data
needed for the caches and set it up as compile-time constants,
a la Gen_fmgrtab.pl, rather than reading it from the catalogs during
startup. That would make the code less dependent on initialization
order rather than more so.

Hm. That sounds somewhat enticing. You're thinking of doing so for
catcaches alone, or something grander, including the relcaches? I'd
assume the former?

Yeah. The relcaches are basically impossible to precalculate as constants
because they contain run-time variable data such as relfilenode. I might
be wrong because I didn't go look, but offhand I think there is nothing in
the catcache control data that couldn't be predetermined.

For catcaches the hardest part probably is that we need a TupleDesc. Per
type function lookups, oids, should be fairly easy in contrast.

We already have a mechanism for precalculated TupleDescs for system
catalogs, cf src/backend/catalog/schemapg.h. Right now we only apply
that for a few "bootstrap" system catalogs, but I don't really see
a reason we couldn't use it for every catalog that has a catcache.

regards, tom lane


#14Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#10)
Re: Improve catcache/syscache performance.

Hi,

On 2017-10-13 10:38:47 -0700, Andres Freund wrote:

On 2017-10-13 13:06:41 -0400, Robert Haas wrote:

On Thu, Sep 14, 2017 at 2:12 AM, Andres Freund <andres@anarazel.de> wrote:

This patch gives me roughly 8% speedup in a workload that consists out
of a fast query that returns a lot of columns. If I apply a few
other performance patches, this patch itself starts to make a bigger
difference, of around 11%.

I did a read-through of this patch today.

Thanks!

Pushed after making the adaptations you suggested, pgindenting, and fixing
one bug (cstring lookups + datumCopy() = not good).

Greetings,

Andres Freund


#15Kyotaro HORIGUCHI
horiguchi.kyotaro@lab.ntt.co.jp
In reply to: Andres Freund (#14)
Re: [HACKERS] Improve catcache/syscache performance.

Hello,

At Fri, 13 Oct 2017 14:36:45 -0700, Andres Freund <andres@anarazel.de> wrote in <20171013213645.fuolk62xaedj3og7@alap3.anarazel.de>

Pushed after making the adaptations you suggested, pgindenting, and fixing
one bug (cstring lookups + datumCopy() = not good).

Greetings,

Andres Freund

The following CF entry (in "Needs review" state) is for this thread,
which has already been pushed. May I mark it as committed, or will
someone else do it?

https://commitfest.postgresql.org/15/1300/

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

#16Michael Paquier
michael.paquier@gmail.com
In reply to: Kyotaro HORIGUCHI (#15)
Re: [HACKERS] Improve catcache/syscache performance.

On Fri, Nov 17, 2017 at 1:13 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

At Fri, 13 Oct 2017 14:36:45 -0700, Andres Freund <andres@anarazel.de> wrote in <20171013213645.fuolk62xaedj3og7@alap3.anarazel.de>

Pushed after making the adaptations you suggested, pgindenting, and fixing
one bug (cstring lookups + datumCopy() = not good).

Greetings,

Andres Freund

The following CF entry (in "Needs review" state) is for this thread,
which has already been pushed. May I mark it as committed, or will
someone else do it?

https://commitfest.postgresql.org/15/1300/

Done. Please make sure to mark registered patches as committed once they
are committed; this makes the life of a CFM a bit easier.
--
Michael