RFC: Improve CPU cache locality of syscache searches

Started by John Naylor · over 4 years ago · 14 messages
#1 John Naylor
john.naylor@enterprisedb.com

CPU caches have multiple levels, so I had an idea to use that concept in
the syscaches.

Imagine if catcache buckets were cacheline-sized. In that case, we would
store the most recently accessed hash values and pointers to catctup, in
addition to the dlist_head:

typedef struct cc_bucket
{
uint32 hashes[4];
catctup *ct[4];
dlist_head conflicts;
} cc_bucket;

You can think of the bucket here as a 4-way set associative L1 and the list
walk as an inclusive L2. There might be an existing term for this scheme,
but I don't know what it is offhand.

Creating entries:

Instead of calling dlist_push_head(), we call dlist_push_tail() and then
stash the entry in the L1 so it's still quickly available if it's accessed
in the near future. This way, older entries evicted from L1 will tend to
remain toward the front of the list. Walking the list will always go from
oldest to newest, which might have better prefetch behavior (not sure).

Search:

Buckets with <= 4 entries would only need to access a single cacheline to
get the catctup pointer with the matching hash value. If we have to walk
the list to find a match, we stash it in the L1, which is faster than
calling dlist_move_head().

L1 Eviction:

When putting an entry here, we memmove() everything down in each array and
place the pointer and the hash value in the front, evicting the fourth
(possibly NULL) entry.

The buckets would now be 4 times larger on 64-bit machines, but that might
not be a problem if we just divide the initial bucket size by four as well.
If the minimum nbuckets is 2, then the smaller caches would waste a bit of
space, but it doesn't seem too bad. As far as when to rehash the bucket
array, the fill factor would logically jump from 2 to 8. The worst-case
list walk would be longer, but with a better overall memory access pattern,
it might be fine.

I think it would be straightforward to code this up -- the difficulty is
testing and accounting for random effects due to binary layout changes.
Thoughts?

--
John Naylor
EDB: http://www.enterprisedb.com

#2 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#1)
Re: RFC: Improve CPU cache locality of syscache searches

Hi,

On 2021-08-04 12:39:29 -0400, John Naylor wrote:

CPU caches have multiple levels, so I had an idea to use that concept in
the syscaches.

I do think we lose a good bit of performance to syscache inefficiency in real
workloads, so I think it's worth investing time into it.

However:

Imagine if catcache buckets were cacheline-sized. In that case, we would
store the most recently accessed hash values and pointers to catctup, in
addition to the dlist_head:

typedef struct cc_bucket
{
uint32 hashes[4];
catctup *ct[4];
dlist_head conflicts;
} cc_bucket;

I'm not convinced that the above is the right idea though. Even if the hash
matches, you're still going to need to fetch at least catctup->keys[0] from
a separate cacheline to be able to return the cache entry.

If the hashes we use are decent and we're resizing at reasonable thresholds,
we shouldn't constantly look up values that are later in a collision chain -
particularly because we move cache hits to the head of the bucket. So I don't
think a scheme as you described would actually result in a meaningfully
better cache hit ratio?

ISTM that something like

struct cc_bucket_1
{
uint32 hashes[3]; // 12
// 4 bytes alignment padding
Datum key0s[3]; // 24
catctup *ct[3]; // 24
// cacheline boundary
dlist_head conflicts; // 16
};

would be better for 1-key syscaches?

It's obviously annoying to need different bucket types for different key
counts, but given how much 3 unused key Datums waste, it seems worth paying
for?
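
To cross-check the byte counts in the comments above (just a sketch,
assuming the usual postgres.h / lib/ilist.h environment):

typedef struct cc_bucket_1
{
    uint32      hashes[3];      /* 12 bytes */
    /* 4 bytes of alignment padding: Datum is 8-byte aligned */
    Datum       key0s[3];       /* 24 bytes */
    struct catctup *ct[3];      /* 24 bytes -- first cacheline ends here */
    dlist_head  conflicts;      /* 16 bytes */
} cc_bucket_1;

StaticAssertDecl(offsetof(cc_bucket_1, conflicts) == 64,
                 "inline data fills exactly one cacheline");
StaticAssertDecl(sizeof(cc_bucket_1) == 80,
                 "one cacheline plus the conflicts head");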

One issue with stuff like this is that aset.c won't return cacheline aligned
values...

It's possible that it'd be better to put the catctup pointers onto a
neighboring cacheline (so you still get TLB benefits, as well as making it easy
for prefetchers) and increase the number of hashes/keys that fit on one
cacheline.

If we stuffed four values into one bucket we could potentially SIMD the hash
and Datum comparisons ;)
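
The hash half of that would already be easy with plain SSE2 -- a sketch,
not a proposal:

#include <emmintrin.h>          /* SSE2 */

/*
 * Return a bitmask of slots in hashes[4] equal to hashValue; bit 4*i
 * set means slot i matched.  The keys still need the per-type equality
 * check afterwards.
 */
static inline int
cc_hash_match_mask(const uint32 hashes[4], uint32 hashValue)
{
    __m128i     needle = _mm_set1_epi32((int) hashValue);
    __m128i     haystack = _mm_loadu_si128((const __m128i *) hashes);

    return _mm_movemask_epi8(_mm_cmpeq_epi32(haystack, needle));
}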

Greetings,

Andres Freund

#3 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#2)
Re: RFC: Improve CPU cache locality of syscache searches

On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:

On 2021-08-04 12:39:29 -0400, John Naylor wrote:

typedef struct cc_bucket
{
uint32 hashes[4];
catctup *ct[4];
dlist_head conflicts;
} cc_bucket;

I'm not convinced that the above is the right idea though. Even if the
hash matches, you're still going to need to fetch at least
catctup->keys[0] from a separate cacheline to be able to return the
cache entry.

I see your point. It doesn't make sense to inline only part of the
information needed.

struct cc_bucket_1
{
uint32 hashes[3]; // 12
// 4 bytes alignment padding
Datum key0s[3]; // 24
catctup *ct[3]; // 24
// cacheline boundary
dlist_head conflicts; // 16
};

would be better for 1-key syscaches?

It's obviously annoying to need different bucket types for different key
counts, but given how much 3 unused key Datums waste, it seems worth
paying for?

Yeah, it's annoying, but it does make a big difference to keep out unused
Datums:

       cachelines
keys   3 values   4 values

1      1 1/4      1 1/2
2      1 5/8      2
3      2          2 1/2
4      2 3/8      3

Or, looking at it another way, limiting the bucket size to 2 cachelines, we
can fit:

keys   values
1      5
2      4
3      3
4      2
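
(Cross-checking one row, assuming 8-byte Datums/pointers, a 16-byte
dlist_head, and 64-byte lines: 2 keys with 3 inline values is 3 * 4 bytes
of hashes + 4 bytes padding + 3 * 2 * 8 bytes of keys + 3 * 8 bytes of
pointers + 16 bytes of list head = 104 bytes = 1 5/8 cachelines; the
second table just caps that sum at 128 bytes.)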

Although I'm guessing inlining just two values in the 4-key case wouldn't
buy much.

If we stuffed four values into one bucket we could potentially SIMD the
hash and Datum comparisons ;)

;-) That's an interesting future direction to consider when we support
building with x86-64-v2. It'd be pretty easy to compare a vector of hashes
and quickly get the array index for the key comparisons (ignoring for the
moment how to handle the rare case of multiple identical hashes).
However, we currently don't memcmp() the Datums and instead call an
"eqfast" function, so I don't see how that part would work in a vector
setting.

--
John Naylor
EDB: http://www.enterprisedb.com

#4 Andres Freund
andres@anarazel.de
In reply to: John Naylor (#3)
Re: RFC: Improve CPU cache locality of syscache searches

Hi,

On 2021-08-05 12:27:49 -0400, John Naylor wrote:

On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:

On 2021-08-04 12:39:29 -0400, John Naylor wrote:

typedef struct cc_bucket
{
uint32 hashes[4];
catctup *ct[4];
dlist_head conflicts;
} cc_bucket;

I'm not convinced that the above is the right idea though. Even if the
hash matches, you're still going to need to fetch at least
catctup->keys[0] from a separate cacheline to be able to return the
cache entry.

I see your point. It doesn't make sense to inline only part of the
information needed.

At least not for the unconditionally needed information.

Although I'm guessing inlining just two values in the 4-key case wouldn't
buy much.

Not so sure about that. I'd guess that two key comparisons take more cycles
than a cacheline fetch for the further keys (perhaps not if we had inlined key
comparisons). I.e. I'd expect out-of-order + speculative execution to hide the
latency for fetching the second cacheline for later key values.

If we stuffed four values into one bucket we could potentially SIMD the
hash and Datum comparisons ;)

;-) That's an interesting future direction to consider when we support
building with x86-64-v2. It'd be pretty easy to compare a vector of hashes
and quickly get the array index for the key comparisons (ignoring for the
moment how to handle the rare case of multiple identical hashes).
However, we currently don't memcmp() the Datums and instead call an
"eqfast" function, so I don't see how that part would work in a vector
setting.

It definitely couldn't work unconditionally - we have to deal with text
and oidvector comparisons, after all. But we could use it for the other
types. However, looking at the syscaches, I think it'd not very often be
applicable for caches with enough columns.

I have wondered before whether we should have syscache definitions generate
code specific to each syscache definition. I do think that'd give a good bit
of performance boost. But I don't see a trivial way to get there without
notational overhead.

We could define syscaches in a header using a macro that's defined differently
in syscache.c than everywhere else. The header would declare a set of
functions for each syscache, syscache.c would define them to call an
always_inline function with the relevant constants.
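
A sketch of that shape, with all names invented:

/* syscache_fns.h: included everywhere; syscache.c redefines the macro
 * to emit specialized bodies instead of declarations. */
#ifndef SYSCACHE_SEARCHFN
#define SYSCACHE_SEARCHFN(name, nkeys) \
    extern HeapTuple SearchSysCache_##name(Datum k1, Datum k2, \
                                           Datum k3, Datum k4);
#endif

SYSCACHE_SEARCHFN(PROCOID, 1)
SYSCACHE_SEARCHFN(STATRELATTINH, 3)

/* in syscache.c, before including the header: */
#define SYSCACHE_SEARCHFN(name, nkeys) \
    HeapTuple \
    SearchSysCache_##name(Datum k1, Datum k2, Datum k3, Datum k4) \
    { \
        return SearchCatCacheInline(SysCache[name], (nkeys), \
                                    k1, k2, k3, k4); \
    }

where SearchCatCacheInline() would be marked pg_attribute_always_inline,
so the per-cache constants fold away in each generated body.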

Or perhaps we should move syscache definitions into the pg_*.h headers, and
generate the relevant code as part of their processing. That seems like it
could be nice from a modularity POV alone. And it could do better than the
current approach, because we could hardcode the types for columns in the
syscache definition without increasing notational overhead.

Greetings,

Andres Freund

#5 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#4)
Re: RFC: Improve CPU cache locality of syscache searches

Andres Freund wrote 2021-08-05 23:12:

Hi,

On 2021-08-05 12:27:49 -0400, John Naylor wrote:

On Wed, Aug 4, 2021 at 3:44 PM Andres Freund <andres@anarazel.de> wrote:

On 2021-08-04 12:39:29 -0400, John Naylor wrote:

typedef struct cc_bucket
{
uint32 hashes[4];
catctup *ct[4];
dlist_head conflicts;
} cc_bucket;

I'm not convinced that the above is the right idea though. Even if the
hash matches, you're still going to need to fetch at least
catctup->keys[0] from a separate cacheline to be able to return the
cache entry.

I see your point. It doesn't make sense to inline only part of the
information needed.

At least not for the unconditionally needed information.

Although I'm guessing inlining just two values in the 4-key case
wouldn't buy much.

Not so sure about that. I'd guess that two key comparisons take more
cycles than a cacheline fetch for the further keys (perhaps not if we
had inlined key comparisons). I.e. I'd expect out-of-order + speculative
execution to hide the latency for fetching the second cacheline for
later key values.

If we stuffed four values into one bucket we could potentially SIMD the
hash and Datum comparisons ;)

;-) That's an interesting future direction to consider when we support
building with x86-64-v2. It'd be pretty easy to compare a vector of
hashes and quickly get the array index for the key comparisons (ignoring
for the moment how to handle the rare case of multiple identical
hashes). However, we currently don't memcmp() the Datums and instead
call an "eqfast" function, so I don't see how that part would work in a
vector setting.

It definitely couldn't work unconditionally - we have to deal with text
and oidvector comparisons, after all. But we could use it for the other
types. However, looking at the syscaches, I think it'd not very often be
applicable for caches with enough columns.

I have wondered before whether we should have syscache definitions
generate code specific to each syscache definition. I do think that'd
give a good bit of performance boost. But I don't see a trivial way to
get there without notational overhead.

We could define syscaches in a header using a macro that's defined
differently in syscache.c than everywhere else. The header would declare
a set of functions for each syscache, syscache.c would define them to
call an always_inline function with the relevant constants.

Or perhaps we should move syscache definitions into the pg_*.h headers,
and generate the relevant code as part of their processing. That seems
like it could be nice from a modularity POV alone. And it could do
better than the current approach, because we could hardcode the types
for columns in the syscache definition without increasing notational
overhead.

Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.

The two could be combined: use hashValue as the key in a simplehash, with
a dlist per entry for hashValue collision handling. 99.99% of the time
there will be no collision on hashValue itself, so a search will almost
always be two-three memory lookups: one-two to find the dlist_head in the
simplehash by hashValue, and then fetching the first element of the
dlist.

And the code will remain almost the same; just the "bucket" search will
change a bit.

(And I'd recommend a lower fill factor for this simplehash -- at most
0.85.)
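
Roughly like this, reusing the existing simplehash template (a sketch,
untested):

/* 4 + 1 (+ 3 padding) + 16 = 24 bytes, as said above */
typedef struct cc_sh_entry
{
    uint32      hashValue;      /* the catcache hash, used as the key */
    char        status;         /* simplehash slot state */
    dlist_head  same_hash;      /* catctups sharing this hashValue */
} cc_sh_entry;

#define SH_PREFIX cc_sh
#define SH_ELEMENT_TYPE cc_sh_entry
#define SH_KEY_TYPE uint32
#define SH_KEY hashValue
#define SH_HASH_KEY(tb, key) (key)  /* the key is already a hash */
#define SH_EQUAL(tb, a, b) ((a) == (b))
#define SH_SCOPE static inline
#define SH_DECLARE
#define SH_DEFINE
#include "lib/simplehash.h"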

Well, a simplehash entry will be 24 bytes this way. If the simplehash
template supported external key/element storage, it could be shrunk to
16 bytes, and syscache entries would not need a dlist_node. (But it
doesn't at the moment.)

And a custom open-addressing table could get down to 8 bytes per element:
- an element is a pointer to an entry,
- a missing node is encoded as NULL,
- with a fill factor as low as 0.66 there will be a small number of
  collisions, so a non-empty slot will be the matching entry most of the
  time, and the memory lookup for the key comparison is amortized free.
Note that an 8-byte entry with fill factor 0.66 consumes an amortized
12.12 bytes, while a 16-byte entry with fill factor 0.99 consumes an
amortized 16.16 bytes.
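
(The amortized figure is just slot size divided by fill factor:
8/0.66 = 12.12 and 16/0.99 = 16.16. And by the textbook linear-probing
estimate, a successful lookup at load factor a costs about
(1 + 1/(1 - a))/2 probes, i.e. roughly 2 probes at a = 0.66.)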

regards,
Yura Sokolov

y.sokolov@postgrespro.ru
funny.falcon@gmail.com

#6 Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#5)
Re: RFC: Improve CPU cache locality of syscache searches

Hi,

On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:

Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.

I thought about that as well - but it doesn't really resolve the question of
what we want to store in-line in the hashtable and what not. We can't store
the tuples themselves in the hashtable for a myriad of reasons (need pointer
stability, they're variably sized, way too large to move around frequently).

Well, a simplehash entry will be 24 bytes this way. If the simplehash
template supported external key/element storage, it could be shrunk to 16
bytes, and syscache entries would not need a dlist_node. (But it doesn't
at the moment.)

I think storing keys outside of the hashtable entry defeats the purpose of the
open addressing, given that they are always checked and that our conflict
ratio should be fairly low.

Greetings,

Andres Freund

#7 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Andres Freund (#6)
Re: RFC: Improve CPU cache locality of syscache searches

Andres Freund wrote 2021-08-06 06:49:

Hi,

On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:

Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.

I thought about that as well - but it doesn't really resolve the
question of what we want to store in-line in the hashtable and what not.
We can't store the tuples themselves in the hashtable for a myriad of
reasons (need pointer stability, they're variably sized, way too large
to move around frequently).

Well, a simplehash entry will be 24 bytes this way. If the simplehash
template supported external key/element storage, it could be shrunk to
16 bytes, and syscache entries would not need a dlist_node. (But it
doesn't at the moment.)

I think storing keys outside of the hashtable entry defeats the purpose
of the open addressing, given that they are always checked and that our
conflict ratio should be fairly low.

It's the opposite: if the conflict ratio were high, then keys outside the
hashtable would be expensive, since a lookup that lands on a non-matching
key costs an extra memory access. But with a low conflict ratio we will
usually hit the matching entry on the first probe. And since we will use
the entry soon anyway, it doesn't matter whether it is pulled into the
CPU L1 cache during the lookup or during actual use.

regards,
Yura Sokolov

#8 Andres Freund
andres@anarazel.de
In reply to: Yura Sokolov (#7)
Re: RFC: Improve CPU cache locality of syscache searches

Hi,

On Thu, Aug 5, 2021, at 22:20, Yura Sokolov wrote:

Andres Freund wrote 2021-08-06 06:49:

Hi,

On 2021-08-06 06:43:55 +0300, Yura Sokolov wrote:

Why not use simplehash or something like that? Open-addressing schemes
show superior cache locality.

I thought about that as well - but it doesn't really resolve the
question of what we want to store in-line in the hashtable and what not.
We can't store the tuples themselves in the hashtable for a myriad of
reasons (need pointer stability, they're variably sized, way too large
to move around frequently).

Well, a simplehash entry will be 24 bytes this way. If the simplehash
template supported external key/element storage, it could be shrunk to
16 bytes, and syscache entries would not need a dlist_node. (But it
doesn't at the moment.)

I think storing keys outside of the hashtable entry defeats the purpose
of the open addressing, given that they are always checked and that our
conflict ratio should be fairly low.

It's the opposite: if the conflict ratio were high, then keys outside the
hashtable would be expensive, since a lookup that lands on a non-matching
key costs an extra memory access. But with a low conflict ratio we will
usually hit the matching entry on the first probe. And since we will use
the entry soon anyway, it doesn't matter whether it is pulled into the
CPU L1 cache during the lookup or during actual use.

Often enough it does matter, because there will be earlier dependencies on whether a lookup is a cache hit/miss than on the content of the cached tuple.

Regards,

Andres

#9 John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#4)
Re: RFC: Improve CPU cache locality of syscache searches

On Thu, Aug 5, 2021 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:

I have wondered before whether we should have syscache definitions
generate code specific to each syscache definition. I do think that'd
give a good bit of performance boost. But I don't see a trivial way to
get there without notational overhead.

We could define syscaches in a header using a macro that's defined
differently in syscache.c than everywhere else. The header would declare
a set of functions for each syscache, syscache.c would define them to
call an always_inline function with the relevant constants.

Or perhaps we should move syscache definitions into the pg_*.h headers,
and generate the relevant code as part of their processing. That seems
like it could be nice from a modularity POV alone. And it could do better
than the current approach, because we could hardcode the types for
columns in the syscache definition without increasing notational
overhead.

I had code lying around to generate the info needed for initialization,
but I apparently deleted it and never emailed it. :-(

If we went that far, I wonder if we could specialize the cc_bucket for the
actual types, rather than just number of Datums, which would further save
on space. More complex, though.

Also, if comparisons got cheaper, maybe we could get away with not storing
the full hash in the bucket in favor of a tag of the 16 highest bits. (The
catctup would still need the full hash for invalidation, at least).
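
E.g. something like this (a sketch; HASH_INDEX() consumes the low bits of
the hash to pick the bucket, so the high bits make a natural tag):

typedef uint16 cc_tag;

static inline cc_tag
cc_hash_tag(uint32 hashValue)
{
    return (cc_tag) (hashValue >> 16);
}

A tag match would then be confirmed against ct->hash_value and the keys
before returning, so a false positive only costs a wasted fetch.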

Another idea I came across while researching in-memory key-value stores
is "bulk chaining" -- see [1] for an example. In that case, our example
2-cacheline bucket would have a dlist_head pointing not to the catctups,
but to another bucket. So that throws away my L1/L2 analogy and uses a
chain of buckets, each of which contains multiple sets of
hashes/keys/pointers. That's closer to a normal collision chain, but with
better cache locality. If we did that, I imagine the catctup could
dispense with storing its Datum array and its dlist_node entirely.
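
For the 1-key case that might look like this (names invented; sizes
assume 8-byte Datums/pointers):

typedef struct cc_bulk_bucket
{
    uint32      hashes[5];              /* 20 bytes */
    /* 4 bytes padding */
    Datum       keys[5];                /* 40 bytes */
    struct catctup *ct[5];              /* 40 bytes */
    struct cc_bulk_bucket *overflow;    /* next bucket in chain, or NULL */
} cc_bulk_bucket;                       /* 112 bytes, within two cachelines */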

[1] https://www.usenix.org/system/files/conference/nsdi14/nsdi14-paper-lim.pdf

--
John Naylor
EDB: http://www.enterprisedb.com

#10 John Naylor
john.naylor@enterprisedb.com
In reply to: John Naylor (#9)
7 attachment(s)
Re: RFC: Improve CPU cache locality of syscache searches

OK, here is a hackish WIP to see if we get anywhere with the L1 concept:

0001 is extracted from a patch from Horiguchi-san to remove the "dead" flag
0002 adds the large bucket, but doesn't use it for anything
0003 uses the new bucket for the L1 cache
0004 changes when to rehash
0005 is Horiguchi-san's v7 benchmark
0006 removes expiration stuff from the benchmark and prevents alloc errors
with my patches while running the benchmark

This doesn't align the bucket array to a cacheline boundary, nor does it
change the initial number of buckets.

make -C contrib/catcachebench/ install
psql -c 'create extension catcachebench'

# This is also from Horiguchi-san
perl gen_tbl.pl | psql > /dev/null

# warmup
psql -c 'select catcachebench(0)'

# measure; take the median of 3 runs (results in milliseconds)

master:

psql -c 'select catcachebench(1)'
catcachebench
---------------
6084.992169

patch:

./inst/bin/psql -c 'select catcachebench(1)'
catcachebench
---------------
5508.072532

That's decent, but not exactly stellar. I get a huge slowdown in
catcachebench(2), however, so I'll have to dig into why before going any
further.

Some time I'll also try the function specialization Andres mentioned and
see how big of a speedup that gives.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v1-0002-Specialize-bucket-types-by-number-of-keys.patch
From a169a7fd7fb197f42260db2f79bffbc29bb49cfc Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 18 Aug 2021 12:03:46 -0400
Subject: [PATCH v1 2/6] Specialize bucket types by number of keys

Don't do anything special with the bucket types yet, that will
come in a later commit.
---
 src/backend/utils/cache/catcache.c | 40 ++++++++++++++++------------
 src/include/utils/catcache.h       | 42 +++++++++++++++++++++++++++++-
 2 files changed, 64 insertions(+), 18 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 47fb14e1db..55ea4b0339 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -556,6 +556,7 @@ void
 CatCacheInvalidate(CatCache *cache, uint32 hashValue)
 {
 	Index		hashIndex;
+	CCBucket   *bucket;
 	dlist_mutable_iter iter;
 
 	CACHE_elog(DEBUG2, "CatCacheInvalidate: called");
@@ -583,7 +584,8 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
 	 * inspect the proper hash bucket for tuple matches
 	 */
 	hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
-	dlist_foreach_modify(iter, &cache->cc_bucket[hashIndex])
+	bucket = &cache->cc_bucket[hashIndex];
+	dlist_foreach_modify(iter, &bucket->conflicts)
 	{
 		CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
@@ -664,9 +666,9 @@ ResetCatalogCache(CatCache *cache)
 	/* Remove each tuple in this cache, or at least mark it dead */
 	for (i = 0; i < cache->cc_nbuckets; i++)
 	{
-		dlist_head *bucket = &cache->cc_bucket[i];
+		CCBucket   *bucket = &cache->cc_bucket[i];
 
-		dlist_foreach_modify(iter, bucket)
+		dlist_foreach_modify(iter, &bucket->conflicts)
 		{
 			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
@@ -818,11 +820,12 @@ InitCatCache(int id,
 	/*
 	 * Allocate a new cache structure, aligning to a cacheline boundary
 	 *
-	 * Note: we rely on zeroing to initialize all the dlist headers correctly
+	 * Note: we rely on zeroing to initialize all the bucket elements
+	 * correctly
 	 */
 	sz = sizeof(CatCache) + PG_CACHE_LINE_SIZE;
 	cp = (CatCache *) CACHELINEALIGN(palloc0(sz));
-	cp->cc_bucket = palloc0(nbuckets * sizeof(dlist_head));
+	cp->cc_bucket = palloc0(nbuckets * sizeof(CCBucket));
 
 	/*
 	 * initialize the cache's relation information for the relation
@@ -866,7 +869,8 @@ InitCatCache(int id,
 static void
 RehashCatCache(CatCache *cp)
 {
-	dlist_head *newbucket;
+	CCBucket   *oldbucket = cp->cc_bucket;
+	CCBucket   *newbucket;
 	int			newnbuckets;
 	int			i;
 
@@ -875,20 +879,20 @@ RehashCatCache(CatCache *cp)
 
 	/* Allocate a new, larger, hash table. */
 	newnbuckets = cp->cc_nbuckets * 2;
-	newbucket = (dlist_head *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(dlist_head));
+	newbucket = (CCBucket *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(CCBucket));
 
 	/* Move all entries from old hash table to new. */
 	for (i = 0; i < cp->cc_nbuckets; i++)
 	{
 		dlist_mutable_iter iter;
 
-		dlist_foreach_modify(iter, &cp->cc_bucket[i])
+		dlist_foreach_modify(iter, &oldbucket[i].conflicts)
 		{
 			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
 			int			hashIndex = HASH_INDEX(ct->hash_value, newnbuckets);
 
 			dlist_delete(iter.cur);
-			dlist_push_head(&newbucket[hashIndex], &ct->cache_elem);
+			dlist_push_head(&newbucket[hashIndex].conflicts, &ct->cache_elem);
 		}
 	}
 
@@ -1215,7 +1219,7 @@ SearchCatCacheInternal(CatCache *cache,
 	uint32		hashValue;
 	Index		hashIndex;
 	dlist_iter	iter;
-	dlist_head *bucket;
+	CCBucket   *bucket;
 	CatCTup    *ct;
 
 	/* Make sure we're in an xact, even if this ends up being a cache hit */
@@ -1251,8 +1255,9 @@ SearchCatCacheInternal(CatCache *cache,
 	 * Note: it's okay to use dlist_foreach here, even though we modify the
 	 * dlist within the loop, because we don't continue the loop afterwards.
 	 */
-	bucket = &cache->cc_bucket[hashIndex];
-	dlist_foreach(iter, bucket)
+	bucket = &cache->cc_bucket[hashIndex];
+
+	dlist_foreach(iter, &bucket->conflicts)
 	{
 		ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
@@ -1268,7 +1273,7 @@ SearchCatCacheInternal(CatCache *cache,
 		 * most frequently accessed elements in any hashbucket will tend to be
 		 * near the front of the hashbucket's list.)
 		 */
-		dlist_move_head(bucket, &ct->cache_elem);
+		dlist_move_head(&bucket->conflicts, &ct->cache_elem);
 
 		/*
 		 * If it's a positive entry, bump its refcount and return it. If it's
@@ -1650,7 +1655,7 @@ SearchCatCacheList(CatCache *cache,
 			uint32		hashValue;
 			Index		hashIndex;
 			bool		found = false;
-			dlist_head *bucket;
+			CCBucket   *bucket;
 
 			/*
 			 * See if there's an entry for this tuple already.
@@ -1659,8 +1664,8 @@ SearchCatCacheList(CatCache *cache,
 			hashValue = CatalogCacheComputeTupleHashValue(cache, cache->cc_nkeys, ntp);
 			hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 
-			bucket = &cache->cc_bucket[hashIndex];
-			dlist_foreach(iter, bucket)
+			bucket = &cache->cc_bucket[hashIndex];
+			dlist_foreach(iter, &bucket->conflicts)
 			{
 				ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
@@ -1811,6 +1816,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	CatCTup    *ct;
 	HeapTuple	dtp;
 	MemoryContext oldcxt;
+	CCBucket   *bucket = &cache->cc_bucket[hashIndex];
 
 	/* negative entries have no tuple associated */
 	if (ntp)
@@ -1890,7 +1896,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	ct->negative = negative;
 	ct->hash_value = hashValue;
 
-	dlist_push_head(&cache->cc_bucket[hashIndex], &ct->cache_elem);
+	dlist_push_head(&bucket->conflicts, &ct->cache_elem);
 
 	cache->cc_ntup++;
 	CacheHdr->ch_ntup++;
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index eac6a83f29..b870816a72 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -41,12 +41,52 @@ typedef uint32 (*CCHashFN) (Datum datum);
 /* function computing equality of two datums */
 typedef bool (*CCFastEqualFN) (Datum a, Datum b);
 
+typedef struct bucket1
+{
+	uint32		hashes[5];
+	Datum		keys[5][1];
+	struct catctup *ct[5];
+} Bucket1;
+
+typedef struct bucket2
+{
+	uint32		hashes[4];
+	Datum		keys[4][2];
+	struct catctup *ct[4];
+} Bucket2;
+
+typedef struct bucket3
+{
+	uint32		hashes[3];
+	Datum		keys[3][3];
+	struct catctup *ct[3];
+} Bucket3;
+
+typedef struct bucket4
+{
+	uint32		hashes[2];
+	Datum		keys[2][4];
+	struct catctup *ct[2];
+} Bucket4;
+
+typedef struct cc_bucket
+{
+	union
+	{
+		Bucket1		bucket1;
+		Bucket2		bucket2;
+		Bucket3		bucket3;
+		Bucket4		bucket4;
+	};
+	dlist_head	conflicts;
+} CCBucket;
+
 typedef struct catcache
 {
 	int			id;				/* cache identifier --- see syscache.h */
 	int			cc_nbuckets;	/* # of hash buckets in this cache */
 	TupleDesc	cc_tupdesc;		/* tuple descriptor (copied from reldesc) */
-	dlist_head *cc_bucket;		/* hash buckets */
+	CCBucket   *cc_bucket;		/* hash buckets */
 	CCHashFN	cc_hashfunc[CATCACHE_MAXKEYS];	/* hash function for each key */
 	CCFastEqualFN cc_fastequal[CATCACHE_MAXKEYS];	/* fast equal function for
 													 * each key */
-- 
2.31.1

v1-0001-Remove-dead-flag-from-CatCTup.patch
From e6acb258fb831253dbc452a84b1b5a8381dc293c Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Sun, 8 Aug 2021 10:44:13 -0400
Subject: [PATCH v1 1/6] Remove "dead" flag from CatCTup

We don't want to have to compare to this anymore, since that would
require including it in the upcoming L1 area of the bucket cache.
Remove, and where necessary, replace with setting or comparing
ct->cache_elem.prev to NULL.

Extracted from a larger patch by Kyotaro Horiguchi, per suggestion from Heikki Linnakangas
Discussion: https://www.postgresql.org/message-id/0b58c41e-4fbc-c73d-d293-a35e4d8223f7%40iki.fi
---
 src/backend/utils/cache/catcache.c | 45 +++++++++++++++---------------
 src/include/utils/catcache.h       |  7 -----
 2 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4fbdc62d8c..47fb14e1db 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -459,21 +459,24 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	Assert(ct->refcount == 0);
 	Assert(ct->my_cache == cache);
 
+	/* delink from linked list if we haven't already */
+	if (ct->cache_elem.prev)
+	{
+		dlist_delete(&ct->cache_elem);
+		ct->cache_elem.prev = NULL;
+	}
+
 	if (ct->c_list)
 	{
 		/*
 		 * The cleanest way to handle this is to call CatCacheRemoveCList,
 		 * which will recurse back to me, and the recursive call will do the
-		 * work.  Set the "dead" flag to make sure it does recurse.
+		 * work.
 		 */
-		ct->dead = true;
 		CatCacheRemoveCList(cache, ct->c_list);
 		return;					/* nothing left to do */
 	}
 
-	/* delink from linked list */
-	dlist_delete(&ct->cache_elem);
-
 	/*
 	 * Free keys when we're dealing with a negative entry, normal entries just
 	 * point into tuple, allocated together with the CatCTup.
@@ -493,7 +496,7 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
  *
  * Unlink and delete the given cache list entry
  *
- * NB: any dead member entries that become unreferenced are deleted too.
+ * NB: any delinked member entries that become unreferenced are deleted too.
  */
 static void
 CatCacheRemoveCList(CatCache *cache, CatCList *cl)
@@ -510,10 +513,11 @@ CatCacheRemoveCList(CatCache *cache, CatCList *cl)
 
 		Assert(ct->c_list == cl);
 		ct->c_list = NULL;
-		/* if the member is dead and now has no references, remove it */
+
+		/* if the member is delinked and now has no references, remove it */
 		if (
 #ifndef CATCACHE_FORCE_RELEASE
-			ct->dead &&
+			ct->cache_elem.prev == NULL &&
 #endif
 			ct->refcount == 0)
 			CatCacheRemoveCTup(cache, ct);
@@ -588,7 +592,9 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
 			if (ct->refcount > 0 ||
 				(ct->c_list && ct->c_list->refcount > 0))
 			{
-				ct->dead = true;
+				dlist_delete(&ct->cache_elem);
+				ct->cache_elem.prev = NULL;
+
 				/* list, if any, was marked dead above */
 				Assert(ct->c_list == NULL || ct->c_list->dead);
 			}
@@ -667,7 +673,8 @@ ResetCatalogCache(CatCache *cache)
 			if (ct->refcount > 0 ||
 				(ct->c_list && ct->c_list->refcount > 0))
 			{
-				ct->dead = true;
+				dlist_delete(&ct->cache_elem);
+				ct->cache_elem.prev = NULL;
 				/* list, if any, was marked dead above */
 				Assert(ct->c_list == NULL || ct->c_list->dead);
 			}
@@ -1249,9 +1256,6 @@ SearchCatCacheInternal(CatCache *cache,
 	{
 		ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-		if (ct->dead)
-			continue;			/* ignore dead entries */
-
 		if (ct->hash_value != hashValue)
 			continue;			/* quickly skip entry if wrong hash val */
 
@@ -1453,7 +1457,7 @@ ReleaseCatCache(HeapTuple tuple)
 
 	if (
 #ifndef CATCACHE_FORCE_RELEASE
-		ct->dead &&
+		ct->cache_elem.prev == NULL &&
 #endif
 		ct->refcount == 0 &&
 		(ct->c_list == NULL || ct->c_list->refcount == 0))
@@ -1660,8 +1664,8 @@ SearchCatCacheList(CatCache *cache,
 			{
 				ct = dlist_container(CatCTup, cache_elem, iter.cur);
 
-				if (ct->dead || ct->negative)
-					continue;	/* ignore dead and negative entries */
+				if (ct->negative)
+					continue;	/* ignore negative entries */
 
 				if (ct->hash_value != hashValue)
 					continue;	/* quickly skip entry if wrong hash val */
@@ -1722,14 +1726,13 @@ SearchCatCacheList(CatCache *cache,
 	{
 		foreach(ctlist_item, ctlist)
 		{
 			ct = (CatCTup *) lfirst(ctlist_item);
+			Assert(ct->cache_elem.prev != NULL);
+
 			Assert(ct->c_list == NULL);
 			Assert(ct->refcount > 0);
 			ct->refcount--;
 			if (
-#ifndef CATCACHE_FORCE_RELEASE
-				ct->dead &&
-#endif
 				ct->refcount == 0 &&
 				(ct->c_list == NULL || ct->c_list->refcount == 0))
 				CatCacheRemoveCTup(cache, ct);
@@ -1757,9 +1760,6 @@ SearchCatCacheList(CatCache *cache,
 		/* release the temporary refcount on the member */
 		Assert(ct->refcount > 0);
 		ct->refcount--;
-		/* mark list dead if any members already dead */
-		if (ct->dead)
-			cl->dead = true;
 	}
 	Assert(i == nmembers);
 
@@ -1887,7 +1887,6 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	ct->my_cache = cache;
 	ct->c_list = NULL;
 	ct->refcount = 0;			/* for the moment */
-	ct->dead = false;
 	ct->negative = negative;
 	ct->hash_value = hashValue;
 
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ddc2762eb3..eac6a83f29 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -104,19 +104,12 @@ typedef struct catctup
 	dlist_node	cache_elem;		/* list member of per-bucket list */
 
 	/*
-	 * A tuple marked "dead" must not be returned by subsequent searches.
-	 * However, it won't be physically deleted from the cache until its
-	 * refcount goes to zero.  (If it's a member of a CatCList, the list's
-	 * refcount must go to zero, too; also, remember to mark the list dead at
-	 * the same time the tuple is marked.)
-	 *
 	 * A negative cache entry is an assertion that there is no tuple matching
 	 * a particular key.  This is just as useful as a normal entry so far as
 	 * avoiding catalog searches is concerned.  Management of positive and
 	 * negative entries is identical.
 	 */
 	int			refcount;		/* number of active references */
-	bool		dead;			/* dead but not yet removed? */
 	bool		negative;		/* negative cache entry? */
 	HeapTupleData tuple;		/* tuple management header */
 
-- 
2.31.1

v1-0004-Rationalize-rehashing-threshold.patch
From 04474538aff8b3135035e7671d9639164c336a0b Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 18 Aug 2021 13:04:56 -0400
Subject: [PATCH v1 4/6] Rationalize rehashing threshold

Rehash catcache when the average length of the conflict list is one
greater than the capacity of the L1
---
 src/backend/utils/cache/catcache.c | 21 +++++++++++++++++++--
 1 file changed, 19 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index dad985fc5a..e53d4325ab 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1908,6 +1908,7 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 	HeapTuple	dtp;
 	MemoryContext oldcxt;
 	CCBucket   *bucket = &cache->cc_bucket[hashIndex];
+	int			rehash_threshold;
 
 	/* negative entries have no tuple associated */
 	if (ntp)
@@ -1994,9 +1995,25 @@ CatalogCacheCreateEntry(CatCache *cache, HeapTuple ntp, Datum *arguments,
 
 	/*
 	 * If the hash table has become too full, enlarge the buckets array. Quite
-	 * arbitrarily, we enlarge when fill factor > 2.
+	 * arbitrarily, we enlarge when fill factor > (1 + max num entries in L1).
+	 * XXX magic constants -- find a better way.
 	 */
-	if (cache->cc_ntup > cache->cc_nbuckets * 2)
+	switch (cache->cc_nkeys)
+	{
+		case 1:
+			rehash_threshold = 6;
+			break;
+		case 2:
+			rehash_threshold = 5;
+			break;
+		case 3:
+			rehash_threshold = 4;
+			break;
+		case 4:
+			rehash_threshold = 3;
+			break;
+		default:
+			/* apparently cc_nkeys is not set during bootstrap */
+			rehash_threshold = 2;
+			break;
+	}
+
+	if (cache->cc_ntup > cache->cc_nbuckets * rehash_threshold)
 		RehashCatCache(cache);
 
 	return ct;
-- 
2.31.1

v1-0005-catcachebench.patch
From 245875acc01f315a87a88c821646c734d93bfbe8 Mon Sep 17 00:00:00 2001
From: Kyotaro Horiguchi <horikyoga.ntt@gmail.com>
Date: Wed, 18 Nov 2020 16:56:41 +0900
Subject: [PATCH v1 5/6] catcachebench

---
 contrib/catcachebench/Makefile               |  17 +
 contrib/catcachebench/catcachebench--0.0.sql |  14 +
 contrib/catcachebench/catcachebench.c        | 330 +++++++++++++++++++
 contrib/catcachebench/catcachebench.control  |   6 +
 src/backend/utils/cache/catcache.c           |  35 ++
 src/backend/utils/cache/syscache.c           |   2 +-
 6 files changed, 403 insertions(+), 1 deletion(-)
 create mode 100644 contrib/catcachebench/Makefile
 create mode 100644 contrib/catcachebench/catcachebench--0.0.sql
 create mode 100644 contrib/catcachebench/catcachebench.c
 create mode 100644 contrib/catcachebench/catcachebench.control

diff --git a/contrib/catcachebench/Makefile b/contrib/catcachebench/Makefile
new file mode 100644
index 0000000000..0478818b25
--- /dev/null
+++ b/contrib/catcachebench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = catcachebench
+OBJS = catcachebench.o
+
+EXTENSION = catcachebench
+DATA = catcachebench--0.0.sql
+PGFILEDESC = "catcachebench - benchmark for catcache pruning feature"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/catcachebench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/catcachebench/catcachebench--0.0.sql b/contrib/catcachebench/catcachebench--0.0.sql
new file mode 100644
index 0000000000..ea9cd62abb
--- /dev/null
+++ b/contrib/catcachebench/catcachebench--0.0.sql
@@ -0,0 +1,14 @@
+/* contrib/catcachebench/catcachebench--0.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION catcachebench" to load this file. \quit
+
+CREATE FUNCTION catcachebench(IN type int)
+RETURNS double precision
+AS 'MODULE_PATHNAME', 'catcachebench'
+LANGUAGE C STRICT VOLATILE;
+
+CREATE FUNCTION catcachereadstats(OUT catid int, OUT reloid oid, OUT searches bigint, OUT hits bigint, OUT neg_hits bigint)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'catcachereadstats'
+LANGUAGE C STRICT VOLATILE;
diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
new file mode 100644
index 0000000000..b5a4d794ed
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.c
@@ -0,0 +1,330 @@
+/*
+ * catcachebench: test code for cache pruning feature
+ */
+/* #define CATCACHE_STATS */
+#include "postgres.h"
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+
+Oid		tableoids[10000];
+int		ntables = 0;
+int16	attnums[1000];
+int		natts = 0;
+
+PG_MODULE_MAGIC;
+
+double catcachebench1(void);
+double catcachebench2(void);
+double catcachebench3(void);
+void collectinfo(void);
+void catcachewarmup(void);
+
+PG_FUNCTION_INFO_V1(catcachebench);
+PG_FUNCTION_INFO_V1(catcachereadstats);
+
+extern void CatalogCacheFlushCatalog2(Oid catId);
+extern int64 catcache_called;
+extern CatCache *SysCache[];
+
+typedef struct catcachestatsstate
+{
+	TupleDesc tupd;
+	int		  catId;
+} catcachestatsstate;
+
+Datum
+catcachereadstats(PG_FUNCTION_ARGS)
+{
+	catcachestatsstate *state_data = NULL;
+	FuncCallContext *fctx;
+
+	if (SRF_IS_FIRSTCALL())
+	{
+		TupleDesc	tupdesc;
+		MemoryContext mctx;
+
+		fctx = SRF_FIRSTCALL_INIT();
+		mctx = MemoryContextSwitchTo(fctx->multi_call_memory_ctx);
+
+		state_data = palloc(sizeof(catcachestatsstate));
+
+		/* Build a tuple descriptor for our result type */
+		if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+			elog(ERROR, "return type must be a row type");
+
+		state_data->tupd = tupdesc;
+		state_data->catId = 0;
+
+		fctx->user_fctx = state_data;
+
+		MemoryContextSwitchTo(mctx);
+	}
+
+	fctx = SRF_PERCALL_SETUP();
+	state_data = fctx->user_fctx;
+
+	if (state_data->catId < SysCacheSize)
+	{
+		Datum	values[5];
+		bool	nulls[5];
+		HeapTuple	resulttup;
+		Datum	result;
+		int		catId = state_data->catId++;
+
+		memset(nulls, 0, sizeof(nulls));
+		memset(values, 0, sizeof(values));
+		values[0] = Int16GetDatum(catId);
+		values[1] = ObjectIdGetDatum(SysCache[catId]->cc_reloid);
+#ifdef CATCACHE_STATS		
+		values[2] = Int64GetDatum(SysCache[catId]->cc_searches);
+		values[3] = Int64GetDatum(SysCache[catId]->cc_hits);
+		values[4] = Int64GetDatum(SysCache[catId]->cc_neg_hits);
+#endif
+		resulttup = heap_form_tuple(state_data->tupd, values, nulls);
+		result = HeapTupleGetDatum(resulttup);
+
+		SRF_RETURN_NEXT(fctx, result);
+	}
+
+	SRF_RETURN_DONE(fctx);
+}
+
+Datum
+catcachebench(PG_FUNCTION_ARGS)
+{
+	int		testtype = PG_GETARG_INT32(0);
+	double	ms;
+
+	collectinfo();
+
+	/* flush the catalog -- safe? don't mind. */
+	CatalogCacheFlushCatalog2(StatisticRelationId);
+
+	switch (testtype)
+	{
+	case 0:
+		catcachewarmup(); /* prewarm of syscatalog */
+		PG_RETURN_NULL();
+	case 1:
+		ms = catcachebench1(); break;
+	case 2:
+		ms = catcachebench2(); break;
+	case 3:
+		ms = catcachebench3(); break;
+	default:
+		elog(ERROR, "Invalid test type: %d", testtype);
+	}
+
+	PG_RETURN_DATUM(Float8GetDatum(ms));
+}
+
+/*
+ * fetch all attribute entries of all tables.
+ */
+double
+catcachebench1(void)
+{
+	int t, a;
+	instr_time	start,
+				duration;
+
+	PG_SETMASK(&BlockSig);
+	INSTR_TIME_SET_CURRENT(start);
+	for (t = 0 ; t < ntables ; t++)
+	{
+		for (a = 0 ; a < natts ; a++)
+		{
+			HeapTuple tup;
+
+			tup = SearchSysCache3(STATRELATTINH,
+								  ObjectIdGetDatum(tableoids[t]),
+								  Int16GetDatum(attnums[a]),
+								  BoolGetDatum(false));
+			/* should be null, but.. */
+			if (HeapTupleIsValid(tup))
+				ReleaseSysCache(tup);
+		}
+	}
+	INSTR_TIME_SET_CURRENT(duration);
+	INSTR_TIME_SUBTRACT(duration, start);
+	PG_SETMASK(&UnBlockSig);
+
+	return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of a table 6000 times.
+ */
+double
+catcachebench2(void)
+{
+	int t, a;
+	instr_time	start,
+				duration;
+
+	PG_SETMASK(&BlockSig);
+	INSTR_TIME_SET_CURRENT(start);
+	for (t = 0 ; t < 240000 ; t++)
+	{
+		for (a = 0 ; a < natts ; a++)
+		{
+			HeapTuple tup;
+
+			tup = SearchSysCache3(STATRELATTINH,
+								  ObjectIdGetDatum(tableoids[0]),
+								  Int16GetDatum(attnums[a]),
+								  BoolGetDatum(false));
+			/* should be null, but.. */
+			if (HeapTupleIsValid(tup))
+				ReleaseSysCache(tup);
+		}
+	}
+	INSTR_TIME_SET_CURRENT(duration);
+	INSTR_TIME_SUBTRACT(duration, start);
+	PG_SETMASK(&UnBlockSig);
+
+	return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+/*
+ * fetch all attribute entries of all tables twice with having expiration
+ * happen.
+ */
+double
+catcachebench3(void)
+{
+	const int clock_step = 1000;
+	int i, t, a;
+	instr_time	start,
+				duration;
+
+	PG_SETMASK(&BlockSig);
+	INSTR_TIME_SET_CURRENT(start);
+	for (i = 0 ; i < 4 ; i++)
+	{
+		int ct = clock_step;
+
+		for (t = 0 ; t < ntables ; t++)
+		{
+			/*
+			 * catcacheclock is updated by transaction timestamp, so needs to
+			 * be updated by other means for this test to work. Here I choosed
+			 * to update the clock every 1000 tables scan.
+			 */
+			if (--ct < 0)
+			{
+				SetCatCacheClock(GetCurrentTimestamp());
+				ct = clock_step;
+			}
+			for (a = 0 ; a < natts ; a++)
+			{
+				HeapTuple tup;
+
+				tup = SearchSysCache3(STATRELATTINH,
+									  ObjectIdGetDatum(tableoids[t]),
+									  Int16GetDatum(attnums[a]),
+									  BoolGetDatum(false));
+				/* should be null, but.. */
+				if (HeapTupleIsValid(tup))
+					ReleaseSysCache(tup);
+			}
+		}
+	}
+	INSTR_TIME_SET_CURRENT(duration);
+	INSTR_TIME_SUBTRACT(duration, start);
+	PG_SETMASK(&UnBlockSig);
+
+	return INSTR_TIME_GET_MILLISEC(duration);
+};
+
+void
+catcachewarmup(void)
+{
+	int t, a;
+
+	/* load up catalog tables */
+	for (t = 0 ; t < ntables ; t++)
+	{
+		for (a = 0 ; a < natts ; a++)
+		{
+			HeapTuple tup;
+
+			tup = SearchSysCache3(STATRELATTINH,
+								  ObjectIdGetDatum(tableoids[t]),
+								  Int16GetDatum(attnums[a]),
+								  BoolGetDatum(false));
+			/* should be null, but.. */
+			if (HeapTupleIsValid(tup))
+				ReleaseSysCache(tup);
+		}
+	}
+}
+
+void
+collectinfo(void)
+{
+	int ret;
+	Datum	values[10000];
+	bool	nulls[10000];
+	Oid		types0[] = {OIDOID};
+	int i;
+
+	ntables = 0;
+	natts = 0;
+
+	SPI_connect();
+	/* collect target tables */
+	ret = SPI_execute("select oid from pg_class where relnamespace = (select oid from pg_namespace where nspname = \'test\')",
+					  true, 0);
+	if (ret != SPI_OK_SELECT)
+		elog(ERROR, "Failed 1");
+	if (SPI_processed == 0)
+		elog(ERROR, "no relation found in schema \"test\"");
+	if (SPI_processed > 10000)
+		elog(ERROR, "too many relation found in schema \"test\"");
+
+	for (i = 0 ; i < SPI_processed ; i++)
+	{
+		heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+						  values, nulls);
+		if (nulls[0])
+			elog(ERROR, "Failed 2");
+
+		tableoids[ntables++] = DatumGetObjectId(values[0]);
+	}
+	SPI_finish();
+	elog(DEBUG1, "%d tables found", ntables);
+
+	values[0] = ObjectIdGetDatum(tableoids[0]);
+	nulls[0] = false;
+	SPI_connect();
+	ret = SPI_execute_with_args("select attnum from pg_attribute where attrelid = (select oid from pg_class where oid = $1)",
+								1, types0, values, NULL, true, 0);
+	if (SPI_processed == 0)
+		elog(ERROR, "no attribute found in table %d", tableoids[0]);
+	if (SPI_processed > 10000)
+		elog(ERROR, "too many relation found in table %d", tableoids[0]);
+	
+	/* collect target attributes. assuming all tables have the same attnums */
+	for (i = 0 ; i < SPI_processed ; i++)
+	{
+		int16 attnum;
+
+		heap_deform_tuple(SPI_tuptable->vals[i], SPI_tuptable->tupdesc,
+						  values, nulls);
+		if (nulls[0])
+			elog(ERROR, "Failed 3");
+		attnum = DatumGetInt16(values[0]);
+
+		if (attnum > 0)
+			attnums[natts++] = attnum;
+	}
+	SPI_finish();
+	elog(DEBUG1, "%d attributes found", natts);
+}
diff --git a/contrib/catcachebench/catcachebench.control b/contrib/catcachebench/catcachebench.control
new file mode 100644
index 0000000000..3fc9d2e420
--- /dev/null
+++ b/contrib/catcachebench/catcachebench.control
@@ -0,0 +1,6 @@
+# catcachebench
+
+comment = 'benchmark for catcache pruning'
+default_version = '0.0'
+module_pathname = '$libdir/catcachebench'
+relocatable = true
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index e53d4325ab..db64b19975 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -759,6 +759,41 @@ CatalogCacheFlushCatalog(Oid catId)
 	CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
 }
 
+
+/* FUNCTION FOR BENCHMARKING */
+void
+CatalogCacheFlushCatalog2(Oid catId)
+{
+	slist_iter	iter;
+
+	CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
+
+	slist_foreach(iter, &CacheHdr->ch_caches)
+	{
+		CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
+
+		/* Does this cache store tuples of the target catalog? */
+		if (cache->cc_reloid == catId)
+		{
+			/* Yes, so flush all its contents */
+			ResetCatalogCache(cache);
+
+			/* Tell inval.c to call syscache callbacks for this cache */
+			CallSyscacheCallbacks(cache->id, 0);
+
+			cache->cc_nbuckets = 128;
+			pfree(cache->cc_bucket);
+			cache->cc_bucket =
+				(dlist_head *) MemoryContextAllocZero(CacheMemoryContext,
+								  cache->cc_nbuckets * sizeof(dlist_head));
+			elog(LOG, "Catcache reset");
+		}
+	}
+
+	CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
+}
+/* END: FUNCTION FOR BENCHMARKING */
+
 /*
  *		InitCatCache
  *
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index d6cb78dea8..d2e5d859a9 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -994,7 +994,7 @@ static const struct cachedesc cacheinfo[] = {
 	}
 };
 
-static CatCache *SysCache[SysCacheSize];
+CatCache *SysCache[SysCacheSize];
 
 static bool CacheInitialized = false;
 
-- 
2.31.1

v1-0003-Use-hash-bucket-as-a-level-one-cache-to-avoid-wal.patch
From f6a3ddfeb049799df7b60bddb4b262c27e27e272 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 18 Aug 2021 12:36:19 -0400
Subject: [PATCH v1 3/6] Use hash bucket as a level-one cache to avoid walking
 the conflict chain

Since the hash buckets are now two cachelines wide, use that space
to store hashes, keys, and pointers to catcache entries. The number
of L1 entries supported depends on how many attributes are in the search
key -- from 2 for 4-att keys to 5 for 1-att keys.

If we must walk the list to find the entry, copy its data to the front
of the L1 and evict the last entry added there. However, we don't store
negative entries in the L1, since they are not as likely to be accessed
repeatedly.
---
 src/backend/utils/cache/catcache.c | 105 +++++++++++++++++++++++++++--
 1 file changed, 98 insertions(+), 7 deletions(-)

diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 55ea4b0339..dad985fc5a 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -459,6 +459,9 @@ CatCacheRemoveCTup(CatCache *cache, CatCTup *ct)
 	Assert(ct->refcount == 0);
 	Assert(ct->my_cache == cache);
 
+	/* reset L1 area of the bucket this entry hashes to */
+	memset(&cache->cc_bucket[HASH_INDEX(ct->hash_value, cache->cc_nbuckets)],
+		   0, offsetof(CCBucket, conflicts));
+
 	/* delink from linked list if we haven't already */
 	if (ct->cache_elem.prev)
 	{
@@ -585,6 +588,10 @@ CatCacheInvalidate(CatCache *cache, uint32 hashValue)
 	 */
 	hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
 	bucket = &cache->cc_bucket[hashIndex];
+
+	/* reset L1 area of bucket */
+	memset(bucket, 0, offsetof(CCBucket, conflicts));
+
 	dlist_foreach_modify(iter, &bucket->conflicts)
 	{
 		CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
@@ -668,6 +675,9 @@ ResetCatalogCache(CatCache *cache)
 	{
 		CCBucket   *bucket = &cache->cc_bucket[i];
 
+		/* reset L1 area of bucket */
+		memset(bucket, 0, offsetof(CCBucket, conflicts));
+
 		dlist_foreach_modify(iter, &bucket->conflicts)
 		{
 			CatCTup    *ct = dlist_container(CatCTup, cache_elem, iter.cur);
@@ -1251,12 +1261,71 @@ SearchCatCacheInternal(CatCache *cache,
 
 	/*
 	 * scan the hash bucket until we find a match or exhaust our tuples
-	 *
-	 * Note: it's okay to use dlist_foreach here, even though we modify the
-	 * dlist within the loop, because we don't continue the loop afterwards.
 	 */
 	bucket = &cache->cc_bucket[hashIndex];
 
+	/* initialize pointers into members of the CCBucket union type */
+	uint32	   *hash_array = NULL;
+	Datum	   *key_array = NULL;
+	struct catctup **ct_array = NULL;
+	int			nentries = 0;
+
+	/*
+	 * XXX this is kind of ugly -- maybe better if templated.
+	 * Not great that we throw away multi-array info.
+	 * Also nentries are magic constants that shouldn't be here.
+	 */
+	switch (nkeys)
+	{
+		case 1:
+			hash_array = bucket->bucket1.hashes;
+			key_array = (Datum *) bucket->bucket1.keys;
+			ct_array = bucket->bucket1.ct;
+			nentries = 5;
+			break;
+		case 2:
+			hash_array = bucket->bucket2.hashes;
+			key_array = (Datum *) bucket->bucket2.keys;
+			ct_array = bucket->bucket2.ct;
+			nentries = 4;
+			break;
+		case 3:
+			hash_array = bucket->bucket3.hashes;
+			key_array = (Datum *) bucket->bucket3.keys;
+			ct_array = bucket->bucket3.ct;
+			nentries = 3;
+			break;
+		case 4:
+			hash_array = bucket->bucket4.hashes;
+			key_array = (Datum *) bucket->bucket4.keys;
+			ct_array = bucket->bucket4.ct;
+			nentries = 2;
+			nentries = 2;
+			break;
+		default:
+			Assert(false);
+	}
+
+	/* first search within L1 area of the hash bucket */
+	for (int i = 0; i < nentries; i++)
+	{
+		if (hash_array[i] != hashValue)
+			continue;			/* quickly skip entry if wrong hash val */
+
+		if (!CatalogCacheCompareTuple(cache, nkeys, &key_array[i * nkeys], arguments))
+			continue;
+
+		/* This can happen if the search keys are zero and the L1 has empty slots. */
+		if (ct_array[i] == NULL)
+			continue;
+
+		/* found a match */
+		ct = ct_array[i];
+		Assert(ct->ct_magic == CT_MAGIC);
+		Assert(ct->cache_elem.prev != NULL);
+		goto finish;
+	}
+
+	/* not in the L1, so walk the bucket chain list */
 	dlist_foreach(iter, &bucket->conflicts)
 	{
 		ct = dlist_container(CatCTup, cache_elem, iter.cur);
@@ -1268,12 +1337,34 @@ SearchCatCacheInternal(CatCache *cache,
 			continue;
 
 		/*
-		 * We found a match in the cache.  Move it to the front of the list
-		 * for its hashbucket, in order to speed subsequent searches.  (The
+		 * We found a match in the cache.  Move it to the front of the L1 area
+		 * of its hashbucket, in order to speed subsequent searches.  (The
 		 * most frequently accessed elements in any hashbucket will tend to be
-		 * near the front of the hashbucket's list.)
+		 * in the L1 area.) We skip this for negative entries.
 		 */
-		dlist_move_head(&bucket->conflicts, &ct->cache_elem);
+		if (!ct->negative)
+		{
+			/* shift entries down one slot, evicting the last */
+			memmove(&hash_array[1],
+					hash_array,
+					(nentries - 1) * sizeof(hash_array[0]));
+			hash_array[0] = hashValue;
+
+			memmove(&key_array[nkeys],
+					key_array,
+					(nentries - 1) * nkeys * sizeof(key_array[0]));
+			memcpy(key_array, arguments,
+				   nkeys * sizeof(key_array[0]));
+
+			memmove(&ct_array[1],
+					ct_array,
+					(nentries - 1) * sizeof(ct_array[0]));
+			ct_array[0] = ct;
+		}
+
+finish:
 
 		/*
 		 * If it's a positive entry, bump its refcount and return it. If it's
-- 
2.31.1

v1-0006-Some-adjustments-to-catcachebench-and-catcache.c-.patch
From d916e783476d704e59a50d764fea1ccb8dba2aad Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Wed, 18 Aug 2021 14:03:57 -0400
Subject: [PATCH v1 6/6] Some adjustments to catcachebench and catcache.c so it
 will run

---
 contrib/catcachebench/catcachebench.c | 57 ---------------------------
 src/backend/utils/cache/catcache.c    | 41 +++----------------
 2 files changed, 6 insertions(+), 92 deletions(-)

diff --git a/contrib/catcachebench/catcachebench.c b/contrib/catcachebench/catcachebench.c
index b5a4d794ed..f688196baa 100644
--- a/contrib/catcachebench/catcachebench.c
+++ b/contrib/catcachebench/catcachebench.c
@@ -21,7 +21,6 @@ PG_MODULE_MAGIC;
 
 double catcachebench1(void);
 double catcachebench2(void);
-double catcachebench3(void);
 void collectinfo(void);
 void catcachewarmup(void);
 
@@ -103,9 +102,6 @@ catcachebench(PG_FUNCTION_ARGS)
 
 	collectinfo();
 
-	/* flush the catalog -- safe? don't mind. */
-	CatalogCacheFlushCatalog2(StatisticRelationId);
-
 	switch (testtype)
 	{
 	case 0:
@@ -115,8 +111,6 @@ catcachebench(PG_FUNCTION_ARGS)
 		ms = catcachebench1(); break;
 	case 2:
 		ms = catcachebench2(); break;
-	case 3:
-		ms = catcachebench3(); break;
 	default:
 		elog(ERROR, "Invalid test type: %d", testtype);
 	}
@@ -192,57 +186,6 @@ catcachebench2(void)
 	return INSTR_TIME_GET_MILLISEC(duration);
 };
 
-/*
- * fetch all attribute entries of all tables twice with having expiration
- * happen.
- */
-double
-catcachebench3(void)
-{
-	const int clock_step = 1000;
-	int i, t, a;
-	instr_time	start,
-				duration;
-
-	PG_SETMASK(&BlockSig);
-	INSTR_TIME_SET_CURRENT(start);
-	for (i = 0 ; i < 4 ; i++)
-	{
-		int ct = clock_step;
-
-		for (t = 0 ; t < ntables ; t++)
-		{
-			/*
-			 * catcacheclock is updated by transaction timestamp, so needs to
-			 * be updated by other means for this test to work. Here I chose
-			 * to update the clock every 1000 table scans.
-			 */
-			if (--ct < 0)
-			{
-				SetCatCacheClock(GetCurrentTimestamp());
-				ct = clock_step;
-			}
-			for (a = 0 ; a < natts ; a++)
-			{
-				HeapTuple tup;
-
-				tup = SearchSysCache3(STATRELATTINH,
-									  ObjectIdGetDatum(tableoids[t]),
-									  Int16GetDatum(attnums[a]),
-									  BoolGetDatum(false));
-				/* should be null, but.. */
-				if (HeapTupleIsValid(tup))
-					ReleaseSysCache(tup);
-			}
-		}
-	}
-	INSTR_TIME_SET_CURRENT(duration);
-	INSTR_TIME_SUBTRACT(duration, start);
-	PG_SETMASK(&UnBlockSig);
-
-	return INSTR_TIME_GET_MILLISEC(duration);
-};
-
 void
 catcachewarmup(void)
 {
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index db64b19975..8d09576dda 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -760,40 +760,6 @@ CatalogCacheFlushCatalog(Oid catId)
 }
 
 
-/* FUNCTION FOR BENCHMARKING */
-void
-CatalogCacheFlushCatalog2(Oid catId)
-{
-	slist_iter	iter;
-
-	CACHE_elog(DEBUG2, "CatalogCacheFlushCatalog called for %u", catId);
-
-	slist_foreach(iter, &CacheHdr->ch_caches)
-	{
-		CatCache   *cache = slist_container(CatCache, cc_next, iter.cur);
-
-		/* Does this cache store tuples of the target catalog? */
-		if (cache->cc_reloid == catId)
-		{
-			/* Yes, so flush all its contents */
-			ResetCatalogCache(cache);
-
-			/* Tell inval.c to call syscache callbacks for this cache */
-			CallSyscacheCallbacks(cache->id, 0);
-
-			cache->cc_nbuckets = 128;
-			pfree(cache->cc_bucket);
-			cache->cc_bucket =
-				(dlist_head *) MemoryContextAllocZero(CacheMemoryContext,
-								  cache->cc_nbuckets * sizeof(dlist_head));
-			elog(LOG, "Catcache reset");
-		}
-	}
-
-	CACHE_elog(DEBUG2, "end of CatalogCacheFlushCatalog call");
-}
-/* END: FUNCTION FOR BENCHMARKING */
-
 /*
  *		InitCatCache
  *
@@ -918,13 +884,18 @@ RehashCatCache(CatCache *cp)
 	CCBucket   *newbucket;
 	int			newnbuckets;
 	int			i;
+	size_t		sz;
 
 	elog(DEBUG1, "rehashing catalog cache id %d for %s; %d tups, %d buckets",
 		 cp->id, cp->cc_relname, cp->cc_ntup, cp->cc_nbuckets);
 
 	/* Allocate a new, larger, hash table. */
 	newnbuckets = cp->cc_nbuckets * 2;
-	newbucket = (CCBucket *) MemoryContextAllocZero(CacheMemoryContext, newnbuckets * sizeof(CCBucket));
+	sz = newnbuckets * sizeof(CCBucket);
+	if (sz > MaxAllocSize)
+		return;
+
+	newbucket = (CCBucket *) MemoryContextAllocZero(CacheMemoryContext, sz);
 
 	/* Move all entries from old hash table to new. */
 	for (i = 0; i < cp->cc_nbuckets; i++)
-- 
2.31.1

Attachment: gen_tbl.pl
#11John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#4)
1 attachment(s)
Re: RFC: Improve CPU cache locality of syscache searches

On Thu, Aug 5, 2021 at 4:12 PM Andres Freund <andres@anarazel.de> wrote:

I have wondered before whether we should have syscache definitions
generate code specific to each syscache definition. I do think that'd
give a good bit of performance boost. But I don't see a trivial way to
get there without notational overhead.

I've made a small step in this direction in the attached. It uses a
template approach to generate type-specific SearchCatCache* functions, for
now only the 4-key ones. Since there's only a few of those, it's manageable
to invoke the template and change the call sites by hand. To get this to
scale up to all syscaches would require some scripting, but this much is
enough to get feedback on the approach. One possible concern in starting
down this path is that hundreds of call sites would have to be changed. (If
there's a good way around that, it hasn't occurred to me.) It might also be
possible to replace the per-key invocations with a single memcpy() for most
types, but that's a bit more work.

--
John Naylor
EDB: http://www.enterprisedb.com

Attachments:

v1-0001-Specialize-SearchSysCache4-by-the-search-key-data.patch
From c584f4434576742dce3afc699bb5264773d067b1 Mon Sep 17 00:00:00 2001
From: John Naylor <john.naylor@postgresql.org>
Date: Thu, 19 Aug 2021 18:43:18 -0400
Subject: [PATCH v1] Specialize SearchSysCache4 by the search key data types

Use a modified version of SearchCatCacheInternal as a template
for specializing the search functions. The reason to do this is
to avoid calling equality and hash functions through a pointer,
in the hope that this will improve syscache performance.
---
 src/backend/access/brin/brin_inclusion.c     |   2 +-
 src/backend/access/brin/brin_minmax.c        |   2 +-
 src/backend/access/brin/brin_minmax_multi.c  |   2 +-
 src/backend/catalog/namespace.c              |   2 +-
 src/backend/catalog/objectaddress.c          |   4 +-
 src/backend/catalog/pg_operator.c            |   2 +-
 src/backend/utils/cache/catcache.c           |  55 +++++-
 src/backend/utils/cache/lsyscache.c          |   4 +-
 src/backend/utils/cache/syscache.c           |  29 ++-
 src/include/utils/catcache.h                 |   4 +-
 src/include/utils/catcache_search_template.h | 176 +++++++++++++++++++
 src/include/utils/syscache.h                 |   6 +-
 12 files changed, 266 insertions(+), 22 deletions(-)
 create mode 100644 src/include/utils/catcache_search_template.h

diff --git a/src/backend/access/brin/brin_inclusion.c b/src/backend/access/brin/brin_inclusion.c
index 0b384c0bd1..78e546821d 100644
--- a/src/backend/access/brin/brin_inclusion.c
+++ b/src/backend/access/brin/brin_inclusion.c
@@ -634,7 +634,7 @@ inclusion_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno, Oid subtype,
 
 		opfamily = bdesc->bd_index->rd_opfamily[attno - 1];
 		attr = TupleDescAttr(bdesc->bd_tupdesc, attno - 1);
-		tuple = SearchSysCache4(AMOPSTRATEGY, ObjectIdGetDatum(opfamily),
+		tuple = SearchSysCacheAMOPSTRATEGY(ObjectIdGetDatum(opfamily),
 								ObjectIdGetDatum(attr->atttypid),
 								ObjectIdGetDatum(subtype),
 								Int16GetDatum(strategynum));
diff --git a/src/backend/access/brin/brin_minmax.c b/src/backend/access/brin/brin_minmax.c
index 798f06c722..9d548f0576 100644
--- a/src/backend/access/brin/brin_minmax.c
+++ b/src/backend/access/brin/brin_minmax.c
@@ -294,7 +294,7 @@ minmax_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno, Oid subtype,
 
 		opfamily = bdesc->bd_index->rd_opfamily[attno - 1];
 		attr = TupleDescAttr(bdesc->bd_tupdesc, attno - 1);
-		tuple = SearchSysCache4(AMOPSTRATEGY, ObjectIdGetDatum(opfamily),
+		tuple = SearchSysCacheAMOPSTRATEGY(ObjectIdGetDatum(opfamily),
 								ObjectIdGetDatum(attr->atttypid),
 								ObjectIdGetDatum(subtype),
 								Int16GetDatum(strategynum));
diff --git a/src/backend/access/brin/brin_minmax_multi.c b/src/backend/access/brin/brin_minmax_multi.c
index e3c98c2ffd..2ab4fb046a 100644
--- a/src/backend/access/brin/brin_minmax_multi.c
+++ b/src/backend/access/brin/brin_minmax_multi.c
@@ -2932,7 +2932,7 @@ minmax_multi_get_strategy_procinfo(BrinDesc *bdesc, uint16 attno, Oid subtype,
 
 		opfamily = bdesc->bd_index->rd_opfamily[attno - 1];
 		attr = TupleDescAttr(bdesc->bd_tupdesc, attno - 1);
-		tuple = SearchSysCache4(AMOPSTRATEGY, ObjectIdGetDatum(opfamily),
+		tuple = SearchSysCacheAMOPSTRATEGY(ObjectIdGetDatum(opfamily),
 								ObjectIdGetDatum(attr->atttypid),
 								ObjectIdGetDatum(subtype),
 								Int16GetDatum(strategynum));
diff --git a/src/backend/catalog/namespace.c b/src/backend/catalog/namespace.c
index fd767fc5cf..5c700d537b 100644
--- a/src/backend/catalog/namespace.c
+++ b/src/backend/catalog/namespace.c
@@ -1545,7 +1545,7 @@ OpernameGetOprid(List *names, Oid oprleft, Oid oprright)
 		{
 			HeapTuple	opertup;
 
-			opertup = SearchSysCache4(OPERNAMENSP,
+			opertup = SearchSysCacheOPERNAMENSP(
 									  CStringGetDatum(opername),
 									  ObjectIdGetDatum(oprleft),
 									  ObjectIdGetDatum(oprright),
diff --git a/src/backend/catalog/objectaddress.c b/src/backend/catalog/objectaddress.c
index 9882e549c4..5aca2badb3 100644
--- a/src/backend/catalog/objectaddress.c
+++ b/src/backend/catalog/objectaddress.c
@@ -1751,7 +1751,7 @@ get_object_address_opf_member(ObjectType objtype,
 				ObjectAddressSet(address, AccessMethodOperatorRelationId,
 								 InvalidOid);
 
-				tp = SearchSysCache4(AMOPSTRATEGY,
+				tp = SearchSysCacheAMOPSTRATEGY(
 									 ObjectIdGetDatum(famaddr.objectId),
 									 ObjectIdGetDatum(typeoids[0]),
 									 ObjectIdGetDatum(typeoids[1]),
@@ -1782,7 +1782,7 @@ get_object_address_opf_member(ObjectType objtype,
 				ObjectAddressSet(address, AccessMethodProcedureRelationId,
 								 InvalidOid);
 
-				tp = SearchSysCache4(AMPROCNUM,
+				tp = SearchSysCacheAMPROCNUM(
 									 ObjectIdGetDatum(famaddr.objectId),
 									 ObjectIdGetDatum(typeoids[0]),
 									 ObjectIdGetDatum(typeoids[1]),
diff --git a/src/backend/catalog/pg_operator.c b/src/backend/catalog/pg_operator.c
index 6c270f9338..0dca12010b 100644
--- a/src/backend/catalog/pg_operator.c
+++ b/src/backend/catalog/pg_operator.c
@@ -136,7 +136,7 @@ OperatorGet(const char *operatorName,
 	HeapTuple	tup;
 	Oid			operatorObjectId;
 
-	tup = SearchSysCache4(OPERNAMENSP,
+	tup = SearchSysCacheOPERNAMENSP(
 						  PointerGetDatum(operatorName),
 						  ObjectIdGetDatum(leftObjectId),
 						  ObjectIdGetDatum(rightObjectId),
diff --git a/src/backend/utils/cache/catcache.c b/src/backend/utils/cache/catcache.c
index 4fbdc62d8c..cd98766d59 100644
--- a/src/backend/utils/cache/catcache.c
+++ b/src/backend/utils/cache/catcache.c
@@ -1186,12 +1186,55 @@ SearchCatCache3(CatCache *cache,
 }
 
 
-HeapTuple
-SearchCatCache4(CatCache *cache,
-				Datum v1, Datum v2, Datum v3, Datum v4)
-{
-	return SearchCatCacheInternal(cache, 4, v1, v2, v3, v4);
-}
+/* 4-key search functions */
+
+#define CC_SEARCH SearchCatCacheOidOidOidInt16
+#define CC_NKEYS 4
+#define FASTEQFUNC1 int4eqfast
+#define FASTEQFUNC2 int4eqfast
+#define FASTEQFUNC3 int4eqfast
+#define FASTEQFUNC4 int2eqfast
+#define HASHFUNC1 int4hashfast
+#define HASHFUNC2 int4hashfast
+#define HASHFUNC3 int4hashfast
+#define HASHFUNC4 int2hashfast
+
+#include "utils/catcache_search_template.h"
+
+#undef CC_SEARCH
+#undef CC_NKEYS
+#undef FASTEQFUNC1
+#undef FASTEQFUNC2
+#undef FASTEQFUNC3
+#undef FASTEQFUNC4
+#undef HASHFUNC1
+#undef HASHFUNC2
+#undef HASHFUNC3
+#undef HASHFUNC4
+
+#define CC_SEARCH SearchCatCacheNameOidOidOid
+#define CC_NKEYS 4
+#define FASTEQFUNC1 nameeqfast
+#define FASTEQFUNC2 int4eqfast
+#define FASTEQFUNC3 int4eqfast
+#define FASTEQFUNC4 int4eqfast
+#define HASHFUNC1 namehashfast
+#define HASHFUNC2 int4hashfast
+#define HASHFUNC3 int4hashfast
+#define HASHFUNC4 int4hashfast
+
+#include "utils/catcache_search_template.h"
+
+#undef CC_SEARCH
+#undef CC_NKEYS
+#undef FASTEQFUNC1
+#undef FASTEQFUNC2
+#undef FASTEQFUNC3
+#undef FASTEQFUNC4
+#undef HASHFUNC1
+#undef HASHFUNC2
+#undef HASHFUNC3
+#undef HASHFUNC4
 
 /*
  * Work-horse for SearchCatCache/SearchCatCacheN.
diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index 4ebaa552a2..b5d7dcf94b 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -168,7 +168,7 @@ get_opfamily_member(Oid opfamily, Oid lefttype, Oid righttype,
 	Form_pg_amop amop_tup;
 	Oid			result;
 
-	tp = SearchSysCache4(AMOPSTRATEGY,
+	tp = SearchSysCacheAMOPSTRATEGY(
 						 ObjectIdGetDatum(opfamily),
 						 ObjectIdGetDatum(lefttype),
 						 ObjectIdGetDatum(righttype),
@@ -797,7 +797,7 @@ get_opfamily_proc(Oid opfamily, Oid lefttype, Oid righttype, int16 procnum)
 	Form_pg_amproc amproc_tup;
 	RegProcedure result;
 
-	tp = SearchSysCache4(AMPROCNUM,
+	tp = SearchSysCacheAMPROCNUM(
 						 ObjectIdGetDatum(opfamily),
 						 ObjectIdGetDatum(lefttype),
 						 ObjectIdGetDatum(righttype),
diff --git a/src/backend/utils/cache/syscache.c b/src/backend/utils/cache/syscache.c
index d6cb78dea8..b0131374ea 100644
--- a/src/backend/utils/cache/syscache.c
+++ b/src/backend/utils/cache/syscache.c
@@ -1157,14 +1157,33 @@ SearchSysCache3(int cacheId,
 }
 
 HeapTuple
-SearchSysCache4(int cacheId,
+SearchSysCacheAMOPSTRATEGY(
 				Datum key1, Datum key2, Datum key3, Datum key4)
 {
-	Assert(cacheId >= 0 && cacheId < SysCacheSize &&
-		   PointerIsValid(SysCache[cacheId]));
-	Assert(SysCache[cacheId]->cc_nkeys == 4);
+	Assert(PointerIsValid(SysCache[AMOPSTRATEGY]));
+	Assert(SysCache[AMOPSTRATEGY]->cc_nkeys == 4);
+
+	return SearchCatCacheOidOidOidInt16(SysCache[AMOPSTRATEGY], key1, key2, key3, key4);
+}
+
+HeapTuple
+SearchSysCacheAMPROCNUM(
+				Datum key1, Datum key2, Datum key3, Datum key4)
+{
+	Assert(PointerIsValid(SysCache[AMPROCNUM]));
+	Assert(SysCache[AMPROCNUM]->cc_nkeys == 4);
+
+	return SearchCatCacheOidOidOidInt16(SysCache[AMPROCNUM], key1, key2, key3, key4);
+}
+
+HeapTuple
+SearchSysCacheOPERNAMENSP(
+				Datum key1, Datum key2, Datum key3, Datum key4)
+{
+	Assert(PointerIsValid(SysCache[OPERNAMENSP]));
+	Assert(SysCache[OPERNAMENSP]->cc_nkeys == 4);
 
-	return SearchCatCache4(SysCache[cacheId], key1, key2, key3, key4);
+	return SearchCatCacheNameOidOidOid(SysCache[OPERNAMENSP], key1, key2, key3, key4);
 }
 
 /*
diff --git a/src/include/utils/catcache.h b/src/include/utils/catcache.h
index ddc2762eb3..623133f57b 100644
--- a/src/include/utils/catcache.h
+++ b/src/include/utils/catcache.h
@@ -204,7 +204,9 @@ extern HeapTuple SearchCatCache2(CatCache *cache,
 								 Datum v1, Datum v2);
 extern HeapTuple SearchCatCache3(CatCache *cache,
 								 Datum v1, Datum v2, Datum v3);
-extern HeapTuple SearchCatCache4(CatCache *cache,
+extern HeapTuple SearchCatCacheOidOidOidInt16(CatCache *cache,
+								 Datum v1, Datum v2, Datum v3, Datum v4);
+extern HeapTuple SearchCatCacheNameOidOidOid(CatCache *cache,
 								 Datum v1, Datum v2, Datum v3, Datum v4);
 extern void ReleaseCatCache(HeapTuple tuple);
 
diff --git a/src/include/utils/catcache_search_template.h b/src/include/utils/catcache_search_template.h
new file mode 100644
index 0000000000..6f5dc2573c
--- /dev/null
+++ b/src/include/utils/catcache_search_template.h
@@ -0,0 +1,176 @@
+/*-------------------------------------------------------------------------
+ * catcache_search_template.h
+ *
+ * A template for type-specialized SearchCatCache functions
+ *
+ * Usage notes:
+ *
+ * To generate functions specialized for a set of catcache keys,
+ * the following parameter macros should be #define'd before this
+ * file is included.
+ *
+ * - CC_SEARCH - the name of the search function to be generated
+ * - CC_NKEYS - the number of search keys for the tuple
+ * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
+ * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)
+ *
+ *
+ * Copyright (c) 2021, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1992-1994, Regents of the University of California
+ *
+ * src/include/utils/catcache_search_template.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+
+HeapTuple
+CC_SEARCH(CatCache *cache,
+		  Datum v1,
+		  Datum v2,
+		  Datum v3,
+		  Datum v4)
+{
+	uint32		hashValue = 0;
+	uint32		oneHash;
+	Index		hashIndex;
+	dlist_iter	iter;
+	dlist_head *bucket;
+	CatCTup    *ct;
+
+	/* Make sure we're in an xact, even if this ends up being a cache hit */
+	Assert(IsTransactionState());
+
+	Assert(cache->cc_nkeys == CC_NKEYS);
+
+	/*
+	 * one-time startup overhead for each cache
+	 */
+	if (unlikely(cache->cc_tupdesc == NULL))
+		CatalogCacheInitializeCache(cache);
+
+#ifdef CATCACHE_STATS
+	cache->cc_searches++;
+#endif
+
+	/*
+	 * find the hash bucket in which to look for the tuple
+	 */
+	CACHE_elog(DEBUG2, "CatalogCacheComputeHashValue %s %d %p",
+			   cache->cc_relname, CC_NKEYS, cache);
+
+	switch (CC_NKEYS)
+	{
+		case 4:
+			oneHash = HASHFUNC4(v4);
+
+			hashValue ^= oneHash << 24;
+			hashValue ^= oneHash >> 8;
+			/* FALLTHROUGH */
+		case 3:
+			oneHash = HASHFUNC3(v3);
+
+			hashValue ^= oneHash << 16;
+			hashValue ^= oneHash >> 16;
+			/* FALLTHROUGH */
+		case 2:
+			oneHash = HASHFUNC2(v2);
+
+			hashValue ^= oneHash << 8;
+			hashValue ^= oneHash >> 24;
+			/* FALLTHROUGH */
+		case 1:
+			oneHash = HASHFUNC1(v1);
+
+			hashValue ^= oneHash;
+			break;
+		default:
+			elog(FATAL, "wrong number of hash keys: %d", CC_NKEYS);
+			break;
+	}
+
+	hashIndex = HASH_INDEX(hashValue, cache->cc_nbuckets);
+
+	/*
+	 * scan the hash bucket until we find a match or exhaust our tuples
+	 *
+	 * Note: it's okay to use dlist_foreach here, even though we modify the
+	 * dlist within the loop, because we don't continue the loop afterwards.
+	 */
+	bucket = &cache->cc_bucket[hashIndex];
+	dlist_foreach(iter, bucket)
+	{
+		ct = dlist_container(CatCTup, cache_elem, iter.cur);
+
+		if (ct->dead)
+			continue;			/* ignore dead entries */
+
+		if (ct->hash_value != hashValue)
+			continue;			/* quickly skip entry if wrong hash val */
+
+		switch (CC_NKEYS)
+		{
+			case 4:
+				if (!FASTEQFUNC4(ct->keys[3], v4))
+					continue;
+				/* FALLTHROUGH */
+			case 3:
+				if (!FASTEQFUNC3(ct->keys[2], v3))
+					continue;
+				/* FALLTHROUGH */
+			case 2:
+				if (!FASTEQFUNC2(ct->keys[1], v2))
+					continue;
+				/* FALLTHROUGH */
+			case 1:
+				if (!FASTEQFUNC1(ct->keys[0], v1))
+					continue;
+				break;
+			default:
+				elog(FATAL, "wrong number of keys: %d", CC_NKEYS);
+				break;
+		}
+
+		/*
+		 * We found a match in the cache.  Move it to the front of the list
+		 * for its hashbucket, in order to speed subsequent searches.  (The
+		 * most frequently accessed elements in any hashbucket will tend to be
+		 * near the front of the hashbucket's list.)
+		 */
+		dlist_move_head(bucket, &ct->cache_elem);
+
+		/*
+		 * If it's a positive entry, bump its refcount and return it. If it's
+		 * negative, we can report failure to the caller.
+		 */
+		if (!ct->negative)
+		{
+			ResourceOwnerEnlargeCatCacheRefs(CurrentResourceOwner);
+			ct->refcount++;
+			ResourceOwnerRememberCatCacheRef(CurrentResourceOwner, &ct->tuple);
+
+			CACHE_elog(DEBUG2, "SearchCatCache(%s): found in bucket %d",
+					   cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+			cache->cc_hits++;
+#endif
+
+			return &ct->tuple;
+		}
+		else
+		{
+			CACHE_elog(DEBUG2, "SearchCatCache(%s): found neg entry in bucket %d",
+					   cache->cc_relname, hashIndex);
+
+#ifdef CATCACHE_STATS
+			cache->cc_neg_hits++;
+#endif
+
+			return NULL;
+		}
+	}
+
+	return SearchCatCacheMiss(cache, CC_NKEYS, hashValue, hashIndex, v1, v2, v3, v4);
+}
+
diff --git a/src/include/utils/syscache.h b/src/include/utils/syscache.h
index d74a348600..5396f9ccfd 100644
--- a/src/include/utils/syscache.h
+++ b/src/include/utils/syscache.h
@@ -130,7 +130,11 @@ extern HeapTuple SearchSysCache2(int cacheId,
 								 Datum key1, Datum key2);
 extern HeapTuple SearchSysCache3(int cacheId,
 								 Datum key1, Datum key2, Datum key3);
-extern HeapTuple SearchSysCache4(int cacheId,
+extern HeapTuple SearchSysCacheAMOPSTRATEGY(
+								 Datum key1, Datum key2, Datum key3, Datum key4);
+extern HeapTuple SearchSysCacheAMPROCNUM(
+								 Datum key1, Datum key2, Datum key3, Datum key4);
+extern HeapTuple SearchSysCacheOPERNAMENSP(
 								 Datum key1, Datum key2, Datum key3, Datum key4);
 
 extern void ReleaseSysCache(HeapTuple tuple);
-- 
2.31.1

#12Andres Freund
andres@anarazel.de
In reply to: John Naylor (#11)
Re: RFC: Improve CPU cache locality of syscache searches

Hi,

On 2021-08-19 19:10:37 -0400, John Naylor wrote:

I've made a small step in this direction in the attached. It uses a
template approach to generate type-specific SearchCatCache* functions, for
now only the 4-key ones. Since there's only a few of those, it's manageable
to invoke the template and change the call sites by hand. To get this to
scale up to all syscaches would require some scripting, but this much is
enough to get feedback on the approach. One possible concern in starting
down this path is that hundreds of call sites would have to be changed. (If
there's a good way around that, it hasn't occurred to me.)

I was thinking of avoiding the need for that via a macro. I.e. have
SearchSysCache4(AMOPSTRATEGY, ...) be mapped to
SearchSysCache4AMOPSTRATEGY(...). That way callers wouldn't need to care, and
we could continue to evolve the internals without continually doing
large-scale code changes?
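
A minimal sketch of that indirection (the specialized names are
illustrative, nothing is settled): since call sites always pass the cache
id as a literal identifier, token pasting can do the mapping:

#define SearchSysCache4(cacheId, key1, key2, key3, key4) \
	SearchSysCache4##cacheId(key1, key2, key3, key4)

An existing call like SearchSysCache4(AMOPSTRATEGY, k1, k2, k3, k4) would
then expand to SearchSysCache4AMOPSTRATEGY(k1, k2, k3, k4) with no change
at the caller, though the current SearchSysCache4() function would need to
be renamed out of the macro's way.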

diff --git a/src/include/utils/catcache_search_template.h b/src/include/utils/catcache_search_template.h
new file mode 100644
index 0000000000..6f5dc2573c
--- /dev/null
+++ b/src/include/utils/catcache_search_template.h
@@ -0,0 +1,176 @@
+/*-------------------------------------------------------------------------
+ * catcache_search_template.h
+ *
+ * A template for type-specialized SearchCatCache functions
+ *
+ * Usage notes:
+ *
+ * To generate functions specialized for a set of catcache keys,
+ * the following parameter macros should be #define'd before this
+ * file is included.
+ *
+ * - CC_SEARCH - the name of the search function to be generated
+ * - CC_NKEYS - the number of search keys for the tuple
+ * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
+ * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)

It's not clear to me we need such a complicated interface here. Can't we just
have a pg_attribute_always_inline function with a bunch more parameters?

Greetings,

Andres Freund

#13John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#12)
Re: RFC: Improve CPU cache locality of syscache searches

On Fri, Aug 27, 2021 at 3:42 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2021-08-19 19:10:37 -0400, John Naylor wrote:

I've made a small step in this direction in the attached. It uses a
template approach to generate type-specific SearchCatCache* functions,
for now only the 4-key ones. Since there's only a few of those, it's
manageable to invoke the template and change the call sites by hand.
To get this to scale up to all syscaches would require some scripting,
but this much is enough to get feedback on the approach. One possible
concern in starting down this path is that hundreds of call sites would
have to be changed. (If there's a good way around that, it hasn't
occurred to me.)

I was thinking of avoiding the need for that via a macro. I.e. have
SearchSysCache4(AMOPSTRATEGY, ...) be mapped to
SearchSysCache4AMOPSTRATEGY(...). That way callers wouldn't need to care,
and we could continue to evolve the internals without continually doing
large-scale code changes?

Makes sense.

+ * To generate functions specialized for a set of catcache keys,
+ * the following parameter macros should be #define'd before this
+ * file is included.
+ *
+ * - CC_SEARCH - the name of the search function to be generated
+ * - CC_NKEYS - the number of search keys for the tuple
+ * - FASTEQFUNCx (x in 1,2,3,4) - type-specific equality function(s)
+ * - HASHFUNCx (x in 1,2,3,4) - type-specific hash function(s)

It's not clear to me we need such a complicated interface here. Can't we
just have a pg_attribute_always_inline function with a bunch more
parameters?

Were you thinking in terms of passing the type OIDs as parameters, like this?

HeapTuple
SearchCatCache1(CatCache *cache, Datum v1, Oid t1)
{
	return SearchCatCacheInternal(cache, 1, v1, t1, 0, 0, 0, 0, 0, 0);
}

And then CatalogCacheComputeHashValue() and CatalogCacheCompareTuple()
would likewise take types as parameters, which they could use to pick the
right functions at compile time?
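
For the hash side that could look something like the sketch below
(cc_hash_datum is a made-up name; the *hashfast helpers already exist as
statics in catcache.c). As long as "type" is a compile-time constant at
every call site, the switch folds down to a single direct call:

static pg_attribute_always_inline uint32
cc_hash_datum(Oid type, Datum v)
{
	switch (type)
	{
		case OIDOID:
			return int4hashfast(v);
		case INT2OID:
			return int2hashfast(v);
		case NAMEOID:
			return namehashfast(v);
		default:
			pg_unreachable();
	}
}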

--
John Naylor
EDB: http://www.enterprisedb.com

#14Andres Freund
andres@anarazel.de
In reply to: John Naylor (#13)
Re: RFC: Improve CPU cache locality of syscache searches

Hi,

On 2021-08-31 15:06:32 -0400, John Naylor wrote:

Were you thinking in terms of passing the type oid in parameters, like this?

HeapTuple
SearchCatCache1(CatCache *cache, Datum v1, Oid t1)
{
return SearchCatCacheInternal(cache, 1, v1, t1, 0, 0, 0, 0, 0, 0);
}

And then CatalogCacheComputeHashValue() and CatalogCacheCompareTuple()
would likewise take types as parameters, which they could use to pick the
right functions at compile time?

I was thinking that the script could emit something like

static const cachedesc syscachedesc_AGGFNOID = {
	.reloid = AggregateRelationId,
	.indoid = AggregateFnoidIndexId,
	.nkeys = 1,
	.keys = {
		{
			.varno = Anum_pg_aggregate_aggfnoid,
			.type = Oid,
			.comparator = ...
		}
	}
};

static CatCache syscache_AGGFNOID;

HeapTuple
SearchSysCacheAGGFNOID(Datum key1)
{
	return SearchSysCacheInternal(&syscachedesc_AGGFNOID, syscache_AGGFNOID, key1);
}

and SearchSysCacheInternal would have a pg_attribute_always_inline()
fastpath. That fastpath would basically be SearchCatCacheInternal(), with an
extra const cachedesc * parameter. Then the compiler should be able to generate
direct function calls to the hash functions in CatalogCacheComputeHashValue()
and direct calls to the equal function in CatalogCacheCompareTuple().
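
A minimal sketch of why the constant descriptor enables that (field names
follow the example above and are otherwise hypothetical):

typedef struct cachekeydesc
{
	int			varno;			/* key column's attribute number */
	uint32		(*hash) (Datum d);	/* type-specific hash function */
	bool		(*comparator) (Datum a, Datum b);	/* type-specific equality */
} cachekeydesc;

typedef struct cachedesc
{
	Oid			reloid;
	Oid			indoid;
	int			nkeys;
	cachekeydesc keys[4];
} cachedesc;

static pg_attribute_always_inline uint32
syscache_hash1(const cachedesc *desc, Datum key1)
{
	/*
	 * Every generated caller passes a pointer to a file-scope const
	 * descriptor, so after inlining the compiler can constant-fold
	 * desc->keys[0].hash into a direct call.
	 */
	return desc->keys[0].hash(key1);
}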

One difficulty would be that I don't see how to make this work with syscache.c
being compiled separately from catcache.c - but there's probably no need
to. The script could generate a syscache_impl.h that catcache.c includes.

Greetings,

Andres Freund