Optimizing ResourceOwner to speed up COPY
Hi,
While reviewing and testing a nearby patch (using COPY for batching in
postgres_fdw), I noticed some of the COPY queries are spending a
substantial amount of time in ResourceOwnerAddToHash(). The exact figure
depends on the amount of data in the COPY, but it was often close to 50%
(according to perf-record/report).
The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, with O(n^2) behavior with duplicates. This
happens with COPY, because COPY creates an array of 1000 tuple slots,
and each slot references the same tuple descriptor. And the descriptor
is added to ResourceOwner for each slot.
The trouble is that ResourceOwner uses a simple open addressing hash
table, and the duplicate values end up in a long dense "train" of
entries, slowing down both additions and deletions. Starting with an
empty hash table, adding the first entry is fine - the first slot is
available. But the following additions need to skip over the current
entries - the n-th addition needs to skip over n-1 entries.
With N entries this means N * (N-1) / 2 skips in total. And then the same thing
happens for deletions, in reverse.
This is not the first time I noticed ResourceOwner in profiles, but I
never investigated it. I don't know if all those cases were caused by
this same behavior, but it's likely at least some were ...
There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuple slots, the descriptor will
have just one ResourceElem entry with count=1000.
This reduces the insert/delete complexity - not quite to O(1), but much
lower than O(n^2). It also reduces the memory usage with duplicates, and
postpones when the hash table needs to be enlarged. Of course, if there
are no duplicates, it'll use a bit more memory.
The deduplication is "opportunistic" in the sense that you may still end
up with multiple entries for the same value/kind. If adding a resource
finds an empty slot before hitting the existing entry, it'll use the slot.
Also, only the hash table is deduplicated, not the initial array (which
is, however, quite small).
This is a reasonable trade-off because it means it does not get more
expensive for cases with no duplicates, while still improving cases with
many duplicate values.
To demonstrate the benefits, here's a throughput (tps) from a COPY of a
certain number of rows from a file into an UNLOGGED table, with 1, 4 and
8 clients.
         |          master         |         patched
    rows |       1       4       8 |       1       4       8
---------|-------------------------|-------------------------
      10 |  112090  418421  594905 |  112871  419043  589646
     100 |   35265  121903  168238 |   42536  138578  183740
    1000 |    1450    5700   10725 |    5847   21594   30713
   10000 |     531    1943    2988 |     743    2619    3557
  100000 |      72     244     324 |      76     256     332
Or, as relative throughput:
          patched / master
    rows |      1      4      8
---------|----------------------
      10 |   101%   100%    99%
     100 |   121%   114%   109%
    1000 |   403%   379%   286%
   10000 |   140%   135%   119%
  100000 |   106%   105%   102%
Those are some nice improvements, especially with 1000 rows. That makes
sense, because 1000 is the internal batch size in COPY. So the overhead
gradually increases up to 1000 rows, and then it amortizes over many
batches. It has very little impact on large COPY.
I've done the same test with a logged table. The results significantly
depend on the storage (how fast it can handle commits) - not surprising.
But with good storage the overall behavior is almost exactly the same.
I haven't yet done any benchmarking focused on cases with no or few
duplicates (although the COPY with 10 rows could qualify). But as I
explained earlier, the behavior should not change at all. There's an
extra uint32, but that's it.
regards
--
Tomas Vondra
Attachment: v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch (text/x-patch)
From a8ef30e79b8cfc85d3c5fa393a11b28e304515f5 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 16 Oct 2025 13:27:25 +0200
Subject: [PATCH v20251016] Deduplicate entries in ResourceOwner
We may track the same resource many times, but the current ResourceOwner
does not handle that efficiently. Each reference is a separate entry,
which leads to long linear searches in the simple hash table.
This adds a simple opportunistic deduplication. Entries get a counter,
tracking how many references they represent. It only applies to the
"hash" phase, not to the initial array. And when adding an entry to the
hash finds an empty slot first, it uses it (instead of continuing to
look for a possible match).
This means the deduplication is not perfect - there may be multiple
entries for the same value. But in practice it seems like a good trade
off. It won't affect cases with no duplicates, and it will help cases
with many of them.
The code also continues to grow the hash table as before. In some cases
it might be enough to deduplicate the contents. But it's hard to say in
advance, and it's not much cheaper as it still requires walking the
current hash table. So there's a risk of attempting deduplication, only
to find the hash still needs to be enlarged, doubling the cost. So we
just grow the hash table. In the worst case the memory usage is not
worse than without deduplication. If deduplication is effective, the
hash table grows later or not at all.
---
src/backend/utils/resowner/resowner.c | 55 ++++++++++++++++++++++++---
1 file changed, 49 insertions(+), 6 deletions(-)
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index fca84ded6dd..27127c99e40 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -60,10 +60,15 @@
*
* All objects managed by this code are required to fit into a Datum,
* which is fine since they are generally pointers or integers.
+ *
+ * We may see items multiple times (e.g. tuple descriptors when creating
+ * multiple tuple slots). To handle these cases efficiently we deduplicate
+ * the entries, and track the number of occurrences.
*/
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
} ResourceElem;
@@ -198,7 +203,8 @@ static ResourceReleaseCallbackItem *ResourceRelease_callbacks = NULL;
/* Internal routines */
static inline uint32 hash_resource_elem(Datum value, const ResourceOwnerDesc *kind);
static void ResourceOwnerAddToHash(ResourceOwner owner, Datum value,
- const ResourceOwnerDesc *kind);
+ const ResourceOwnerDesc *kind,
+ uint32 count);
static int resource_priority_cmp(const void *a, const void *b);
static void ResourceOwnerSort(ResourceOwner owner);
static void ResourceOwnerReleaseAll(ResourceOwner owner,
@@ -239,7 +245,8 @@ hash_resource_elem(Datum value, const ResourceOwnerDesc *kind)
* Adds 'value' of given 'kind' to the ResourceOwner's hash table
*/
static void
-ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
+ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind,
+ uint32 count)
{
uint32 mask = owner->capacity - 1;
uint32 idx;
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /* found an exact match - just increment the counter */
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
owner->hash[idx].item = value;
owner->hash[idx].kind = kind;
+ owner->hash[idx].count = count;
owner->nhash++;
}
@@ -377,6 +393,7 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
uint32 idx = nitems - 1;
Datum value = items[idx].item;
const ResourceOwnerDesc *kind = items[idx].kind;
+ uint32 count = items[idx].count;
if (kind->release_phase > phase)
break;
@@ -392,7 +409,14 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
elog(WARNING, "resource was not closed: %s", res_str);
pfree(res_str);
}
- kind->ReleaseResource(value);
+
+ /* invoke the release callback for each ocurrence */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
+
nitems--;
}
if (owner->nhash == 0)
@@ -496,7 +520,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
for (i = 0; i < oldcap; i++)
{
if (oldhash[i].kind != NULL)
- ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind);
+ ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind,
+ oldhash[i].count);
}
/* And release old hash table. */
@@ -506,7 +531,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
/* Move items from the array to the hash */
for (int i = 0; i < owner->narr; i++)
- ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind);
+ ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind,
+ owner->arr[i].count);
owner->narr = 0;
Assert(owner->nhash <= owner->grow_at);
@@ -543,6 +569,7 @@ ResourceOwnerRemember(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = owner->narr;
owner->arr[idx].item = value;
owner->arr[idx].kind = kind;
+ owner->arr[idx].count = 1;
owner->narr++;
}
@@ -597,8 +624,17 @@ ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *k
if (owner->hash[idx].item == value &&
owner->hash[idx].kind == kind)
{
+ /* an entry representing multiple items, just decrement */
+ if (owner->hash[idx].count > 1)
+ {
+ owner->hash[idx].count--;
+ return;
+ }
+
+ /* remove the entry entirely */
owner->hash[idx].item = (Datum) 0;
owner->hash[idx].kind = NULL;
+ owner->hash[idx].count = 0;
owner->nhash--;
#ifdef RESOWNER_STATS
@@ -847,12 +883,19 @@ ResourceOwnerReleaseAllOfKind(ResourceOwner owner, const ResourceOwnerDesc *kind
if (owner->hash[i].kind == kind)
{
Datum value = owner->hash[i].item;
+ uint32 count = owner->hash[i].count;
owner->hash[i].item = (Datum) 0;
owner->hash[i].kind = NULL;
+ owner->hash[i].count = 0;
owner->nhash--;
- kind->ReleaseResource(value);
+ /* invoke the release callback once for each entry */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
}
}
owner->releasing = false;
--
2.51.0
Tomas Vondra <tomas@vondra.me> writes:
The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, with O(n^2) behavior with duplicates. This
happens with COPY, because COPY creates an array of a 1000 tuple slots,
and each slot references the same tuple descriptor. And the descriptor
is added to ResourceOwner for each slot.
...
There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuples slots, the descriptor will
have just one ResourceElem entry with count=1000.
Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
that's negligible, or maybe it isn't. Can you find evidence of this
change being helpful for anything except this specific scenario in
COPY? Because we could probably find some way to avoid registering
all the doppelganger slots, if that's the only culprit.
regards, tom lane
On 10/16/25 20:12, Tom Lane wrote:
Tomas Vondra <tomas@vondra.me> writes:
The reason is pretty simple - ResourceOwner tracks the resources in a
very simple hash table, with O(n^2) behavior with duplicates. This
happens with COPY, because COPY creates an array of a 1000 tuple slots,
and each slot references the same tuple descriptor. And the descriptor
is added to ResourceOwner for each slot.
...
There's an easy way to improve this by allowing a single hash entry to
represent multiple references to the same resource. The attached patch
adds a "count" to the ResourceElem, tracking how many times that
resource was added. So if you add 1000 tuples slots, the descriptor will
have just one ResourceElem entry with count=1000.

Hmm. I don't love the 50% increase in sizeof(ResourceElem) ... maybe
that's negligible, or maybe it isn't.
I agree the +50% is not great. My plan was to maybe use a smaller data
type, say uint8 or uint16, and reduce the overhead that way (while still
massively reducing the number of entries). But I hadn't realized there
will always be extra padding, making this pointless.
Maybe we could make the struct __packed__, and put the counter last (and
maybe make it uint16)? But I don't see any other packed structs, so I
guess there's a reason not to have them.
I'm not sure if this is (or is not) negligible. In cases with duplicates
this actually saves a lot of memory, so it's a win. I'm not sure
about cases without duplicates - it'll use more memory, but I'm not sure
how many resource owners there usually are. Or how many entries they
contain. That'll probably determine if it's negligible.
My intuition was there wouldn't be that many, and that the typical
number of elements is fairly low (if not, why have the initial capacity
set to 64/128?). But maybe my intuition is wrong.
Can you find evidence of this change being helpful for anything
except this specific scenario in COPY?
I went through the ResourceOwnerRemember() calls, looking for other
cases that might create a lot of duplicates, similar to the tuple
descriptors, but I haven't found anything obvious. Other resources seem
to be either naturally unique or limited to very few duplicates.
Because we could probably find some way to avoid registering all the
doppelganger slots, if that's the only culprit.
I'm not against other approaches.
I didn't want to rework the foundations of the resource owner
management; this seemed simple, with minimal impact on cases without
duplicates (except for the struct getting larger).
thanks
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
On 10/16/25 20:12, Tom Lane wrote:
Can you find evidence of this change being helpful for anything
except this specific scenario in COPY?
I went through the ResourceOwnerRemember() calls, looking for other
cases that might create a lot of duplicates, similar to the tuple
descriptors, but I haven't found anything obvious. Other resources seem
to be either naturally unique or limited to very few duplicates.
I was thinking of adding some temporary instrumentation, like
just elog'ing whenever the count goes above 1, and seeing where
you get hits during the regression tests. I'm prepared to believe
this is worth doing, but it'd be nice to have more examples
in mind.
regards, tom lane
On 10/16/25 21:28, Tom Lane wrote:
Tomas Vondra <tomas@vondra.me> writes:
On 10/16/25 20:12, Tom Lane wrote:
Can you find evidence of this change being helpful for anything
except this specific scenario in COPY?

I went through the ResourceOwnerRemember() calls, looking for other
cases that might create a lot of duplicates, similar to the tuple
descriptors, but I haven't found anything obvious. Other resources seem
to be either naturally unique or limited to very few duplicates.

I was thinking of adding some temporary instrumentation, like
just elog'ing whenever the count goes above 1, and seeing where
you get hits during the regression tests. I'm prepared to believe
this is worth doing, but it'd be nice to have more examples
in mind.
I tried that, and that gives me ~30k log messages with (count > 1). But
that's a bit misleading, because a lot of those are the same "thing"
going from 1 to N, which produces N messages.
If I subtract all the COPY statements, loading data for regression
tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
3, mostly simple queries. Even a simple "\d t" produces a bunch of such
messages.
test=# \d t
WARNING: RESOURCEOWNER: snapshot reference 0x2e3787b0 resource owner
Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae1302fba8 resource
owner Portal count 2
WARNING: RESOURCEOWNER: tupdesc reference 0x79ae1302fec8 resource owner
Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource
owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource
owner Portal count 3
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034d88 resource
owner Portal count 2
WARNING: RESOURCEOWNER: buffer pin 0x4a resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae13034928 resource
owner Portal count 4
WARNING: RESOURCEOWNER: buffer pin 0xa resource owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae12dca6d0 resource
owner Portal count 2
WARNING: RESOURCEOWNER: relcache reference 0x79ae1303aff8 resource
owner Portal count 2
There are some more extreme ones too. For example
select infinite_recurse();
produces
WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
Portal count 1340
Another example is CREATE TABLE, which creates a batch of slots when
inserting attributes in InsertPgAttributeTuples, so that'll end up with
the count = number of attributes.
Of course, those are not particularly frequent operations. Most
applications are not doing CREATE TABLE nearly as often as DML.
But I had another idea - see how large the ResourceOwners get, which
would tell us how much "overhead" it really is. So I added logging into
ResourceOwnerDelete (without the patch), and with that regression tests
produce 113916 messages. And 113289 have the initial capacity 32, so
array only. From the remaining ~600, only 72 have capacity over 64.
So I guess the overhead should not be that bad. Actually, it would be
possible to completely eliminate the overhead for the array, because
that does not actually need the count at all.
regards
--
Tomas Vondra
Tomas Vondra <tomas@vondra.me> writes:
On 10/16/25 21:28, Tom Lane wrote:
I was thinking of adding some temporary instrumentation, like
just elog'ing whenever the count goes above 1, and seeing where
you get hits during the regression tests. I'm prepared to believe
this is worth doing, but it'd be nice to have more examples
in mind.
I tried that, and that gives me ~30k log messages with (count > 1). But
that's a bit misleading, because a lot of those are the same "thing"
going from 1 to N, which produces N messages.
If I subtract all the COPY statements, loading data for regression
tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
3, mostly simple queries. Even a simple "\d t" produces a bunch of such
messages.
Okay, so we definitely do have other cases where the count will be
more than 1.
There are some more extreme ones too. For example
select infinite_recurse();
produces
WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
Portal count 1340
Makes me wonder if we shouldn't make the count int64, just to remove
all worries about overflow. That'd be free on 64-bit machines ...
But I had another idea - see how large the ResourceOwners get, which
would tell us how much "overhead" it really is. So I added logging into
ResourceOwnerDelete (without the patch), and with that regression tests
produce 113916 messages. And 113289 have the initial capacity 32, so
array only. From the remaining ~600, only 72 have capacity over 64.
Ah, that's fairly convincing. Seems like we can move ahead with this.
regards, tom lane
On 10/17/25 00:17, Tom Lane wrote:
Tomas Vondra <tomas@vondra.me> writes:
On 10/16/25 21:28, Tom Lane wrote:
I was thinking of adding some temporary instrumentation, like
just elog'ing whenever the count goes above 1, and seeing where
you get hits during the regression tests. I'm prepared to believe
this is worth doing, but it'd be nice to have more examples
in mind.

I tried that, and that gives me ~30k log messages with (count > 1). But
that's a bit misleading, because a lot of those are the same "thing"
going from 1 to N, which produces N messages.

If I subtract all the COPY statements, loading data for regression
tests, that leaves ~7500 cases. There's a lot of cases with count 2 or
3, mostly simple queries. Even a simple "\d t" produces a bunch of such
messages.

Okay, so we definitely do have other cases where the count will be
more than 1.

There are some more extreme ones too. For example
select infinite_recurse();
produces
WARNING: RESOURCEOWNER: plancache reference 0x34555828 resource owner
Portal count 1340

Makes me wonder if we shouldn't make the count int64, just to remove
all worries about overflow. That'd be free on 64-bit machines ...
We may do that, if it's free. But I doubt there's any risk of overflow
in practice. If we really wanted to support that many entries, we
wouldn't be allocating the hash table using MemoryContextAllocZero().
That puts the maximum number of unique entries at ~50M.
But I had another idea - see how large the ResourceOwners get, which
would tell us how much "overhead" it really is. So I added logging into
ResourceOwnerDelete (without the patch), and with that regression tests
produce 113916 messages. And 113289 have the initial capacity 32, so
array only. From the remaining ~600, only 72 have capacity over 64.

Ah, that's fairly convincing. Seems like we can move ahead with this.
Thanks for the feedback. I'll let it sit for a while, there's no rush to
get this committed.
regards
--
Tomas Vondra
On Oct 17, 2025, at 01:46, Tomas Vondra <tomas@vondra.me> wrote:
--
Tomas Vondra <v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch>
Nice patch! I eyeball reviewed the patch, only got a few small comments:
1
```
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /* found an exact match - just increment the counter */
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
```
I think this is the “trade-off” you mention in your email: if a free slot is found earlier, you can still get duplicated entries. I have no concern about this trade-off, but please add a comment here; otherwise future readers may be confused and might consider it a bug without reading your original design email.
2
```
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
```
Nit suggestion: people usually name this type of count “refcnt”, which looks more meaningful. But I don't have a strong opinion here; I'm absolutely fine if you don't want to change it.
Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/
On 17/10/2025 06:13, Chao Li wrote:
```
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
 	idx = hash_resource_elem(value, kind) & mask;
 	for (;;)
 	{
+		/* found an exact match - just increment the counter */
+		if ((owner->hash[idx].kind == kind) &&
+			(owner->hash[idx].item == value))
+		{
+			owner->hash[idx].count += count;
+			return;
+		}
+
 		if (owner->hash[idx].kind == NULL)
 			break;			/* found a free slot */
 		idx = (idx + 1) & mask;
 	}
```

I think this is the “trade-off” you mention in your email: if a free
slot is found earlier, you can still get duplicated entries. I have no
concern about this trade-off, but please add a comment here; otherwise
future readers may be confused and might consider it a bug without
reading your original design email.
+1 on such a comment. Here or maybe on the ResourceElem struct itself.
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
} ResourceElem;
Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.
/*
* Forget that an object is owned by a ResourceOwner
*
* Note: If same resource ID is associated with the ResourceOwner more
* than once, one instance is removed.
*
* Note: Forgetting a resource does not guarantee that there is room to
* remember a new resource. One exception is when you forget the most
* recently remembered resource; that does make room for a new remember call.
* Some code callers rely on that exception.
*/
void
ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
Does this break the exception noted in the above comment? I guess it
still holds because we don't deduplicate entries in the array. That's
very subtle, needs a comment at least.
typo: ocurrence -> occurrence
- Heikki
On 10/17/25 10:31, Heikki Linnakangas wrote:
On 17/10/2025 06:13, Chao Li wrote:
```
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
 	idx = hash_resource_elem(value, kind) & mask;
 	for (;;)
 	{
+		/* found an exact match - just increment the counter */
+		if ((owner->hash[idx].kind == kind) &&
+			(owner->hash[idx].item == value))
+		{
+			owner->hash[idx].count += count;
+			return;
+		}
+
 		if (owner->hash[idx].kind == NULL)
 			break;			/* found a free slot */
 		idx = (idx + 1) & mask;
 	}
```

I think this is the “trade-off” you mention in your email: if a free
slot is found earlier, you can still get duplicated entries. I have no
concern about this trade-off, but please add a comment here; otherwise
future readers may be confused and might consider it a bug without
reading your original design email.

+1 on such a comment. Here or maybe on the ResourceElem struct itself.
Will do. It's probably appropriate to mention this deduplication in the
comment at the beginning of the file.
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash
table slot */
} ResourceElem;

Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.
Agreed. I had the same idea yesterday, but I haven't done it yet.
/*
* Forget that an object is owned by a ResourceOwner
*
* Note: If same resource ID is associated with the ResourceOwner more
* than once, one instance is removed.
*
* Note: Forgetting a resource does not guarantee that there is room to
* remember a new resource. One exception is when you forget the most
* recently remembered resource; that does make room for a new
* remember call.
* Some code callers rely on that exception.
*/
void
ResourceOwnerForget(ResourceOwner owner, Datum value, const
ResourceOwnerDesc *kind)

Does this break the exception noted in the above comment? I guess it
still holds because we don't deduplicate entries in the array. That's
very subtle, needs a comment at least.
Right, it doesn't break the exception - as you say, it only applies to
the hash, not to the array. I'll add a comment.
Thanks
--
Tomas Vondra
On 10/17/25 12:32, Tomas Vondra wrote:
On 10/17/25 10:31, Heikki Linnakangas wrote:
On 17/10/2025 06:13, Chao Li wrote:
```
@@ -250,12 +257,21 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
 	idx = hash_resource_elem(value, kind) & mask;
 	for (;;)
 	{
+		/* found an exact match - just increment the counter */
+		if ((owner->hash[idx].kind == kind) &&
+			(owner->hash[idx].item == value))
+		{
+			owner->hash[idx].count += count;
+			return;
+		}
+
 		if (owner->hash[idx].kind == NULL)
 			break;			/* found a free slot */
 		idx = (idx + 1) & mask;
 	}
```

I think this is the “trade-off” you mention in your email: if a free
slot is found earlier, you can still get duplicated entries. I have no
concern about this trade-off, but please add a comment here; otherwise
future readers may be confused and might consider it a bug without
reading your original design email.

+1 on such a comment. Here or maybe on the ResourceElem struct itself.
Will do. It's probably appropriate to mention this deduplication in the
comment at the beginning of the file.
I added a comment to the file comment, and clarified the comment when
adding the hash table entry. But I tried not to be too verbose there.
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash
table slot */
} ResourceElem;

Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.

Agreed. I had the same idea yesterday, but I haven't done it yet.
The attached v2 does that - it adds a separate ResourceHashElem for the
hash table, and it works. But I'm not sure I like it very much, because
there were two places that relied on the array and the hash table using
the same struct, so they could be walked the same way.
For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
now duplicates most of the code. It's not terrible, but also not pretty.
I can't think of a better way, though.
/*
* Forget that an object is owned by a ResourceOwner
*
* Note: If same resource ID is associated with the ResourceOwner more
* than once, one instance is removed.
*
* Note: Forgetting a resource does not guarantee that there is room to
* remember a new resource. One exception is when you forget the most
* recently remembered resource; that does make room for a new
* remember call.
* Some code callers rely on that exception.
*/
void
ResourceOwnerForget(ResourceOwner owner, Datum value, const
ResourceOwnerDesc *kind)

Does this break the exception noted in the above comment? I guess it
still holds because we don't deduplicate entries in the array. That's
very subtle, needs a comment at least.

Right, it doesn't break the exception - as you say, it only applies to
the hash, not to the array. I'll add a comment.
Comment added.
regards
--
Tomas Vondra
Attachment: v2-0001-Deduplicate-entries-in-ResourceOwner.patch (text/x-patch)
From bbb37408e5496fd011cf460279cf9ba2533c4fd6 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 16 Oct 2025 13:27:25 +0200
Subject: [PATCH v2] Deduplicate entries in ResourceOwner
We may track the same resource many times, but the current ResourceOwner
does not handle that efficiently. Each reference is a separate entry,
which leads to long linear searches in the simple hash table.
This adds a simple opportunistic deduplication. Entries get a counter,
tracking how many references they represent. It only applies to the
"hash" phase, not to the initial array. And when adding an entry to the
hash finds an empty slot first, it uses it (instead of continuing to
look for a possible match).
This means the deduplication is not perfect - there may be multiple
entries for the same value. But in practice it seems like a good trade
off. It won't affect cases with no duplicates, and it will help cases
with many of them.
The code also continues to grow the hash table as before. In some cases
it might be enough to deduplicate the contents. But it's hard to say in
advance, and it's not much cheaper as it still requires walking the
current hash table. So there's a risk of attempting deduplication, only
to find the hash still needs to be enlarged, doubling the cost. So we
just grow the hash table. In the worst case the memory usage is not
worse than without deduplication. If deduplication is effective, the
hash table grows later or not at all.
---
src/backend/utils/resowner/resowner.c | 215 ++++++++++++++++++++------
1 file changed, 166 insertions(+), 49 deletions(-)
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index fca84ded6dd..055426c7761 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -15,6 +15,12 @@
* a particular reference, you need to search both the array and the hash
* table.
*
+ * The hash table deduplicates entries in an opportunistic way. If it finds
+ * an entry for the same resource (same kind), it increments a counter instead.
+ * But if it finds an empty slot first, it uses the slot. If the table gets
+ * enlarged, those entries will be merged, but it's possible to have multiple
+ * entries for the same resource. The fixed-size array does not deduplicate.
+ *
* The most frequent usage is that a resource is remembered, and forgotten
* shortly thereafter. For example, pin a buffer, read one tuple from it,
* release the pin. Linearly scanning the small array handles that case
@@ -67,6 +73,18 @@ typedef struct ResourceElem
const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
} ResourceElem;
+/*
+ * ResourceHashElem represents a reference associated with a resource owner,
+ * stored in the hash table. Similar to ResourceElem, but has a counter field
+ * tracking how many references it represents, to allow deduplication.
+ */
+typedef struct ResourceHashElem
+{
+ Datum item;
+ uint32 count; /* number of references */
+ const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
+} ResourceHashElem;
+
/*
* Size of the fixed-size array to hold most-recently remembered resources.
*/
@@ -151,7 +169,7 @@ struct ResourceOwnerData
* release priority. The first 'nhash' elements are occupied, the rest
* are empty.
*/
- ResourceElem *hash;
+ ResourceHashElem *hash;
uint32 capacity; /* allocated length of hash[] */
uint32 grow_at; /* grow hash when reach this */
@@ -198,7 +216,8 @@ static ResourceReleaseCallbackItem *ResourceRelease_callbacks = NULL;
/* Internal routines */
static inline uint32 hash_resource_elem(Datum value, const ResourceOwnerDesc *kind);
static void ResourceOwnerAddToHash(ResourceOwner owner, Datum value,
- const ResourceOwnerDesc *kind);
+ const ResourceOwnerDesc *kind,
+ uint32 count);
static int resource_priority_cmp(const void *a, const void *b);
static void ResourceOwnerSort(ResourceOwner owner);
static void ResourceOwnerReleaseAll(ResourceOwner owner,
@@ -239,7 +258,8 @@ hash_resource_elem(Datum value, const ResourceOwnerDesc *kind)
* Adds 'value' of given 'kind' to the ResourceOwner's hash table
*/
static void
-ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
+ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind,
+ uint32 count)
{
uint32 mask = owner->capacity - 1;
uint32 idx;
@@ -250,12 +270,25 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /*
+ * Increment counter for a matching entry or use an empty slot,
+ * whichever we find first.
+ */
+
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
owner->hash[idx].item = value;
owner->hash[idx].kind = kind;
+ owner->hash[idx].count = count;
owner->nhash++;
}
@@ -277,6 +310,24 @@ resource_priority_cmp(const void *a, const void *b)
return 1;
}
+/*
+ * Comparison function to sort by release phase and priority
+ */
+static int
+resource_priority_hash_cmp(const void *a, const void *b)
+{
+ const ResourceHashElem *ra = (const ResourceHashElem *) a;
+ const ResourceHashElem *rb = (const ResourceHashElem *) b;
+
+ /* Note: reverse order */
+ if (ra->kind->release_phase == rb->kind->release_phase)
+ return pg_cmp_u32(rb->kind->release_priority, ra->kind->release_priority);
+ else if (ra->kind->release_phase > rb->kind->release_phase)
+ return -1;
+ else
+ return 1;
+}
+
/*
* Sort resources in reverse release priority.
*
@@ -288,16 +339,21 @@ resource_priority_cmp(const void *a, const void *b)
static void
ResourceOwnerSort(ResourceOwner owner)
{
- ResourceElem *items;
- uint32 nitems;
-
if (owner->nhash == 0)
{
+ ResourceElem *items;
+ uint32 nitems;
+
items = owner->arr;
nitems = owner->narr;
+
+ qsort(items, nitems, sizeof(ResourceElem), resource_priority_cmp);
}
else
{
+ ResourceHashElem *items;
+ uint32 nitems;
+
/*
* Compact the hash table, so that all the elements are in the
* beginning of the 'hash' array, with no empty elements.
@@ -324,7 +380,9 @@ ResourceOwnerSort(ResourceOwner owner)
Assert(dst + owner->narr <= owner->capacity);
for (int idx = 0; idx < owner->narr; idx++)
{
- owner->hash[dst] = owner->arr[idx];
+ owner->hash[dst].item = owner->arr[idx].item;
+ owner->hash[dst].kind = owner->arr[idx].kind;
+ owner->hash[dst].count = 1;
dst++;
}
Assert(dst == owner->nhash + owner->narr);
@@ -333,9 +391,9 @@ ResourceOwnerSort(ResourceOwner owner)
items = owner->hash;
nitems = owner->nhash;
- }
- qsort(items, nitems, sizeof(ResourceElem), resource_priority_cmp);
+ qsort(items, nitems, sizeof(ResourceHashElem), resource_priority_hash_cmp);
+ }
}
/*
@@ -345,9 +403,6 @@ static void
ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
bool printLeakWarnings)
{
- ResourceElem *items;
- uint32 nitems;
-
/*
* ResourceOwnerSort must've been called already. All the resources are
* either in the array or the hash.
@@ -356,49 +411,93 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
Assert(owner->sorted);
if (owner->nhash == 0)
{
+ ResourceElem *items;
+ uint32 nitems;
+
items = owner->arr;
nitems = owner->narr;
+
+ /*
+ * The resources are sorted in reverse priority order. Release them
+ * starting from the end, until we hit the end of the phase that we are
+ * releasing now. We will continue from there when called again for the
+ * next phase.
+ */
+ while (nitems > 0)
+ {
+ uint32 idx = nitems - 1;
+ Datum value = items[idx].item;
+ const ResourceOwnerDesc *kind = items[idx].kind;
+
+ if (kind->release_phase > phase)
+ break;
+ Assert(kind->release_phase == phase);
+
+ if (printLeakWarnings)
+ {
+ char *res_str;
+
+ res_str = kind->DebugPrint ?
+ kind->DebugPrint(value)
+ : psprintf("%s %p", kind->name, DatumGetPointer(value));
+ elog(WARNING, "resource was not closed: %s", res_str);
+ pfree(res_str);
+ }
+
+ kind->ReleaseResource(value);
+
+ nitems--;
+ }
+
+ owner->narr = nitems;
}
else
{
+ ResourceHashElem *items;
+ uint32 nitems;
+
Assert(owner->narr == 0);
items = owner->hash;
nitems = owner->nhash;
- }
- /*
- * The resources are sorted in reverse priority order. Release them
- * starting from the end, until we hit the end of the phase that we are
- * releasing now. We will continue from there when called again for the
- * next phase.
- */
- while (nitems > 0)
- {
- uint32 idx = nitems - 1;
- Datum value = items[idx].item;
- const ResourceOwnerDesc *kind = items[idx].kind;
+ /*
+ * The resources are sorted in reverse priority order. Release them
+ * starting from the end, until we hit the end of the phase that we are
+ * releasing now. We will continue from there when called again for the
+ * next phase.
+ */
+ while (nitems > 0)
+ {
+ uint32 idx = nitems - 1;
+ Datum value = items[idx].item;
+ const ResourceOwnerDesc *kind = items[idx].kind;
+ uint32 count = items[idx].count;
- if (kind->release_phase > phase)
- break;
- Assert(kind->release_phase == phase);
+ if (kind->release_phase > phase)
+ break;
+ Assert(kind->release_phase == phase);
- if (printLeakWarnings)
- {
- char *res_str;
+ if (printLeakWarnings)
+ {
+ char *res_str;
+
+ res_str = kind->DebugPrint ?
+ kind->DebugPrint(value)
+ : psprintf("%s %p", kind->name, DatumGetPointer(value));
+ elog(WARNING, "resource was not closed: %s", res_str);
+ pfree(res_str);
+ }
+
+ /* invoke the release callback for each reference */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
- res_str = kind->DebugPrint ?
- kind->DebugPrint(value)
- : psprintf("%s %p", kind->name, DatumGetPointer(value));
- elog(WARNING, "resource was not closed: %s", res_str);
- pfree(res_str);
+ nitems--;
}
- kind->ReleaseResource(value);
- nitems--;
- }
- if (owner->nhash == 0)
- owner->narr = nitems;
- else
+
owner->nhash = nitems;
+ }
}
@@ -466,16 +565,16 @@ ResourceOwnerEnlarge(ResourceOwner owner)
uint32 i,
oldcap,
newcap;
- ResourceElem *oldhash;
- ResourceElem *newhash;
+ ResourceHashElem *oldhash;
+ ResourceHashElem *newhash;
oldhash = owner->hash;
oldcap = owner->capacity;
/* Double the capacity (it must stay a power of 2!) */
newcap = (oldcap > 0) ? oldcap * 2 : RESOWNER_HASH_INIT_SIZE;
- newhash = (ResourceElem *) MemoryContextAllocZero(TopMemoryContext,
- newcap * sizeof(ResourceElem));
+ newhash = (ResourceHashElem *) MemoryContextAllocZero(TopMemoryContext,
+ newcap * sizeof(ResourceHashElem));
/*
* We assume we can't fail below this point, so OK to scribble on the
@@ -496,7 +595,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
for (i = 0; i < oldcap; i++)
{
if (oldhash[i].kind != NULL)
- ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind);
+ ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind,
+ oldhash[i].count);
}
/* And release old hash table. */
@@ -506,7 +606,7 @@ ResourceOwnerEnlarge(ResourceOwner owner)
/* Move items from the array to the hash */
for (int i = 0; i < owner->narr; i++)
- ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind);
+ ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind, 1);
owner->narr = 0;
Assert(owner->nhash <= owner->grow_at);
@@ -555,7 +655,8 @@ ResourceOwnerRemember(ResourceOwner owner, Datum value, const ResourceOwnerDesc
* Note: Forgetting a resource does not guarantee that there is room to
* remember a new resource. One exception is when you forget the most
* recently remembered resource; that does make room for a new remember call.
- * Some code callers rely on that exception.
+ * Some code callers rely on that exception. The deduplication does not
+ * change this, as it only applies to the hash table.
*/
void
ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
@@ -597,8 +698,17 @@ ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *k
if (owner->hash[idx].item == value &&
owner->hash[idx].kind == kind)
{
+ /* decrement the count for an entry with multiple references */
+ if (owner->hash[idx].count > 1)
+ {
+ owner->hash[idx].count--;
+ return;
+ }
+
+ /* remove the entry entirely */
owner->hash[idx].item = (Datum) 0;
owner->hash[idx].kind = NULL;
+ owner->hash[idx].count = 0;
owner->nhash--;
#ifdef RESOWNER_STATS
@@ -847,12 +957,19 @@ ResourceOwnerReleaseAllOfKind(ResourceOwner owner, const ResourceOwnerDesc *kind
if (owner->hash[i].kind == kind)
{
Datum value = owner->hash[i].item;
+ uint32 count = owner->hash[i].count;
owner->hash[i].item = (Datum) 0;
owner->hash[i].kind = NULL;
+ owner->hash[i].count = 0;
owner->nhash--;
- kind->ReleaseResource(value);
+ /* invoke the release callback once for each reference */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
}
}
owner->releasing = false;
--
2.51.0
On 18/10/2025 01:49, Tomas Vondra wrote:
On 10/17/25 12:32, Tomas Vondra wrote:
On 10/17/25 10:31, Heikki Linnakangas wrote:
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash
table slot */
} ResourceElem;
Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.
Agreed. I had the same idea yesterday, but I haven't done it yet.
The attached v2 does that - it adds a separate ResourceHashElem for the
hash table, and it works. But I'm not sure I like it very much, because
there are two places that relied on both the array and hash table using
the same struct to "walk" it the same way.
For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
now duplicates most of the code. It's not terrible, but also not pretty.
I can't think of a better way, though.
Looks fine to me. The code duplication is not too bad IMO.
Here's a rebased version of the micro-benchmark I used when I was
working on the ResourceOwner refactoring
(https://www.postgresql.org/message-id/d746cead-a1ef-7efe-fb47-933311e876a3@iki.fi).
I ran it again on my laptop. Different from the one I used back then, so
the results are not comparable with the results from that old thread.
Unpatched (commit 18d26140934):
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.6 | 11.1
0 | 5 | 12.1 | 13.1
0 | 10 | 12.3 | 13.5
0 | 60 | 14.6 | 19.4
0 | 70 | 16.0 | 18.1
0 | 100 | 16.7 | 18.0
0 | 1000 | 18.1 | 20.7
0 | 10000 | 21.9 | 29.5
9 | 10 | 11.0 | 11.1
9 | 100 | 14.9 | 20.0
9 | 1000 | 16.1 | 24.4
9 | 10000 | 21.9 | 25.7
65 | 70 | 11.7 | 12.5
65 | 100 | 13.9 | 14.8
65 | 1000 | 16.7 | 17.8
65 | 10000 | 22.5 | 27.8
(16 rows)
v2-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 10.8 | 10.6
0 | 5 | 11.5 | 12.3
0 | 10 | 12.1 | 13.0
0 | 60 | 13.9 | 19.4
0 | 70 | 15.9 | 18.7
0 | 100 | 16.0 | 18.5
0 | 1000 | 19.2 | 21.6
0 | 10000 | 22.4 | 29.0
9 | 10 | 11.2 | 11.3
9 | 100 | 14.4 | 19.9
9 | 1000 | 16.4 | 23.8
9 | 10000 | 22.4 | 25.7
65 | 70 | 11.4 | 12.1
65 | 100 | 14.8 | 17.0
65 | 1000 | 16.9 | 18.1
65 | 10000 | 22.5 | 28.5
(16 rows)
v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.3 | 11.1
0 | 5 | 12.3 | 13.0
0 | 10 | 13.0 | 14.1
0 | 60 | 14.7 | 20.5
0 | 70 | 16.3 | 19.0
0 | 100 | 16.5 | 18.4
0 | 1000 | 19.0 | 22.4
0 | 10000 | 23.2 | 29.6
9 | 10 | 11.2 | 11.1
9 | 100 | 14.8 | 20.5
9 | 1000 | 16.8 | 24.3
9 | 10000 | 23.3 | 26.5
65 | 70 | 12.4 | 13.0
65 | 100 | 15.2 | 16.6
65 | 1000 | 16.9 | 18.4
65 | 10000 | 23.4 | 29.3
(16 rows)
These are just a single run on my laptop, the error bars on individual
numbers are significant. But it seems to me that V2 is maybe a little
faster when the entries fit in the array.
- Heikki
Attachments:
v3-0001-resownerbench.patch (text/x-patch; charset=UTF-8)
From c19d37c10cbe441fcd272bc67befca11d46fee21 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 21 Oct 2025 09:51:23 +0300
Subject: [PATCH v3 1/1] resownerbench
---
contrib/Makefile | 1 +
contrib/meson.build | 2 +
contrib/resownerbench/Makefile | 17 +++
contrib/resownerbench/meson.build | 24 +++
contrib/resownerbench/resownerbench.c | 154 ++++++++++++++++++++
contrib/resownerbench/resownerbench.control | 6 +
contrib/resownerbench/snaptest.sql | 37 +++++
7 files changed, 241 insertions(+)
create mode 100644 contrib/resownerbench/Makefile
create mode 100644 contrib/resownerbench/meson.build
create mode 100644 contrib/resownerbench/resownerbench.c
create mode 100644 contrib/resownerbench/resownerbench.control
create mode 100644 contrib/resownerbench/snaptest.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..f84b26d52cf 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -43,6 +43,7 @@ SUBDIRS = \
pg_visibility \
pg_walinspect \
postgres_fdw \
+ resownerbench \
seg \
spi \
tablefunc \
diff --git a/contrib/meson.build b/contrib/meson.build
index ed30ee7d639..9c34bb128b7 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -71,3 +71,5 @@ subdir('unaccent')
subdir('uuid-ossp')
subdir('vacuumlo')
subdir('xml2')
+
+subdir('resownerbench')
diff --git a/contrib/resownerbench/Makefile b/contrib/resownerbench/Makefile
new file mode 100644
index 00000000000..9b0e1cfee1a
--- /dev/null
+++ b/contrib/resownerbench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = resownerbench
+OBJS = resownerbench.o
+
+EXTENSION = resownerbench
+DATA = resownerbench--1.0.sql
+PGFILEDESC = "resownerbench - benchmark for ResourceOwners"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/resownerbench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/resownerbench/meson.build b/contrib/resownerbench/meson.build
new file mode 100644
index 00000000000..30051c3a375
--- /dev/null
+++ b/contrib/resownerbench/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+resownerbench_sources = files(
+ 'resownerbench.c',
+)
+
+resownerbench = shared_module('resownerbench',
+ resownerbench_sources,
+ c_pch: pch_postgres_h,
+ kwargs: contrib_mod_args,
+)
+contrib_targets += resownerbench
+
+install_data(
+ 'resownerbench.control',
+ 'resownerbench--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'resownerbench',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+}
diff --git a/contrib/resownerbench/resownerbench.c b/contrib/resownerbench/resownerbench.c
new file mode 100644
index 00000000000..80f98c2105b
--- /dev/null
+++ b/contrib/resownerbench/resownerbench.c
@@ -0,0 +1,154 @@
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+#include "utils/snapmgr.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(snapshotbench_lifo);
+PG_FUNCTION_INFO_V1(snapshotbench_fifo);
+
+/*
+ * ResourceOwner Performance test, using RegisterSnapshot().
+ *
+ * This takes three parameters: numkeep, numsnaps, numiters.
+ *
+ * First, we register 'numkeep' snapshots. They are kept registered
+ * until the end of the test. Then, we repeatedly register and
+ * unregister 'numsnaps - numkeep' additional snapshots, repeating
+ * 'numiters' times. All the register/unregister calls are made in
+ * LIFO order.
+ *
+ * Returns the time spent, in milliseconds.
+ *
+ * The idea is to test the performance of ResourceOwnerRemember()
+ * and ResourceOwnerForget() operations, under different regimes.
+ *
+ * In the old implementation, if 'numsnaps' is small enough, all
+ * the entries fit in the resource owner's small array (it can
+ * hold 64 entries).
+ *
+ * In the new implementation, the array is much smaller, only 8
+ * entries, but it's used together with the hash table so that
+ * we stay in the "array regime" as long as 'numsnaps - numkeep'
+ * is smaller than 8 entries.
+ *
+ * 'numiters' can be adjusted to make the overall runtime
+ * suitably long.
+ */
+Datum
+snapshotbench_lifo(PG_FUNCTION_ARGS)
+{
+ int numkeep = PG_GETARG_INT32(0);
+ int numsnaps = PG_GETARG_INT32(1);
+ int numiters = PG_GETARG_INT32(2);
+ int i;
+ instr_time start,
+ duration;
+ Snapshot lsnap;
+ Snapshot *rs;
+ int numregistered = 0;
+
+ rs = palloc(Max(numsnaps, numkeep) * sizeof(Snapshot));
+
+ lsnap = GetLatestSnapshot();
+
+ sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+ INSTR_TIME_SET_CURRENT(start);
+
+ while (numregistered < numkeep)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ for (i = 0 ; i < numiters; i++)
+ {
+ while (numregistered < numsnaps)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ while (numregistered > numkeep)
+ {
+ numregistered--;
+ UnregisterSnapshot(rs[numregistered]);
+ }
+ }
+
+ while (numregistered > 0)
+ {
+ numregistered--;
+ UnregisterSnapshot(rs[numregistered]);
+ }
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ PG_RETURN_FLOAT8(INSTR_TIME_GET_MILLISEC(duration));
+};
+
+
+/*
+ * Same, but do the register/unregister operations in
+ * FIFO order.
+ */
+Datum
+snapshotbench_fifo(PG_FUNCTION_ARGS)
+{
+ int numkeep = PG_GETARG_INT32(0);
+ int numsnaps = PG_GETARG_INT32(1);
+ int numiters = PG_GETARG_INT32(2);
+ int i,
+ j;
+ instr_time start,
+ duration;
+ Snapshot lsnap;
+ Snapshot *rs;
+ int numregistered = 0;
+
+ rs = palloc(Max(numsnaps, numkeep) * sizeof(Snapshot));
+
+ lsnap = GetLatestSnapshot();
+
+ sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+ INSTR_TIME_SET_CURRENT(start);
+
+ while (numregistered < numkeep)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ for (i = 0 ; i < numiters; i++)
+ {
+ while (numregistered < numsnaps)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ for (j = numkeep; j < numregistered; j++)
+ UnregisterSnapshot(rs[j]);
+ numregistered = numkeep;
+ }
+
+ for (j = 0; j < numregistered; j++)
+ UnregisterSnapshot(rs[j]);
+ numregistered = numkeep;
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ PG_RETURN_FLOAT8(INSTR_TIME_GET_MILLISEC(duration));
+};
diff --git a/contrib/resownerbench/resownerbench.control b/contrib/resownerbench/resownerbench.control
new file mode 100644
index 00000000000..ada88b8eed8
--- /dev/null
+++ b/contrib/resownerbench/resownerbench.control
@@ -0,0 +1,6 @@
+# resownerbench
+
+comment = 'benchmark for ResourceOwners'
+default_version = '1.0'
+module_pathname = '$libdir/resownerbench'
+relocatable = true
diff --git a/contrib/resownerbench/snaptest.sql b/contrib/resownerbench/snaptest.sql
new file mode 100644
index 00000000000..e3c0a9710ee
--- /dev/null
+++ b/contrib/resownerbench/snaptest.sql
@@ -0,0 +1,37 @@
+--
+-- Performance test RegisterSnapshot/UnregisterSnapshot.
+--
+select numkeep, numsnaps,
+ -- numiters,
+ -- round(lifo_time_ms) as lifo_total_time_ms,
+ -- round(fifo_time_ms) as fifo_total_time_ms,
+ round((lifo_time_ms::numeric / (numkeep + (numsnaps - numkeep) * numiters)) * 1000000, 1) as lifo_time_ns,
+ round((fifo_time_ms::numeric / (numkeep + (numsnaps - numkeep) * numiters)) * 1000000, 1) as fifo_time_ns
+from
+(values (0, 1, 10000000 * 10),
+ (0, 5, 2000000 * 10),
+ (0, 10, 1000000 * 10),
+ (0, 60, 100000 * 10),
+ (0, 70, 100000 * 10),
+ (0, 100, 100000 * 10),
+ (0, 1000, 10000 * 10),
+ (0, 10000, 1000 * 10),
+
+-- These tests keep 9 snapshots registered across the iterations. That
+-- exceeds the size of the little array in the patch, so this exercises
+-- the hash lookups. Without the patch, these still fit in the array
+-- (it's 64 entries without the patch)
+ (9, 10, 10000000 * 10),
+ (9, 100, 100000 * 10),
+ (9, 1000, 10000 * 10),
+ (9, 10000, 1000 * 10),
+
+-- These exceed the 64 entry array even without the patch, so these fall
+-- in the hash table regime with and without the patch.
+ (65, 70, 1000000 * 10),
+ (65, 100, 100000 * 10),
+ (65, 1000, 10000 * 10),
+ (65, 10000, 1000 * 10)
+) AS params (numkeep, numsnaps, numiters),
+lateral snapshotbench_lifo(numkeep, numsnaps, numiters) as lifo_time_ms,
+lateral snapshotbench_fifo(numkeep, numsnaps, numiters) as fifo_time_ms;
--
2.47.3
On 10/21/25 09:10, Heikki Linnakangas wrote:
On 18/10/2025 01:49, Tomas Vondra wrote:
On 10/17/25 12:32, Tomas Vondra wrote:
On 10/17/25 10:31, Heikki Linnakangas wrote:
typedef struct ResourceElem
{
Datum item;
+ uint32 count; /* number of occurrences */
const ResourceOwnerDesc *kind; /* NULL indicates a free hash
table slot */
} ResourceElem;Hmm, the 'count' is not used when the entry is stored in the array.
Perhaps we should have a separate struct for array and hash elements
now. Keeping the array small helps it to fit in CPU caches.Agreed. I had the same idea yesterday, but I haven't done it yet.
The attached v2 does that - it adds a separate ResourceHashElem for the
has table, and it works. But I'm not sure I like it very much, because
there are two places that relied on both the array and hash table using
the same struct to "walk" it the same way.For ResourceOwnerSort() it's not too bad, but ResourceOwnerReleaseAll()
now duplicates most of the code. It's not terrible, but also not pretty.
I can't think of a better way, though.Looks fine to me. The code duplication is not too bad IMO.
Here's a rebased version of the micro-benchmark I used when I was
working on the ResourceOwner refactoring (https://www.postgresql.org/
message-id/d746cead-a1ef-7efe-fb47-933311e876a3%40iki.fi).I ran it again on my laptop. Different from the one I used back then, so
the results are not comparable with the results from that old thread.Unpatched (commit 18d26140934):
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.6 | 11.1
0 | 5 | 12.1 | 13.1
0 | 10 | 12.3 | 13.5
0 | 60 | 14.6 | 19.4
0 | 70 | 16.0 | 18.1
0 | 100 | 16.7 | 18.0
0 | 1000 | 18.1 | 20.7
0 | 10000 | 21.9 | 29.5
9 | 10 | 11.0 | 11.1
9 | 100 | 14.9 | 20.0
9 | 1000 | 16.1 | 24.4
9 | 10000 | 21.9 | 25.7
65 | 70 | 11.7 | 12.5
65 | 100 | 13.9 | 14.8
65 | 1000 | 16.7 | 17.8
65 | 10000 | 22.5 | 27.8
(16 rows)v2-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 10.8 | 10.6
0 | 5 | 11.5 | 12.3
0 | 10 | 12.1 | 13.0
0 | 60 | 13.9 | 19.4
0 | 70 | 15.9 | 18.7
0 | 100 | 16.0 | 18.5
0 | 1000 | 19.2 | 21.6
0 | 10000 | 22.4 | 29.0
9 | 10 | 11.2 | 11.3
9 | 100 | 14.4 | 19.9
9 | 1000 | 16.4 | 23.8
9 | 10000 | 22.4 | 25.7
65 | 70 | 11.4 | 12.1
65 | 100 | 14.8 | 17.0
65 | 1000 | 16.9 | 18.1
65 | 10000 | 22.5 | 28.5
(16 rows)v20251016-0001-Deduplicate-entries-in-ResourceOwner.patch:
postgres=# \i contrib/resownerbench/snaptest.sql
numkeep | numsnaps | lifo_time_ns | fifo_time_ns
---------+----------+--------------+--------------
0 | 1 | 11.3 | 11.1
0 | 5 | 12.3 | 13.0
0 | 10 | 13.0 | 14.1
0 | 60 | 14.7 | 20.5
0 | 70 | 16.3 | 19.0
0 | 100 | 16.5 | 18.4
0 | 1000 | 19.0 | 22.4
0 | 10000 | 23.2 | 29.6
9 | 10 | 11.2 | 11.1
9 | 100 | 14.8 | 20.5
9 | 1000 | 16.8 | 24.3
9 | 10000 | 23.3 | 26.5
65 | 70 | 12.4 | 13.0
65 | 100 | 15.2 | 16.6
65 | 1000 | 16.9 | 18.4
65 | 10000 | 23.4 | 29.3
(16 rows)These are just a single run on my laptop, the error bars on individual
numbers are significant. But it seems to me that V2 is maybe a little
faster when the entries fit in the array.
Thanks. Attached is a version that adds a missing .sql file defining the
benchmark functions, and then two patches. The v2 is the patch already
shared on Saturday, v3 is a minor tweak (details in a minute).
I ran the benchmark on my test machine with both v1 and v2, with 10
runs. And there seems to be a small regression of ~5-10% (compared to
master). Which is not terrible, but also not negligible. v0 is master.
Here's the "fifo" results:
keep snaps | v0 v1 v2 v3 | v1 v2 v3
==================================================================
0 1 | 10.57 10.70 10.73 10.5 | 101% 102% 99%
5 | 10.52 10.70 10.90 10.5 | 102% 104% 100%
10 | 11.41 11.90 11.94 11.5 | 104% 105% 101%
60 | 15.04 15.74 15.82 15.5 | 105% 105% 103%
70 | 13.73 14.42 14.80 14.3 | 105% 108% 104%
100 | 13.65 14.27 14.59 14.1 | 105% 107% 104%
1000 | 15.07 15.78 16.27 15.6 | 105% 108% 104%
10000 | 23.15 24.96 25.57 24.1 | 108% 110% 104%
------------------------------------------------------------------
9 10 | 10.8 10.94 10.86 10.6 | 101% 101% 98%
100 | 15.83 16.35 16.72 16.1 | 103% 106% 102%
1000 | 19.04 19.70 20.34 19.6 | 103% 107% 103%
10000 | 21.5 23.37 24.18 22.5 | 109% 112% 105%
------------------------------------------------------------------
65 70 | 10.58 10.95 11.18 10.6 | 103% 106% 100%
100 | 13.29 14.28 14.73 14.1 | 107% 111% 106%
1000 | 14.62 15.49 15.99 15.2 | 106% 109% 104%
10000 | 22.98 24.78 25.55 24.3 | 108% 111% 106%
and here's "lifo"
keep snaps | v0 v1 v2 v3 | v1 v2 v3
==================================================================
0 1 | 10.73 10.74 11.06 10.4 | 100% 103% 97%
5 | 10.44 10.45 10.62 10.2 | 100% 102% 98%
10 | 11.82 11.84 12.17 11.6 | 100% 103% 98%
60 | 12.15 12.81 12.94 12.4 | 105% 107% 102%
70 | 12.84 13.61 13.95 13.4 | 106% 109% 104%
100 | 12.98 13.73 14.06 13.5 | 106% 108% 104%
1000 | 13.71 14.56 14.80 14.2 | 106% 108% 104%
10000 | 17.68 19.51 20.38 19.0 | 110% 115% 108%
------------------------------------------------------------------
9 10 | 10.96 10.92 11.02 10.6 | 100% 101% 97%
100 | 12.58 13.11 13.26 12.8 | 104% 105% 102%
1000 | 13.67 14.38 14.73 14.1 | 105% 108% 103%
10000 | 17.67 19.79 20.20 19.2 | 112% 114% 109%
------------------------------------------------------------------
65 70 | 10.58 10.85 10.78 11.1 | 103% 102% 105%
100 | 13.03 14.12 14.27 13.5 | 108% 110% 103%
1000 | 13.65 14.56 14.88 14.1 | 107% 109% 104%
10000 | 17.62 19.61 20.30 19.0 | 111% 115% 108%
I was wondering where the regression comes from, and it occurred to me
ResourceOwnerAddToHash() may be doing the checks in the wrong order. It
should be checking for an empty slot first, so I did that - that's v3.
There's still a regression, but it's about halved compared to v2, and
about equal to v1. So I tried doing the same tweak for v1, but that
didn't help much (if at all).
The results seem fairly stable, and the overall trend is clear. It'd be
great if there were no regressions, but considering how narrow this
microbenchmark is (and considering the benefits for practical COPY runs),
I'd say it's probably OK.
regards
--
Tomas Vondra
Attachments:
0001-resownerbench.patch (text/x-patch; charset=UTF-8)
From 670fd60c6c953286956a3a2d52cf0b532deaafbb Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Tue, 21 Oct 2025 09:51:23 +0300
Subject: [PATCH 1/3] resownerbench
---
contrib/Makefile | 1 +
contrib/meson.build | 2 +
contrib/resownerbench/Makefile | 17 ++
contrib/resownerbench/meson.build | 24 +++
contrib/resownerbench/resownerbench--1.0.sql | 7 +
contrib/resownerbench/resownerbench.c | 154 +++++++++++++++++++
contrib/resownerbench/resownerbench.control | 6 +
contrib/resownerbench/snaptest.sql | 41 +++++
8 files changed, 252 insertions(+)
create mode 100644 contrib/resownerbench/Makefile
create mode 100644 contrib/resownerbench/meson.build
create mode 100644 contrib/resownerbench/resownerbench--1.0.sql
create mode 100644 contrib/resownerbench/resownerbench.c
create mode 100644 contrib/resownerbench/resownerbench.control
create mode 100644 contrib/resownerbench/snaptest.sql
diff --git a/contrib/Makefile b/contrib/Makefile
index 2f0a88d3f77..f84b26d52cf 100644
--- a/contrib/Makefile
+++ b/contrib/Makefile
@@ -43,6 +43,7 @@ SUBDIRS = \
pg_visibility \
pg_walinspect \
postgres_fdw \
+ resownerbench \
seg \
spi \
tablefunc \
diff --git a/contrib/meson.build b/contrib/meson.build
index ed30ee7d639..9c34bb128b7 100644
--- a/contrib/meson.build
+++ b/contrib/meson.build
@@ -71,3 +71,5 @@ subdir('unaccent')
subdir('uuid-ossp')
subdir('vacuumlo')
subdir('xml2')
+
+subdir('resownerbench')
diff --git a/contrib/resownerbench/Makefile b/contrib/resownerbench/Makefile
new file mode 100644
index 00000000000..9b0e1cfee1a
--- /dev/null
+++ b/contrib/resownerbench/Makefile
@@ -0,0 +1,17 @@
+MODULE_big = resownerbench
+OBJS = resownerbench.o
+
+EXTENSION = resownerbench
+DATA = resownerbench--1.0.sql
+PGFILEDESC = "resownerbench - benchmark for ResourceOwners"
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/resownerbench
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/contrib/resownerbench/meson.build b/contrib/resownerbench/meson.build
new file mode 100644
index 00000000000..30051c3a375
--- /dev/null
+++ b/contrib/resownerbench/meson.build
@@ -0,0 +1,24 @@
+# Copyright (c) 2022-2025, PostgreSQL Global Development Group
+
+resownerbench_sources = files(
+ 'resownerbench.c',
+)
+
+resownerbench = shared_module('resownerbench',
+ resownerbench_sources,
+ c_pch: pch_postgres_h,
+ kwargs: contrib_mod_args,
+)
+contrib_targets += resownerbench
+
+install_data(
+ 'resownerbench.control',
+ 'resownerbench--1.0.sql',
+ kwargs: contrib_data_args,
+)
+
+tests += {
+ 'name': 'resownerbench',
+ 'sd': meson.current_source_dir(),
+ 'bd': meson.current_build_dir(),
+}
diff --git a/contrib/resownerbench/resownerbench--1.0.sql b/contrib/resownerbench/resownerbench--1.0.sql
new file mode 100644
index 00000000000..8521caed2d8
--- /dev/null
+++ b/contrib/resownerbench/resownerbench--1.0.sql
@@ -0,0 +1,7 @@
+CREATE OR REPLACE FUNCTION snapshotbench_fifo(numkeep int, numsnaps int, numiters int) RETURNS double precision
+AS 'resownerbench', 'snapshotbench_fifo'
+LANGUAGE C;
+
+CREATE OR REPLACE FUNCTION snapshotbench_lifo(numkeep int, numsnaps int, numiters int) RETURNS double precision
+AS 'resownerbench', 'snapshotbench_lifo'
+LANGUAGE C;
diff --git a/contrib/resownerbench/resownerbench.c b/contrib/resownerbench/resownerbench.c
new file mode 100644
index 00000000000..80f98c2105b
--- /dev/null
+++ b/contrib/resownerbench/resownerbench.c
@@ -0,0 +1,154 @@
+#include "postgres.h"
+
+#include "catalog/pg_type.h"
+#include "catalog/pg_statistic.h"
+#include "executor/spi.h"
+#include "funcapi.h"
+#include "libpq/pqsignal.h"
+#include "utils/catcache.h"
+#include "utils/syscache.h"
+#include "utils/timestamp.h"
+#include "utils/snapmgr.h"
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(snapshotbench_lifo);
+PG_FUNCTION_INFO_V1(snapshotbench_fifo);
+
+/*
+ * ResourceOwner Performance test, using RegisterSnapshot().
+ *
+ * This takes three parameters: numkeep, numsnaps, numiters.
+ *
+ * First, we register 'numkeep' snapshots. They are kept registered
+ * until the end of the test. Then, we repeatedly register and
+ * unregister 'numsnaps - numkeep' additional snapshots, repeating
+ * 'numiters' times. All the register/unregister calls are made in
+ * LIFO order.
+ *
+ * Returns the time spent, in milliseconds.
+ *
+ * The idea is to test the performance of ResourceOwnerRemember()
+ * and ResourceOwnerForget() operations, under different regimes.
+ *
+ * In the old implementation, if 'numsnaps' is small enough, all
+ * the entries fit in the resource owner's small array (it can
+ * hold 64 entries).
+ *
+ * In the new implementation, the array is much smaller, only 8
+ * entries, but it's used together with the hash table so that
+ * we stay in the "array regime" as long as 'numsnaps - numkeep'
+ * is smaller than 8 entries.
+ *
+ * 'numiters' can be adjusted to make the overall runtime
+ * suitably long.
+ */
+Datum
+snapshotbench_lifo(PG_FUNCTION_ARGS)
+{
+ int numkeep = PG_GETARG_INT32(0);
+ int numsnaps = PG_GETARG_INT32(1);
+ int numiters = PG_GETARG_INT32(2);
+ int i;
+ instr_time start,
+ duration;
+ Snapshot lsnap;
+ Snapshot *rs;
+ int numregistered = 0;
+
+ rs = palloc(Max(numsnaps, numkeep) * sizeof(Snapshot));
+
+ lsnap = GetLatestSnapshot();
+
+ sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+ INSTR_TIME_SET_CURRENT(start);
+
+ while (numregistered < numkeep)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ for (i = 0; i < numiters; i++)
+ {
+ while (numregistered < numsnaps)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ while (numregistered > numkeep)
+ {
+ numregistered--;
+ UnregisterSnapshot(rs[numregistered]);
+ }
+ }
+
+ while (numregistered > 0)
+ {
+ numregistered--;
+ UnregisterSnapshot(rs[numregistered]);
+ }
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ PG_RETURN_FLOAT8(INSTR_TIME_GET_MILLISEC(duration));
+}
+
+
+/*
+ * Same, but do the register/unregister operations in
+ * FIFO order.
+ */
+Datum
+snapshotbench_fifo(PG_FUNCTION_ARGS)
+{
+ int numkeep = PG_GETARG_INT32(0);
+ int numsnaps = PG_GETARG_INT32(1);
+ int numiters = PG_GETARG_INT32(2);
+ int i,
+ j;
+ instr_time start,
+ duration;
+ Snapshot lsnap;
+ Snapshot *rs;
+ int numregistered = 0;
+
+ rs = palloc(Max(numsnaps, numkeep) * sizeof(Snapshot));
+
+ lsnap = GetLatestSnapshot();
+
+ sigprocmask(SIG_SETMASK, &BlockSig, NULL);
+ INSTR_TIME_SET_CURRENT(start);
+
+ while (numregistered < numkeep)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ for (i = 0; i < numiters; i++)
+ {
+ while (numregistered < numsnaps)
+ {
+ rs[numregistered] = RegisterSnapshot(lsnap);
+ numregistered++;
+ }
+
+ for (j = numkeep; j < numregistered; j++)
+ UnregisterSnapshot(rs[j]);
+ numregistered = numkeep;
+ }
+
+ for (j = 0; j < numregistered; j++)
+ UnregisterSnapshot(rs[j]);
+ numregistered = 0;
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ sigprocmask(SIG_SETMASK, &UnBlockSig, NULL);
+
+ PG_RETURN_FLOAT8(INSTR_TIME_GET_MILLISEC(duration));
+}
diff --git a/contrib/resownerbench/resownerbench.control b/contrib/resownerbench/resownerbench.control
new file mode 100644
index 00000000000..ada88b8eed8
--- /dev/null
+++ b/contrib/resownerbench/resownerbench.control
@@ -0,0 +1,6 @@
+# resownerbench
+
+comment = 'benchmark for ResourceOwners'
+default_version = '1.0'
+module_pathname = '$libdir/resownerbench'
+relocatable = true
diff --git a/contrib/resownerbench/snaptest.sql b/contrib/resownerbench/snaptest.sql
new file mode 100644
index 00000000000..a8fd4da2f52
--- /dev/null
+++ b/contrib/resownerbench/snaptest.sql
@@ -0,0 +1,41 @@
+--
+-- Performance test RegisterSnapshot/UnregisterSnapshot.
+--
+with params as materialized (
+select * from
+generate_series(1,10) s(run),
+(values (0, 1, 10000000 * 10),
+ (0, 5, 2000000 * 10),
+ (0, 10, 1000000 * 10),
+ (0, 60, 100000 * 10),
+ (0, 70, 100000 * 10),
+ (0, 100, 100000 * 10),
+ (0, 1000, 10000 * 10),
+ (0, 10000, 1000 * 10),
+
+-- These tests keep 9 snapshots registered across the iterations. That
+-- exceeds the size of the little array in the patch, so this exercises
+-- the hash lookups. Without the patch, these still fit in the array
+-- (it's 64 entries without the patch)
+ (9, 10, 10000000 * 10),
+ (9, 100, 100000 * 10),
+ (9, 1000, 10000 * 10),
+ (9, 10000, 1000 * 10),
+
+-- These exceed the 64 entry array even without the patch, so these fall
+-- in the hash table regime with and without the patch.
+ (65, 70, 1000000 * 10),
+ (65, 100, 100000 * 10),
+ (65, 1000, 10000 * 10),
+ (65, 10000, 1000 * 10)
+) AS params (numkeep, numsnaps, numiters)
+)
+select run, numkeep, numsnaps,
+ -- numiters,
+ -- round(lifo_time_ms) as lifo_total_time_ms,
+ -- round(fifo_time_ms) as fifo_total_time_ms,
+ round((lifo_time_ms::numeric / (numkeep + (numsnaps - numkeep) * numiters)) * 1000000, 1) as lifo_time_ns,
+ round((fifo_time_ms::numeric / (numkeep + (numsnaps - numkeep) * numiters)) * 1000000, 1) as fifo_time_ns
+from params,
+lateral snapshotbench_lifo(numkeep, numsnaps, numiters) as lifo_time_ms,
+lateral snapshotbench_fifo(numkeep, numsnaps, numiters) as fifo_time_ms;
--
2.51.0
Attachment: 0002-v2-Deduplicate-entries-in-ResourceOwner.patch (text/x-patch)
From 83ff6c494312ea778172c446c3c4ca042dbb3d67 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Thu, 16 Oct 2025 13:27:25 +0200
Subject: [PATCH 2/3] v2: Deduplicate entries in ResourceOwner
We may track the same resource many times, but the current ResourceOwner
does not handle that efficiently. Each reference is a separate entry,
which leads to long linear searches in the simple hash table.
This adds a simple opportunistic deduplication. Entries get a counter,
tracking how many references they represent. It only applies to the
"hash" phase, not to the initial array. And when adding an entry to the
hash finds an empty slot first, it uses it (instead of continuing to
look for a possible match).
This means the deduplication is not perfect - there may be multiple
entries for the same value. But in practice it seems like a good trade
off. It won't affect cases with no duplicates, and it will help cases
with many of them.
The code also continues to grow the hash table as before. In some cases
it might be enough to deduplicate the contents. But it's hard to say in
advance, and it's not much cheaper as it still requires walking the
current hash table. So there's a risk of attempting deduplication, only
to find the hash still needs to be enlarged, doubling the cost. So we
just grow the hash table. In the worst case the memory usage is not
worse than without deduplication. If deduplication is effective, the
hash table grows later or not at all.
---
src/backend/utils/resowner/resowner.c | 215 ++++++++++++++++++++------
1 file changed, 166 insertions(+), 49 deletions(-)
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index fca84ded6dd..73de45baccd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -15,6 +15,12 @@
* a particular reference, you need to search both the array and the hash
* table.
*
+ * The hash table deduplicates entries in an opportunistic way. If it finds
+ * an entry for the same resource (same kind), it increments a counter instead.
+ * But if it finds an empty slot first, it uses the slot. If the table gets
+ * enlarged, those entries will be merged, but it's possible to have multiple
+ * entries for the same resource. The fixed-size array does not deduplicate.
+ *
* The most frequent usage is that a resource is remembered, and forgotten
* shortly thereafter. For example, pin a buffer, read one tuple from it,
* release the pin. Linearly scanning the small array handles that case
@@ -67,6 +73,18 @@ typedef struct ResourceElem
const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
} ResourceElem;
+/*
+ * ResourceHashElem represents a reference associated with a resource owner,
+ * stored in the hash table. Similar to ResourceElem, but has a counter field
+ * tracking how many references it represents, to allow deduplication.
+ */
+typedef struct ResourceHashElem
+{
+ Datum item;
+ uint32 count; /* number of references */
+ const ResourceOwnerDesc *kind; /* NULL indicates a free hash table slot */
+} ResourceHashElem;
+
/*
* Size of the fixed-size array to hold most-recently remembered resources.
*/
@@ -151,7 +169,7 @@ struct ResourceOwnerData
* release priority. The first 'nhash' elements are occupied, the rest
* are empty.
*/
- ResourceElem *hash;
+ ResourceHashElem *hash;
uint32 capacity; /* allocated length of hash[] */
uint32 grow_at; /* grow hash when reach this */
@@ -198,7 +216,8 @@ static ResourceReleaseCallbackItem *ResourceRelease_callbacks = NULL;
/* Internal routines */
static inline uint32 hash_resource_elem(Datum value, const ResourceOwnerDesc *kind);
static void ResourceOwnerAddToHash(ResourceOwner owner, Datum value,
- const ResourceOwnerDesc *kind);
+ const ResourceOwnerDesc *kind,
+ uint32 count);
static int resource_priority_cmp(const void *a, const void *b);
static void ResourceOwnerSort(ResourceOwner owner);
static void ResourceOwnerReleaseAll(ResourceOwner owner,
@@ -239,7 +258,8 @@ hash_resource_elem(Datum value, const ResourceOwnerDesc *kind)
* Adds 'value' of given 'kind' to the ResourceOwner's hash table
*/
static void
-ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
+ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind,
+ uint32 count)
{
uint32 mask = owner->capacity - 1;
uint32 idx;
@@ -250,12 +270,25 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ /*
+ * Increment counter for a matching entry or use an empty slot,
+ * whichever we find first.
+ */
+
+ if ((owner->hash[idx].kind == kind) &&
+ (owner->hash[idx].item == value))
+ {
+ owner->hash[idx].count += count;
+ return;
+ }
+
if (owner->hash[idx].kind == NULL)
break; /* found a free slot */
idx = (idx + 1) & mask;
}
owner->hash[idx].item = value;
owner->hash[idx].kind = kind;
+ owner->hash[idx].count = count;
owner->nhash++;
}
@@ -277,6 +310,24 @@ resource_priority_cmp(const void *a, const void *b)
return 1;
}
+/*
+ * Comparison function to sort by release phase and priority
+ */
+static int
+resource_priority_hash_cmp(const void *a, const void *b)
+{
+ const ResourceHashElem *ra = (const ResourceHashElem *) a;
+ const ResourceHashElem *rb = (const ResourceHashElem *) b;
+
+ /* Note: reverse order */
+ if (ra->kind->release_phase == rb->kind->release_phase)
+ return pg_cmp_u32(rb->kind->release_priority, ra->kind->release_priority);
+ else if (ra->kind->release_phase > rb->kind->release_phase)
+ return -1;
+ else
+ return 1;
+}
+
/*
* Sort resources in reverse release priority.
*
@@ -288,16 +339,21 @@ resource_priority_cmp(const void *a, const void *b)
static void
ResourceOwnerSort(ResourceOwner owner)
{
- ResourceElem *items;
- uint32 nitems;
-
if (owner->nhash == 0)
{
+ ResourceElem *items;
+ uint32 nitems;
+
items = owner->arr;
nitems = owner->narr;
+
+ qsort(items, nitems, sizeof(ResourceElem), resource_priority_cmp);
}
else
{
+ ResourceHashElem *items;
+ uint32 nitems;
+
/*
* Compact the hash table, so that all the elements are in the
* beginning of the 'hash' array, with no empty elements.
@@ -324,7 +380,9 @@ ResourceOwnerSort(ResourceOwner owner)
Assert(dst + owner->narr <= owner->capacity);
for (int idx = 0; idx < owner->narr; idx++)
{
- owner->hash[dst] = owner->arr[idx];
+ owner->hash[dst].item = owner->arr[idx].item;
+ owner->hash[dst].kind = owner->arr[idx].kind;
+ owner->hash[dst].count = 1;
dst++;
}
Assert(dst == owner->nhash + owner->narr);
@@ -333,9 +391,9 @@ ResourceOwnerSort(ResourceOwner owner)
items = owner->hash;
nitems = owner->nhash;
- }
- qsort(items, nitems, sizeof(ResourceElem), resource_priority_cmp);
+ qsort(items, nitems, sizeof(ResourceHashElem), resource_priority_hash_cmp);
+ }
}
/*
@@ -345,9 +403,6 @@ static void
ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
bool printLeakWarnings)
{
- ResourceElem *items;
- uint32 nitems;
-
/*
* ResourceOwnerSort must've been called already. All the resources are
* either in the array or the hash.
@@ -356,49 +411,93 @@ ResourceOwnerReleaseAll(ResourceOwner owner, ResourceReleasePhase phase,
Assert(owner->sorted);
if (owner->nhash == 0)
{
+ ResourceElem *items;
+ uint32 nitems;
+
items = owner->arr;
nitems = owner->narr;
+
+ /*
+ * The resources are sorted in reverse priority order. Release them
+ * starting from the end, until we hit the end of the phase that we are
+ * releasing now. We will continue from there when called again for the
+ * next phase.
+ */
+ while (nitems > 0)
+ {
+ uint32 idx = nitems - 1;
+ Datum value = items[idx].item;
+ const ResourceOwnerDesc *kind = items[idx].kind;
+
+ if (kind->release_phase > phase)
+ break;
+ Assert(kind->release_phase == phase);
+
+ if (printLeakWarnings)
+ {
+ char *res_str;
+ res_str = kind->DebugPrint ?
+ kind->DebugPrint(value)
+ : psprintf("%s %p", kind->name, DatumGetPointer(value));
+ elog(WARNING, "resource was not closed: %s", res_str);
+ pfree(res_str);
+ }
+
+ kind->ReleaseResource(value);
+
+ nitems--;
+ }
+
+ owner->narr = nitems;
}
else
{
+ ResourceHashElem *items;
+ uint32 nitems;
+
Assert(owner->narr == 0);
items = owner->hash;
nitems = owner->nhash;
- }
- /*
- * The resources are sorted in reverse priority order. Release them
- * starting from the end, until we hit the end of the phase that we are
- * releasing now. We will continue from there when called again for the
- * next phase.
- */
- while (nitems > 0)
- {
- uint32 idx = nitems - 1;
- Datum value = items[idx].item;
- const ResourceOwnerDesc *kind = items[idx].kind;
+ /*
+ * The resources are sorted in reverse priority order. Release them
+ * starting from the end, until we hit the end of the phase that we are
+ * releasing now. We will continue from there when called again for the
+ * next phase.
+ */
+ while (nitems > 0)
+ {
+ uint32 idx = nitems - 1;
+ Datum value = items[idx].item;
+ const ResourceOwnerDesc *kind = items[idx].kind;
+ uint32 count = items[idx].count;
- if (kind->release_phase > phase)
- break;
- Assert(kind->release_phase == phase);
+ if (kind->release_phase > phase)
+ break;
+ Assert(kind->release_phase == phase);
- if (printLeakWarnings)
- {
- char *res_str;
+ if (printLeakWarnings)
+ {
+ char *res_str;
+ res_str = kind->DebugPrint ?
+ kind->DebugPrint(value)
+ : psprintf("%s %p", kind->name, DatumGetPointer(value));
+ elog(WARNING, "resource was not closed: %s", res_str);
+ pfree(res_str);
+ }
- res_str = kind->DebugPrint ?
- kind->DebugPrint(value)
- : psprintf("%s %p", kind->name, DatumGetPointer(value));
- elog(WARNING, "resource was not closed: %s", res_str);
- pfree(res_str);
+ /* invoke the release callback for each reference */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
+
+ nitems--;
}
- kind->ReleaseResource(value);
- nitems--;
- }
- if (owner->nhash == 0)
- owner->narr = nitems;
- else
+
owner->nhash = nitems;
+ }
}
@@ -466,16 +565,16 @@ ResourceOwnerEnlarge(ResourceOwner owner)
uint32 i,
oldcap,
newcap;
- ResourceElem *oldhash;
- ResourceElem *newhash;
+ ResourceHashElem *oldhash;
+ ResourceHashElem *newhash;
oldhash = owner->hash;
oldcap = owner->capacity;
/* Double the capacity (it must stay a power of 2!) */
newcap = (oldcap > 0) ? oldcap * 2 : RESOWNER_HASH_INIT_SIZE;
- newhash = (ResourceElem *) MemoryContextAllocZero(TopMemoryContext,
- newcap * sizeof(ResourceElem));
+ newhash = (ResourceHashElem *) MemoryContextAllocZero(TopMemoryContext,
+ newcap * sizeof(ResourceHashElem));
/*
* We assume we can't fail below this point, so OK to scribble on the
@@ -496,7 +595,8 @@ ResourceOwnerEnlarge(ResourceOwner owner)
for (i = 0; i < oldcap; i++)
{
if (oldhash[i].kind != NULL)
- ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind);
+ ResourceOwnerAddToHash(owner, oldhash[i].item, oldhash[i].kind,
+ oldhash[i].count);
}
/* And release old hash table. */
@@ -506,7 +606,7 @@ ResourceOwnerEnlarge(ResourceOwner owner)
/* Move items from the array to the hash */
for (int i = 0; i < owner->narr; i++)
- ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind);
+ ResourceOwnerAddToHash(owner, owner->arr[i].item, owner->arr[i].kind, 1);
owner->narr = 0;
Assert(owner->nhash <= owner->grow_at);
@@ -555,7 +655,8 @@ ResourceOwnerRemember(ResourceOwner owner, Datum value, const ResourceOwnerDesc
* Note: Forgetting a resource does not guarantee that there is room to
* remember a new resource. One exception is when you forget the most
* recently remembered resource; that does make room for a new remember call.
- * Some code callers rely on that exception.
+ * Some code callers rely on that exception. The deduplication does not
+ * change this, as it only applies to the hash table.
*/
void
ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *kind)
@@ -597,8 +698,17 @@ ResourceOwnerForget(ResourceOwner owner, Datum value, const ResourceOwnerDesc *k
if (owner->hash[idx].item == value &&
owner->hash[idx].kind == kind)
{
+ /* decrement the counter for an entry with multiple references */
+ if (owner->hash[idx].count > 1)
+ {
+ owner->hash[idx].count--;
+ return;
+ }
+
+ /* remove the entry entirely */
owner->hash[idx].item = (Datum) 0;
owner->hash[idx].kind = NULL;
+ owner->hash[idx].count = 0;
owner->nhash--;
#ifdef RESOWNER_STATS
@@ -847,12 +957,19 @@ ResourceOwnerReleaseAllOfKind(ResourceOwner owner, const ResourceOwnerDesc *kind
if (owner->hash[i].kind == kind)
{
Datum value = owner->hash[i].item;
+ uint32 count = owner->hash[i].count;
owner->hash[i].item = (Datum) 0;
owner->hash[i].kind = NULL;
+ owner->hash[i].count = 0;
owner->nhash--;
- kind->ReleaseResource(value);
+ /* invoke the release callback once for each reference */
+ while (count > 0)
+ {
+ kind->ReleaseResource(value);
+ count--;
+ }
}
}
owner->releasing = false;
--
2.51.0
Attachment: 0003-v3-Check-for-empty-slots-first.patch (text/x-patch)
From 98efe8f0a3ba4474bf21af6e7e727d529bb89d45 Mon Sep 17 00:00:00 2001
From: tomas <tomas>
Date: Tue, 21 Oct 2025 15:06:01 +0200
Subject: [PATCH 3/3] v3: Check for empty slots first
---
src/backend/utils/resowner/resowner.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 73de45baccd..1a04eec945a 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -270,11 +270,13 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
idx = hash_resource_elem(value, kind) & mask;
for (;;)
{
+ if (owner->hash[idx].kind == NULL)
+ break; /* found a free slot */
+
/*
* Increment counter for a matching entry or use an empty slot,
* whichever we find first.
*/
-
if ((owner->hash[idx].kind == kind) &&
(owner->hash[idx].item == value))
{
@@ -282,8 +284,6 @@ ResourceOwnerAddToHash(ResourceOwner owner, Datum value, const ResourceOwnerDesc
return;
}
- if (owner->hash[idx].kind == NULL)
- break; /* found a free slot */
idx = (idx + 1) & mask;
}
owner->hash[idx].item = value;
--
2.51.0
On 10/21/25 16:43, Tomas Vondra wrote:
...
The results seem fairly stable, and the overall trend is clear. It'd be
great if there were no regressions, but considering how narrow this
microbenchmark is (and considering the benefits for practical COPY runs),
I'd say it's probably OK.
The regression for non-pathological cases (like COPY) is bothering me,
even if it's a very narrow microbenchmark and I'd bet it would not be
measurable in practice. Still, it's annoying.
I wonder if maybe a better solution would be to invent a concept of
tuple slots for the same descriptor, and allow it to be tracked only
once in the resource owner. That'd mean no duplicates, completely
eliminating the pathological case.
regards
--
Tomas Vondra
On 10/27/25 16:14, Tomas Vondra wrote:
On 10/21/25 16:43, Tomas Vondra wrote:
...
The results seem fairly stable, and the overall trend is clear. It'd be
great if there were no regressions, but considering how narrow this
microbenchmark is (and considering the benefits for practical COPY runs),
I'd say it's probably OK.
The regression for non-pathological cases (like COPY) is bothering me,
even if it's a very narrow microbenchmark and I'd bet it would not be
measurable in practice. Still, it's annoying.
I wonder if maybe a better solution would be to invent a concept of
tuple slots for the same descriptor, and allow it to be tracked only
once in the resource owner. That'd mean no duplicates, completely
eliminating the pathological case.
Here's a PoC of this approach. It introduces a couple functions to
create/destroy a batch of slots, and uses that for COPY.
After allocating a batch with table_slot_batch_create(), the code can
get the next slot using table_slot_batch_next(). The descriptor is
pinned only once, in table_slot_batch_create().
It needs to add these to multiple "layers", but it's mostly mechanical.
It's just a PoC and needs more thought / cleanup. E.g. some of the
duplication could be eliminated (if we allowed MakeTupleTableSlot to
skip pinning the descriptor). Also, perhaps there should be a way to
protect against freeing the batched slots individually.
Anyway, the results are unsurprising - it has the same benefits as the
earlier patches, but doesn't affect the resownerbench at all.
regards
--
Tomas Vondra
Attachments:
0001-introduce-batch-of-slots.patch (text/x-patch)
From b9110bbad383ce32d4dfe1ad03c44138c0d53c09 Mon Sep 17 00:00:00 2001
From: Tomas Vondra <tomas@vondra.me>
Date: Mon, 27 Oct 2025 19:02:10 +0100
Subject: [PATCH] introduce batch of slots
---
src/backend/access/table/tableam.c | 36 +++++++++++++
src/backend/commands/copyfrom.c | 15 ++++--
src/backend/executor/execTuples.c | 84 ++++++++++++++++++++++++++++++
src/include/access/tableam.h | 3 ++
src/include/executor/tuptable.h | 16 ++++++
5 files changed, 150 insertions(+), 4 deletions(-)
diff --git a/src/backend/access/table/tableam.c b/src/backend/access/table/tableam.c
index 5e41404937e..a3e67940443 100644
--- a/src/backend/access/table/tableam.c
+++ b/src/backend/access/table/tableam.c
@@ -103,6 +103,42 @@ table_slot_create(Relation relation, List **reglist)
return slot;
}
+TupleTableSlotBatch *
+table_slot_batch_create(Relation relation, int maxslots)
+{
+ TupleTableSlotBatch *batch;
+
+ batch = palloc0(offsetof(TupleTableSlotBatch, ttsb_slots) +
+ maxslots * sizeof(TupleTableSlot *));
+
+ batch->ttsb_maxslots = maxslots;
+ batch->ttsb_nslots = 0;
+
+ /* const for optimization purposes, OK to modify at allocation time */
+ *((const TupleTableSlotOps **) &batch->ttsb_ops) = table_slot_callbacks(relation);
+ batch->ttsb_tupleDescriptor = RelationGetDescr(relation);
+
+ PinTupleDesc(batch->ttsb_tupleDescriptor);
+
+ return batch;
+}
+
+TupleTableSlot *
+table_slot_batch_next(TupleTableSlotBatch *batch)
+{
+ TupleTableSlot *slot;
+
+ if (batch->ttsb_nslots == batch->ttsb_maxslots)
+ elog(ERROR, "not enough slots in the batch");
+
+ slot = MakeSingleTupleTableSlotBatch(batch->ttsb_tupleDescriptor,
+ batch->ttsb_ops);
+
+ batch->ttsb_slots[batch->ttsb_nslots++] = slot;
+
+ return slot;
+}
+
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 12781963b4f..e435b56fbe1 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -77,6 +77,7 @@
/* Stores multi-insert data related to a single relation in CopyFrom. */
typedef struct CopyMultiInsertBuffer
{
+ TupleTableSlotBatch *batch;
TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
ResultRelInfo *resultRelInfo; /* ResultRelInfo for 'relid' */
BulkInsertState bistate; /* BulkInsertState for this rel if plain
@@ -369,6 +370,7 @@ CopyMultiInsertBufferInit(ResultRelInfo *rri)
buffer->resultRelInfo = rri;
buffer->bistate = (rri->ri_FdwRoutine == NULL) ? GetBulkInsertState() : NULL;
buffer->nused = 0;
+ buffer->batch = NULL;
return buffer;
}
@@ -621,7 +623,7 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
CopyMultiInsertBuffer *buffer)
{
ResultRelInfo *resultRelInfo = buffer->resultRelInfo;
- int i;
+ // int i;
/* Ensure buffer was flushed */
Assert(buffer->nused == 0);
@@ -638,8 +640,10 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
Assert(buffer->bistate == NULL);
/* Since we only create slots on demand, just drop the non-null ones. */
- for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
- ExecDropSingleTupleTableSlot(buffer->slots[i]);
+ // for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
+ // ExecDropSingleTupleTableSlot(buffer->slots[i]);
+ if (buffer->batch != NULL)
+ ExecDropTupleTableSlotBatch(buffer->batch);
if (resultRelInfo->ri_FdwRoutine == NULL)
table_finish_bulk_insert(resultRelInfo->ri_RelationDesc,
@@ -743,8 +747,11 @@ CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
nused = buffer->nused;
+ if (buffer->batch == NULL)
+ buffer->batch = table_slot_batch_create(rri->ri_RelationDesc, MAX_BUFFERED_TUPLES);
+
if (buffer->slots[nused] == NULL)
- buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL);
+ buffer->slots[nused] = table_slot_batch_next(buffer->batch);
return buffer->slots[nused];
}
diff --git a/src/backend/executor/execTuples.c b/src/backend/executor/execTuples.c
index 8e02d68824f..b1f9435ae51 100644
--- a/src/backend/executor/execTuples.c
+++ b/src/backend/executor/execTuples.c
@@ -1350,6 +1350,57 @@ MakeTupleTableSlot(TupleDesc tupleDesc,
return slot;
}
+TupleTableSlot *
+MakeTupleTableSlotBatch(TupleDesc tupleDesc,
+ const TupleTableSlotOps *tts_ops)
+{
+ Size basesz,
+ allocsz;
+ TupleTableSlot *slot;
+
+ basesz = tts_ops->base_slot_size;
+
+ /*
+ * When a fixed descriptor is specified, we can reduce overhead by
+ * allocating the entire slot in one go.
+ */
+ if (tupleDesc)
+ allocsz = MAXALIGN(basesz) +
+ MAXALIGN(tupleDesc->natts * sizeof(Datum)) +
+ MAXALIGN(tupleDesc->natts * sizeof(bool));
+ else
+ allocsz = basesz;
+
+ slot = palloc0(allocsz);
+ /* const for optimization purposes, OK to modify at allocation time */
+ *((const TupleTableSlotOps **) &slot->tts_ops) = tts_ops;
+ slot->type = T_TupleTableSlot;
+ slot->tts_flags |= TTS_FLAG_EMPTY;
+ if (tupleDesc != NULL)
+ slot->tts_flags |= TTS_FLAG_FIXED;
+ slot->tts_tupleDescriptor = tupleDesc;
+ slot->tts_mcxt = CurrentMemoryContext;
+ slot->tts_nvalid = 0;
+
+ if (tupleDesc != NULL)
+ {
+ slot->tts_values = (Datum *)
+ (((char *) slot)
+ + MAXALIGN(basesz));
+ slot->tts_isnull = (bool *)
+ (((char *) slot)
+ + MAXALIGN(basesz)
+ + MAXALIGN(tupleDesc->natts * sizeof(Datum)));
+ }
+
+ /*
+ * And allow slot type specific initialization.
+ */
+ slot->tts_ops->init(slot);
+
+ return slot;
+}
+
/* --------------------------------
* ExecAllocTableSlot
*
@@ -1458,6 +1509,39 @@ ExecDropSingleTupleTableSlot(TupleTableSlot *slot)
pfree(slot);
}
+TupleTableSlot *
+MakeSingleTupleTableSlotBatch(TupleDesc tupdesc,
+ const TupleTableSlotOps *tts_ops)
+{
+ TupleTableSlot *slot = MakeTupleTableSlotBatch(tupdesc, tts_ops);
+
+ return slot;
+}
+
+/* FIXME Does this need to match ExecResetTupleTable's processing of one slot? */
+void
+ExecDropTupleTableSlotBatch(TupleTableSlotBatch *batch)
+{
+ for (int i = 0; i < batch->ttsb_nslots; i++)
+ {
+ TupleTableSlot *slot = batch->ttsb_slots[i];
+
+ Assert(IsA(slot, TupleTableSlot));
+ ExecClearTuple(slot);
+ slot->tts_ops->release(slot);
+ if (!TTS_FIXED(slot))
+ {
+ if (slot->tts_values)
+ pfree(slot->tts_values);
+ if (slot->tts_isnull)
+ pfree(slot->tts_isnull);
+ }
+ pfree(slot);
+ }
+
+ ReleaseTupleDesc(batch->ttsb_tupleDescriptor);
+}
+
/* ----------------------------------------------------------------
* tuple table slot accessor functions
diff --git a/src/include/access/tableam.h b/src/include/access/tableam.h
index e16bf025692..b9500db1129 100644
--- a/src/include/access/tableam.h
+++ b/src/include/access/tableam.h
@@ -862,6 +862,9 @@ extern const TupleTableSlotOps *table_slot_callbacks(Relation relation);
*/
extern TupleTableSlot *table_slot_create(Relation relation, List **reglist);
+extern TupleTableSlotBatch *table_slot_batch_create(Relation relation, int maxslots);
+extern TupleTableSlot *table_slot_batch_next(TupleTableSlotBatch *batch);
+
/* ----------------------------------------------------------------------------
* Table scan functions.
diff --git a/src/include/executor/tuptable.h b/src/include/executor/tuptable.h
index 43f1d999b91..38a8e170c89 100644
--- a/src/include/executor/tuptable.h
+++ b/src/include/executor/tuptable.h
@@ -225,6 +225,17 @@ struct TupleTableSlotOps
MinimalTuple (*copy_minimal_tuple) (TupleTableSlot *slot, Size extra);
};
+/* a batch of tuple table slots */
+typedef struct TupleTableSlotBatch
+{
+ NodeTag type;
+ int ttsb_maxslots;
+ int ttsb_nslots;
+ const TupleTableSlotOps *const ttsb_ops; /* implementation of slot */
+ TupleDesc ttsb_tupleDescriptor; /* slot's tuple descriptor */
+ TupleTableSlot *ttsb_slots[FLEXIBLE_ARRAY_MEMBER];
+} TupleTableSlotBatch;
+
/*
* Predefined TupleTableSlotOps for various types of TupleTableSlotOps. The
* same are used to identify the type of a given slot.
@@ -347,6 +358,11 @@ extern void slot_getmissingattrs(TupleTableSlot *slot, int startAttNum,
int lastAttNum);
extern void slot_getsomeattrs_int(TupleTableSlot *slot, int attnum);
+extern TupleTableSlot *MakeTupleTableSlotBatch(TupleDesc tupleDesc,
+ const TupleTableSlotOps *tts_ops);
+extern TupleTableSlot *MakeSingleTupleTableSlotBatch(TupleDesc tupleDesc,
+ const TupleTableSlotOps *tts_ops);
+extern void ExecDropTupleTableSlotBatch(TupleTableSlotBatch *batch);
#ifndef FRONTEND
--
2.51.0