How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

Started by Kouhei Kaigai over 11 years ago · 12 messages
#1 Kouhei Kaigai
kaigai@ak.jp.nec.com
1 attachment(s)

Hello,

I recently ran into trouble while developing an extension that uses
shared buffers; the problem appears when it releases each buffer page.

This extension transfers the contents of shared buffers to a GPU device
using DMA, then kicks a device kernel.
Usually 8KB (= BLCKSZ) is too small as a unit of calculation, so the
extension pins multiple pages prior to the DMA transfer, then releases
them after the device kernel finishes.
For performance reasons, 16MB-64MB is a preferable data size per device
kernel execution; a DMA transfer of 16MB-64MB needs 2048-8192 pages
pinned simultaneously.

Once a backend or extension calls ReadBuffer(), resowner.c tracks which
buffers are referenced by the current resource owner, to ensure they are
released at the end of the transaction.
However, the implementation of resowner.c does not seem to assume that
many buffers are referenced by one resource owner simultaneously.
It keeps the buffer indexes in an expandable array and looks up the
target buffer by a sequential walk from the tail, because a recently
pinned buffer tends to be released first.
That caused trouble in my case: my extension pinned several thousand
buffers, so owner->buffers[] was enlarged and became expensive to walk.
In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
total while hash-joining 20M rows, even though the hash join itself
takes less than 16 seconds.

What is the best way to solve the problem?

Idea-1) Make ResourceOwnerForgetBuffer() O(1) instead of O(N^2) overall.
The source of the problem is the data structure in ResourceOwnerData,
so a straightforward fix is to replace the linear search with an O(1)
hash-based lookup.
An issue is how beneficial or harmful this is to the core code, not
only to my extension. It is probably "potentially" beneficial to the
core backend as well, but the effect is hard to observe right now
because typical workloads hold only a small number of buffers at a
time.

The attached patch applies the O(1) logic to ResourceOwnerForgetBuffer().
It reduces the time consumed from 36 sec to 0.09 sec when hash-joining
20M rows.

Idea-2) Track referenced shared buffers in the extension itself.
Another, less preferable, option is to call ResourceOwnerForgetBuffer()
just after ReadBuffer() on the extension side.
Once the resource owner forgets a buffer, the extension becomes
responsible for releasing it at the end of the transaction, even on
abort.
It also makes ReleaseBuffer() unusable, so the extension needs its own
duplicate of ReleaseBuffer() minus the ResourceOwnerForgetBuffer() call.
This idea has few advantages over Idea-1; the only one is that it avoids
changes to core PostgreSQL.
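
A rough sketch of what Idea-2 would look like on the extension side
(my_release_buffer() is only a placeholder here: as noted above, a real
version has to duplicate bufmgr.c's ReleaseBuffer() minus its
ResourceOwnerForgetBuffer() call, which needs bufmgr internals; and
my_release_all() must also be reached on abort, e.g. from a callback
registered with RegisterResourceReleaseCallback()):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"
    #include "utils/resowner.h"

    #define MY_MAX_PINNED   8192    /* hypothetical upper bound */

    static Buffer   my_pinned[MY_MAX_PINNED];
    static int      my_npinned = 0;

    /* Placeholder only: a real version duplicates ReleaseBuffer()
     * without its resource-owner bookkeeping. */
    static void
    my_release_buffer(Buffer buffer)
    {
        elog(ERROR, "sketch only, not implemented");
    }

    /* Pin a page, then make the resource owner forget it; from here
     * on, the extension itself is responsible for the pin. */
    static Buffer
    my_read_buffer(Relation rel, BlockNumber blkno)
    {
        Buffer  buffer = ReadBuffer(rel, blkno);

        ResourceOwnerForgetBuffer(CurrentResourceOwner, buffer);
        my_pinned[my_npinned++] = buffer;
        return buffer;
    }

    /* Must run on both the commit and the abort paths. */
    static void
    my_release_all(void)
    {
        while (my_npinned > 0)
            my_release_buffer(my_pinned[--my_npinned]);
    }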

Any comments?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

Attachments:

pgsql-v9.5-resowner-forget-buffer-o1.v1.patch (application/octet-stream)
 src/backend/utils/resowner/resowner.c | 76 +++++++++++++++++++++++------------
 1 file changed, 50 insertions(+), 26 deletions(-)

diff --git a/src/backend/utils/resowner/resowner.c b/src/backend/utils/resowner/resowner.c
index 0955bcc..ef6e6bd 100644
--- a/src/backend/utils/resowner/resowner.c
+++ b/src/backend/utils/resowner/resowner.c
@@ -229,6 +229,7 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 	ResourceOwner child;
 	ResourceOwner save;
 	ResourceReleaseCallbackItem *item;
+	int		i;
 
 	/* Recurse to handle descendants */
 	for (child = owner->firstchild; child != NULL; child = child->nextchild)
@@ -256,11 +257,13 @@ ResourceOwnerReleaseInternal(ResourceOwner owner,
 		 * We are careful to do the releasing back-to-front, so as to avoid
 		 * O(N^2) behavior in ResourceOwnerForgetBuffer().
 		 */
-		while (owner->nbuffers > 0)
+		for (i=0; owner->nbuffers > 0; i = (i + 1) % owner->maxbuffers)
 		{
+			if (!BufferIsValid(owner->buffers[i]))
+				continue;
 			if (isCommit)
-				PrintBufferLeakWarning(owner->buffers[owner->nbuffers - 1]);
-			ReleaseBuffer(owner->buffers[owner->nbuffers - 1]);
+				PrintBufferLeakWarning(owner->buffers[i]);
+			ReleaseBuffer(owner->buffers[i]);
 		}
 
 		/*
@@ -578,21 +581,38 @@ ResourceOwnerEnlargeBuffers(ResourceOwner owner)
 	int			newmax;
 
 	if (owner == NULL ||
-		owner->nbuffers < owner->maxbuffers)
+		owner->nbuffers < owner->maxbuffers / 2)
 		return;					/* nothing to do */
 
 	if (owner->buffers == NULL)
 	{
 		newmax = 16;
 		owner->buffers = (Buffer *)
-			MemoryContextAlloc(TopMemoryContext, newmax * sizeof(Buffer));
+			MemoryContextAllocZero(TopMemoryContext, newmax * sizeof(Buffer));
 		owner->maxbuffers = newmax;
 	}
 	else
 	{
+		Buffer *newbuf;
+		int		i, j;
+
 		newmax = owner->maxbuffers * 2;
-		owner->buffers = (Buffer *)
-			repalloc(owner->buffers, newmax * sizeof(Buffer));
+		newbuf = (Buffer *)
+			MemoryContextAllocZero(TopMemoryContext, newmax * sizeof(Buffer));
+
+		for (i=0; i < owner->maxbuffers; i++)
+		{
+			Buffer	buffer = owner->buffers[i];
+
+			if (!BufferIsValid(buffer))
+				continue;
+			j = hash_uint32((uint32)buffer) % newmax;
+			while (BufferIsValid(newbuf[j]))
+				j = (j + 1) % newmax;
+			newbuf[j] = buffer;
+		}
+		pfree(owner->buffers);
+		owner->buffers = newbuf;
 		owner->maxbuffers = newmax;
 	}
 }
@@ -610,9 +630,21 @@ ResourceOwnerRememberBuffer(ResourceOwner owner, Buffer buffer)
 {
 	if (owner != NULL)
 	{
+		int		start = hash_uint32((uint32)buffer) % owner->maxbuffers;
+		int		curr = start;
+
 		Assert(owner->nbuffers < owner->maxbuffers);
-		owner->buffers[owner->nbuffers] = buffer;
-		owner->nbuffers++;
+		do {
+			if (!BufferIsValid(owner->buffers[curr]))
+			{
+				owner->buffers[curr] = buffer;
+				owner->nbuffers++;
+				return;
+			}
+			curr = (curr + 1) % owner->maxbuffers;
+		} while (curr != start);
+		Assert(false);
+		elog(ERROR, "Couldn't find a slot to remember buffer");
 	}
 }
 
@@ -628,27 +660,19 @@ ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
 	if (owner != NULL)
 	{
 		Buffer	   *buffers = owner->buffers;
-		int			nb1 = owner->nbuffers - 1;
-		int			i;
+		int			start = hash_uint32((uint32)buffer) % owner->maxbuffers;
+		int			curr = start;
 
-		/*
-		 * Scan back-to-front because it's more likely we are releasing a
-		 * recently pinned buffer.  This isn't always the case of course, but
-		 * it's the way to bet.
-		 */
-		for (i = nb1; i >= 0; i--)
-		{
-			if (buffers[i] == buffer)
+		do {
+			if (buffers[curr] == buffer)
 			{
-				while (i < nb1)
-				{
-					buffers[i] = buffers[i + 1];
-					i++;
-				}
-				owner->nbuffers = nb1;
+				buffers[curr] = InvalidBuffer;
+				owner->nbuffers--;
 				return;
 			}
-		}
+			curr = (curr + 1) % owner->maxbuffers;
+		} while (curr != start);
+		Assert(false);
 		elog(ERROR, "buffer %d is not owned by resource owner %s",
 			 buffer, owner->name);
 	}
#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Kouhei Kaigai (#1)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:

> Idea-1) Make ResourceOwnerForgetBuffer() O(1) instead of O(N^2) overall.
> The source of the problem is the data structure in ResourceOwnerData,
> so a straightforward fix is to replace the linear search with an O(1)
> hash-based lookup.

I will bet that this is a dead loss for all normal usage patterns,
because queries seldom have more than a few buffers pinned. More
than that: I do not think we should encourage coding patterns that
pin lots of buffers. There is no way that holding pins on thousands
of buffers to do one operation is a sane design.

regards, tom lane


#3 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Tom Lane (#2)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

> Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:
>> Idea-1) Make ResourceOwnerForgetBuffer() O(1) instead of O(N^2) overall.
>> The source of the problem is the data structure in ResourceOwnerData,
>> so a straightforward fix is to replace the linear search with an O(1)
>> hash-based lookup.
>
> I will bet that this is a dead loss for all normal usage patterns, because
> queries seldom have more than a few buffers pinned. More than that: I do
> not think we should encourage coding patterns that pin lots of buffers.
> There is no way that holding pins on thousands of buffers to do one
> operation is a sane design.

Yep, that's my pain. Even though a usual query does not keep many
buffers pinned, my use case needs to fetch megabytes of data at once
for performance reasons; a page-by-page synchronous scan leaves the GPU
idle.
That is also why I was concerned about whether the O(1) logic is
harmless to the existing (major portion of) workloads.

BTW, what is the reason we assume only a small number of pages are
pinned at the same time? Does it come from a hardware assumption of
much less RAM capacity than the data set being processed?
Once we consider hardware with more than several dozen GB of RAM, which
can keep the whole data set in memory, it may lead to different results.

At least, it seems valuable to me to support both styles of buffer
usage, as long as it is harmless to existing workloads.
I'd like to dig into the deeper reason for this.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#4 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Kouhei Kaigai (#1)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:

> Hello,
>
> I recently ran into trouble while developing an extension that uses
> shared buffers; the problem appears when it releases each buffer page.
>
> This extension transfers the contents of shared buffers to a GPU device
> using DMA, then kicks a device kernel.

Wow, that sounds crazy.

> Once a backend or extension calls ReadBuffer(), resowner.c tracks which
> buffers are referenced by the current resource owner, to ensure they are
> released at the end of the transaction.
> However, the implementation of resowner.c does not seem to assume that
> many buffers are referenced by one resource owner simultaneously.
> It keeps the buffer indexes in an expandable array and looks up the
> target buffer by a sequential walk from the tail, because a recently
> pinned buffer tends to be released first.
> That caused trouble in my case: my extension pinned several thousand
> buffers, so owner->buffers[] was enlarged and became expensive to walk.
> In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
> total while hash-joining 20M rows, even though the hash join itself
> takes less than 16 seconds.
>
> What is the best way to solve the problem?

How about creating a separate ResourceOwner for these buffer pins, and
doing a wholesale ResourceOwnerRelease() on it when you're done?

- Heikki


#5 Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#4)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

On 2014-10-03 10:35:42 +0300, Heikki Linnakangas wrote:

> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>> Hello,
>>
>> I recently ran into trouble while developing an extension that uses
>> shared buffers; the problem appears when it releases each buffer page.
>>
>> This extension transfers the contents of shared buffers to a GPU device
>> using DMA, then kicks a device kernel.
>
> Wow, that sounds crazy.

Agreed. I doubt that pinning that many buffers is a sane thing to do. At
the very least you'll heavily interfere with vacuum and such.

>> Once a backend or extension calls ReadBuffer(), resowner.c tracks which
>> buffers are referenced by the current resource owner, to ensure they are
>> released at the end of the transaction.
>> However, the implementation of resowner.c does not seem to assume that
>> many buffers are referenced by one resource owner simultaneously.
>> It keeps the buffer indexes in an expandable array and looks up the
>> target buffer by a sequential walk from the tail, because a recently
>> pinned buffer tends to be released first.
>> That caused trouble in my case: my extension pinned several thousand
>> buffers, so owner->buffers[] was enlarged and became expensive to walk.
>> In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
>> total while hash-joining 20M rows, even though the hash join itself
>> takes less than 16 seconds.
>>
>> What is the best way to solve the problem?
>
> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

Or even just unpinning them in reverse order? That should already fix
the performance issues?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Heikki Linnakangas (#4)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>> Hello,
>>
>> I recently ran into trouble while developing an extension that uses
>> shared buffers; the problem appears when it releases each buffer page.
>>
>> This extension transfers the contents of shared buffers to a GPU device
>> using DMA, then kicks a device kernel.
>
> Wow, that sounds crazy.
>
>> Once a backend or extension calls ReadBuffer(), resowner.c tracks which
>> buffers are referenced by the current resource owner, to ensure they are
>> released at the end of the transaction.
>> However, the implementation of resowner.c does not seem to assume that
>> many buffers are referenced by one resource owner simultaneously.
>> It keeps the buffer indexes in an expandable array and looks up the
>> target buffer by a sequential walk from the tail, because a recently
>> pinned buffer tends to be released first.
>> That caused trouble in my case: my extension pinned several thousand
>> buffers, so owner->buffers[] was enlarged and became expensive to walk.
>> In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
>> total while hash-joining 20M rows, even though the hash join itself
>> takes less than 16 seconds.
>>
>> What is the best way to solve the problem?
>
> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

Let me clarify your idea.

1. Create a separate ResourceOwner under the CurrentResourceOwner.
2. Switch CurrentResourceOwner to the self-constructed resource owner
   while calling ReadBuffer() on thousands of buffers.
3. Switch back to the original CurrentResourceOwner.
4. Kick DMA and run the GPU device kernel (the actual job).
5. Switch CurrentResourceOwner to the self-constructed resource owner
   again, while calling ReleaseBuffer() in reverse order.
6. Switch back to the original CurrentResourceOwner, then call
   ResourceOwnerDelete().

Hmm... at the moment I don't see anything difficult to implement here.
I'd like to try this idea, then report it later.
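
A minimal sketch of those steps (run_gpu_kernel() is a hypothetical
placeholder; error handling omitted):

    #include "postgres.h"
    #include "storage/bufmgr.h"
    #include "utils/rel.h"
    #include "utils/resowner.h"

    /* hypothetical: DMA transfer plus device kernel execution */
    static void
    run_gpu_kernel(Buffer *buffers, int nbuffers)
    {
        (void) buffers;
        (void) nbuffers;
    }

    static void
    process_chunk(Relation rel, BlockNumber start, int npages)
    {
        ResourceOwner   chunk_owner;
        ResourceOwner   saved_owner = CurrentResourceOwner;
        Buffer         *buffers = palloc(sizeof(Buffer) * npages);
        int             i;

        /* 1. a separate ResourceOwner under the current one */
        chunk_owner = ResourceOwnerCreate(saved_owner, "gpu chunk");

        /* 2. pin the whole chunk under the chunk owner */
        CurrentResourceOwner = chunk_owner;
        for (i = 0; i < npages; i++)
            buffers[i] = ReadBuffer(rel, start + i);

        /* 3. switch back to the original owner */
        CurrentResourceOwner = saved_owner;

        /* 4. kick DMA and run the device kernel */
        run_gpu_kernel(buffers, npages);

        /* 5. release the pins in reverse order under the chunk owner */
        CurrentResourceOwner = chunk_owner;
        for (i = npages - 1; i >= 0; i--)
            ReleaseBuffer(buffers[i]);

        /* 6. switch back, then drop the now-empty chunk owner */
        CurrentResourceOwner = saved_owner;
        ResourceOwnerDelete(chunk_owner);
        pfree(buffers);
    }

One nice property: since the chunk owner is created as a child of the
saved owner, an error raised before step 5 still releases the pins
automatically when the transaction aborts.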

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#7 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Andres Freund (#5)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

> On 2014-10-03 10:35:42 +0300, Heikki Linnakangas wrote:
>> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>>> Hello,
>>>
>>> I recently ran into trouble while developing an extension that uses
>>> shared buffers; the problem appears when it releases each buffer page.
>>>
>>> This extension transfers the contents of shared buffers to a GPU device
>>> using DMA, then kicks a device kernel.
>>
>> Wow, that sounds crazy.
>
> Agreed. I doubt that pinning that many buffers is a sane thing to do. At
> the very least you'll heavily interfere with vacuum and such.

My assumption is that this extension is used for OLAP-type workloads,
which generate relatively little write traffic to the database.
Sorry, I forgot to mention that.

>>> Once a backend or extension calls ReadBuffer(), resowner.c tracks which
>>> buffers are referenced by the current resource owner, to ensure they are
>>> released at the end of the transaction.
>>> However, the implementation of resowner.c does not seem to assume that
>>> many buffers are referenced by one resource owner simultaneously.
>>> It keeps the buffer indexes in an expandable array and looks up the
>>> target buffer by a sequential walk from the tail, because a recently
>>> pinned buffer tends to be released first.
>>> That caused trouble in my case: my extension pinned several thousand
>>> buffers, so owner->buffers[] was enlarged and became expensive to walk.
>>> In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
>>> total while hash-joining 20M rows, even though the hash join itself
>>> takes less than 16 seconds.
>>>
>>> What is the best way to solve the problem?
>>
>> How about creating a separate ResourceOwner for these buffer pins, and
>> doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> Or even just unpinning them in reverse order? That should already fix
> the performance issues?

In the case where multiple chunks (note: a chunk contains thousands of
buffers as a unit of device kernel execution) run asynchronously, the
order of GPU job completion is not predictable.
So it does not help my situation if one resource owner tracks all the
buffers.

Probably, Heikki suggested creating a separate resource owner per chunk.
In that case, all the buffers in a particular chunk are released at the
same time, so ReleaseBuffer() in reverse order makes sense.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#8 james
james@mansionfamily.plus.com
In reply to: Kouhei Kaigai (#3)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

On 03/10/2014 05:53, Kouhei Kaigai wrote:

> Yep, that's my pain. Even though a usual query does not keep many
> buffers pinned, my use case needs to fetch megabytes of data at once
> for performance reasons; a page-by-page synchronous scan leaves the
> GPU idle.

Doesn't your GPU have an async queue and exec mechanism? Then you could
do an async DMA to the GPU with an event, use that event in the GPU to
start the kernel, and in the DB to release the pin?
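
Something like this pattern, I mean (a sketch with the CUDA runtime
API; my_kernel_launch() is a hypothetical wrapper around the actual
kernel launch, and host_buf must be page-locked, e.g. via
cudaHostRegister(), for the copy to be truly asynchronous):

    #include <cuda_runtime.h>

    /* hypothetical wrapper launching the device kernel on 'stream' */
    extern void my_kernel_launch(void *dev_buf, size_t nbytes,
                                 cudaStream_t stream);

    void
    queue_chunk(void *dev_buf, const void *host_buf, size_t nbytes)
    {
        cudaStream_t    stream;
        cudaEvent_t     copied;

        cudaStreamCreate(&stream);
        cudaEventCreate(&copied);

        /* queue the DMA asynchronously; record an event after it */
        cudaMemcpyAsync(dev_buf, host_buf, nbytes,
                        cudaMemcpyHostToDevice, stream);
        cudaEventRecord(copied, stream);

        /* queued on the same stream, the kernel starts only after
         * the copy completes -- no host-side wait needed */
        my_kernel_launch(dev_buf, nbytes, stream);

        /* once the copy event fires, the DB-side pins backing
         * host_buf could be dropped */
        cudaEventSynchronize(copied);
        /* ... ReleaseBuffer() the pinned pages here ... */
    }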


#9 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#4)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>> What is the best way to solve the problem?
>
> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

That's a thought. Another point is that if you can release the buffer
pins in reverse order of acquisition, the existing ResourceOwner code
works just fine.
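
That is (a sketch, assuming the pins were collected in buffers[] in
acquisition order):

    /* Newest-first release means ResourceOwnerForgetBuffer() always
     * finds its target at the tail of the owner's array, so each
     * release stays cheap even with thousands of pins. */
    for (i = nbuffers - 1; i >= 0; i--)
        ReleaseBuffer(buffers[i]);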

I have a larger question though: how is it useful to transfer
raw contents of shared buffers to a GPU in the first place?
Surely you're not going to be putting tasks like tuple visibility
verification into the GPU. So it seems like this whole thread is
based on a dubious architectural assumption.

regards, tom lane


#10 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: james (#8)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

> On 03/10/2014 05:53, Kouhei Kaigai wrote:
>> Yep, that's my pain. Even though a usual query does not keep many
>> buffers pinned, my use case needs to fetch megabytes of data at once
>> for performance reasons; a page-by-page synchronous scan leaves the
>> GPU idle.
>
> Doesn't your GPU have an async queue and exec mechanism? Then you could
> do an async DMA to the GPU with an event, use that event in the GPU to
> start the kernel, and in the DB to release the pin?

That is exactly what I'm doing now. The problem is that it needs to keep
many buffers (likely more than a few dozen thousand) queued so that the
GPU is never left waiting for data to process. Thus, I want to pin
relatively more buffers than usual query workloads do.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#11 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Tom Lane (#9)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>>> What is the best way to solve the problem?
>>
>> How about creating a separate ResourceOwner for these buffer pins, and
>> doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> That's a thought. Another point is that if you can release the buffer
> pins in reverse order of acquisition, the existing ResourceOwner code
> works just fine.

Yep, I'm trying that now.

> I have a larger question though: how is it useful to transfer raw
> contents of shared buffers to a GPU in the first place?
> Surely you're not going to be putting tasks like tuple visibility
> verification into the GPU. So it seems like this whole thread is based
> on a dubious architectural assumption.

In my implementation, checking tuple visibility is not a job for the
GPU: the backend also sends the GPU an array of visible item indexes,
in addition to the buffer contents themselves, and the GPU code picks
up only the visible tuples using that index.
OLAP workloads tend to have all-visible pages, so the cost of the
visibility check was not expensive.
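
A rough sketch of building that index array on the CPU side (assumes
the caller holds at least a share lock on the buffer, since the
visibility check may set hint bits; error handling omitted):

    #include "postgres.h"
    #include "access/htup_details.h"
    #include "storage/bufmgr.h"
    #include "storage/bufpage.h"
    #include "utils/snapshot.h"
    #include "utils/tqual.h"

    /* Collect the offsets of visible items on one page, to be sent
     * to the GPU alongside the raw block image. */
    static int
    collect_visible_items(Buffer buffer, Snapshot snapshot,
                          OffsetNumber *visible)
    {
        Page            page = BufferGetPage(buffer);
        OffsetNumber    off;
        OffsetNumber    maxoff = PageGetMaxOffsetNumber(page);
        int             nvisible = 0;

        for (off = FirstOffsetNumber; off <= maxoff; off++)
        {
            ItemId          itemid = PageGetItemId(page, off);
            HeapTupleData   tup;

            if (!ItemIdIsNormal(itemid))
                continue;
            tup.t_data = (HeapTupleHeader) PageGetItem(page, itemid);
            tup.t_len = ItemIdGetLength(itemid);
            tup.t_tableOid = InvalidOid;
            ItemPointerSet(&tup.t_self,
                           BufferGetBlockNumber(buffer), off);
            if (HeapTupleSatisfiesVisibility(&tup, snapshot, buffer))
                visible[nvisible++] = off;
        }
        return nvisible;
    }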

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


#12 Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Kouhei Kaigai (#6)
Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

>> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>>> Hello,
>>>
>>> I recently ran into trouble while developing an extension that uses
>>> shared buffers; the problem appears when it releases each buffer page.
>>>
>>> This extension transfers the contents of shared buffers to a GPU device
>>> using DMA, then kicks a device kernel.
>>
>> Wow, that sounds crazy.
>>
>>> Once a backend or extension calls ReadBuffer(), resowner.c tracks which
>>> buffers are referenced by the current resource owner, to ensure they are
>>> released at the end of the transaction.
>>> However, the implementation of resowner.c does not seem to assume that
>>> many buffers are referenced by one resource owner simultaneously.
>>> It keeps the buffer indexes in an expandable array and looks up the
>>> target buffer by a sequential walk from the tail, because a recently
>>> pinned buffer tends to be released first.
>>> That caused trouble in my case: my extension pinned several thousand
>>> buffers, so owner->buffers[] was enlarged and became expensive to walk.
>>> In my measurement, ResourceOwnerForgetBuffer() takes 36 seconds in
>>> total while hash-joining 20M rows, even though the hash join itself
>>> takes less than 16 seconds.
>>>
>>> What is the best way to solve the problem?
>>
>> How about creating a separate ResourceOwner for these buffer pins, and
>> doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> Let me clarify your idea.
>
> 1. Create a separate ResourceOwner under the CurrentResourceOwner.
> 2. Switch CurrentResourceOwner to the self-constructed resource owner
>    while calling ReadBuffer() on thousands of buffers.
> 3. Switch back to the original CurrentResourceOwner.
> 4. Kick DMA and run the GPU device kernel (the actual job).
> 5. Switch CurrentResourceOwner to the self-constructed resource owner
>    again, while calling ReleaseBuffer() in reverse order.
> 6. Switch back to the original CurrentResourceOwner, then call
>    ResourceOwnerDelete().
>
> Hmm... at the moment I don't see anything difficult to implement here.
> I'd like to try this idea, then report it later.

Let me share the result.

The above approach eliminated my problem. An individual resource owner
for each set of shared buffers, plus ReleaseBuffer() in reverse order,
reduced the time to release the pinned buffers from 36 sec to 67.3 msec
when I ran the hash join on 20M records; at most 59168 buffers were
pinned concurrently (32 asynchronous requests, each packing 1849
buffers).

Thanks for the suggestion!

postgres=# explain analyze select * from t0 natural join t1 natural join t2;
INFO: time to release buffers: 67.289ms
QUERY PLAN

-----------------------------------------------------------------------------------------------------------------------------------------------
Custom (GpuHashJoin) (cost=3468.00..471640.11 rows=19740924 width=139) (actual time=193.086..5913.459 rows=20000000 loops=1)
pseudo scan tlist: 1:(t0.bid), 2:(t0.aid), 3:<t0.id>, 4:<t0.cat>, 5:<t0.cid>, 6:<t0.did>, 7:<t0.x>, 8:<t0.y>, 9:<t0.z>, 11:<t2.btext>, 12:[t2.bid], 10:<t1.atext>, 13:[t1.aid]
hash clause 1: (t0.bid = t2.bid)
hash clause 2: (t0.aid = t1.aid)
-> Custom (GpuScan) on t0 (cost=500.00..467167.24 rows=20000024 width=73) (actual time=7.757..1056.426 rows=20000000 loops=1)
-> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37) (actual time=23.382..23.382 rows=40000 loops=1)
hash keys: bid
-> Seq Scan on t2 (cost=0.00..734.00 rows=40000 width=37) (actual time=0.007..5.124 rows=40000 loops=1)
-> Custom (MultiHash) (cost=734.00..734.00 rows=40000 width=37) (actual time=11.919..11.919 rows=40000 loops=1)
hash keys: aid
-> Seq Scan on t1 (cost=0.00..734.00 rows=40000 width=37) (actual time=0.010..5.299 rows=40000 loops=1)
Planning time: 0.904 ms
Execution time: 6667.986 ms
(13 rows)

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
