Asynchronous Append on postgres_fdw nodes.
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, an Append over multiple postgres_fdw nodes can run them
simultaneously, and that can greatly shorten the response time of the whole
query. For example, aggregations that can be pushed down to the remote
side would be accelerated by a factor of the number of remote servers. Even
short of such an extreme case, collecting tuples from multiple servers can
be accelerated by tens of percent [2].
I have suspended this work waiting for an asynchronous or push-up executor
to arrive, but the mood seems to be inclining toward doing this before that
comes [3].
The patchset consists of three parts.
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses a WaitEventSet, which needs to be released on
error. This patch makes it possible to register a WaitEventSet with a
resource owner to handle that case.
- v2-0002-infrastructure-for-asynchronous-execution.patch
It provides an abstraction layer for asynchronous behavior
(execAsync), and adds ExecAppendAsync, another version of ExecAppend
that handles "async-capable" subnodes asynchronously. It also
contains the planner part that makes the planner aware of "async-capable"
and "async-aware" path nodes.
- v2-0003-async-postgres_fdw.patch
The "async-capable" postgres_fdw. It accelerates the case of multiple
postgres_fdw nodes sharing a single connection, as well as
postgres_fdw nodes on dedicated connections.
regards.
[1]: /messages/by-id/2020012917585385831113@highgo.ca
[2]: /messages/by-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp
[3]: /messages/by-id/20191205181217.GA12895@momjian.us
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, an Append over multiple postgres_fdw nodes can run them
simultaneously, and that can greatly shorten the response time of the whole
query. For example, aggregations that can be pushed down to the remote
side would be accelerated by a factor of the number of remote servers. Even
short of such an extreme case, collecting tuples from multiple servers can
be accelerated by tens of percent [2].
I have suspended this work waiting for an asynchronous or push-up executor
to arrive, but the mood seems to be inclining toward doing this before that
comes [3].
The patchset consists of three parts.
Are these improvements targeted at PG13 or PG14? This seems to be a
pretty big change for the last CF of PG13.
Regards,
--
-David
david@pgmasters.net
At Wed, 4 Mar 2020 09:56:55 -0500, David Steele <david@pgmasters.net> wrote in
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, an Append over multiple postgres_fdw nodes can run them
simultaneously, and that can greatly shorten the response time of the whole
query. For example, aggregations that can be pushed down to the remote
side would be accelerated by a factor of the number of remote servers. Even
short of such an extreme case, collecting tuples from multiple servers can
be accelerated by tens of percent [2].
I have suspended this work waiting for an asynchronous or push-up executor
to arrive, but the mood seems to be inclining toward doing this before that
comes [3].
The patchset consists of three parts.
Are these improvements targeted at PG13 or PG14? This seems to be a
pretty big change for the last CF of PG13.
It is targeted at PG14. As we have the target version in the CF app now,
I marked it as targeting PG14.
Thank you for the suggestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses a WaitEventSet, which needs to be released on
error. This patch makes it possible to register a WaitEventSet with a
resource owner to handle that case.
+1
- v2-0002-infrastructure-for-asynchronous-execution.patch
It provides an abstraction layer for asynchronous behavior
(execAsync), and adds ExecAppendAsync, another version of ExecAppend
that handles "async-capable" subnodes asynchronously. It also
contains the planner part that makes the planner aware of "async-capable"
and "async-aware" path nodes.
This patch adds an infrastructure for asynchronous execution. As a PoC
this makes only Append capable of handling asynchronously executable
subnodes.
What other nodes do you think could be async-aware? I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
Cool. No extra cost for people not using the new feature.
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState. The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it. Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
...
+ if (refindsize < n)
...
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
This seems a bit strange. Why not just put the pointer in the plan
state? I suppose you want to avoid allocating a new buffer for every
query. Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.
+ wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+ FreeWaitEventSet(wes);
BTW, just as an FYI, I am proposing [1] (https://commitfest.postgresql.org/27/2452/) to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed. I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that. In that patch
set I added pqSocketChangeCount() so that you can see when PQsocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
My goal is to generate the minimum number of system calls. I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
wait. But I suppose there is an argument that it's more complication
than it's worth.
Thank you for the comment.
At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses a WaitEventSet, which needs to be released on
error. This patch makes it possible to register a WaitEventSet with a
resource owner to handle that case.
+1
- v2-0002-infrastructure-for-asynchronous-execution.patch
It provides an abstraction layer for asynchronous behavior
(execAsync), and adds ExecAppendAsync, another version of ExecAppend
that handles "async-capable" subnodes asynchronously. It also
contains the planner part that makes the planner aware of "async-capable"
and "async-aware" path nodes.
This patch adds an infrastructure for asynchronous execution. As a PoC
this makes only Append capable of handling asynchronously executable
subnodes.
What other nodes do you think could be async-aware? I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.
An Append node is fed by many immediate async-capable child nodes,
so the Append node can pick whichever child node has fired.
Unfortunately joins are not wide but deep. That means a set of
async-capable nodes has multiple async-aware (and, for intermediate
nodes, async-capable at the same time) parent nodes. So if we want to
be async in that configuration, we need a "push-up" executor engine. In
my last trial, ignoring performance, I could turn almost all nodes into
push-up style, but a few nodes like WindowAgg or HashJoin have quite
low affinity with the push-up style since the call sites into child nodes
are many and scattered. I got through the low affinity by turning those
nodes into state machines, but I don't think that is good.
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
Cool. No extra cost for people not using the new feature.
It creates some duplicate code, but I agree from the performance
perspective.
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState. The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it. Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?
Sounds reasonable and seems easy to do.
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
...
+ if (refindsize < n)
...
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
This seems a bit strange. Why not just put the pointer in the plan
state? I suppose you want to avoid allocating a new buffer for every
query. Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.
The most significant factor behind that shape is that ExecAsync is not a
kind of ExecNode, so ExecAsyncEventWait doesn't have direct access to an
EState other than through one of the given nodes. I'll consider trying to
use the given ExecNodes as an access path to the EState.
+ wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+ FreeWaitEventSet(wes);
BTW, just as an FYI, I am proposing [1] to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed. I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
Yes! I have wanted that (but haven't done it myself..., and I didn't
understand the details from the title "Reducing WaitEventSet syscall
churn" :p)
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that. In that patch
set I added pqSocketChangeCount() so that you can see when pgSocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
As for postgres_fdw, an unexpected disconnection immediately leads to a
query ERROR.
My goal is to generate the minimum number of system calls. I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
wait. But I suppose there is an argument that it's more complication
than it's worth.
I'm not sure how much performance gain it gives, but reducing syscalls
is good in itself. I'll look at it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
I have tested the feature, and it shows great performance for queries
whose result set is small compared with the amount of data scanned.
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I hit a strange issue when I exec 'make installcheck-world':
##########################################################
...
============== running regression test queries ==============
test adminpack ... FAILED 60 ms
======================
1 of 1 tests failed.
======================
The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs". A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################
And the content in 'contrib/adminpack/regression.out' is:
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
ERROR: absolute path not allowed
SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write
----------------
- 5
-(1 row)
-
+ERROR: reference to parent directory ("..") not allowed
SELECT pg_file_write(current_setting('data_directory') || '/../test_file4', 'test4', false);
ERROR: reference to parent directory ("..") not allowed
RESET ROLE;
@@ -149,7 +145,7 @@
SELECT pg_file_unlink('test_file4');
pg_file_unlink
----------------
- t
+ f
(1 row)
##########################################################
However the issue does not occur when I do a 'make check-world'.
And it doesn't occur when I test the 'make installcheck-world' without the patch.
The new status of this patch is: Waiting on Author
Hello. Thank you for testing.
At Tue, 10 Mar 2020 05:13:42 +0000, movead li <movead.li@highgo.ca> wrote in
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I hit a strange issue when I exec 'make installcheck-world':
I hadn't done that.. But it worked for me.
##########################################################
...
============== running regression test queries ==============
test adminpack ... FAILED 60 ms
======================
1 of 1 tests failed.
======================
The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs". A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################
And the content in 'contrib/adminpack/regression.out' is:
I don't see that file. Maybe regression.diffs?
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
ERROR: absolute path not allowed
SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write
----------------
- 5
-(1 row)
-
+ERROR: reference to parent directory ("..") not allowed
It seems to me that you are setting PGDATA to a path containing "..".
However the issue does not occur when I do a 'make check-world'.
And it doesn't occur when I test the 'make installcheck-world' without the patch.
check-world doesn't use a path containing ".." as PGDATA.
The new status of this patch is: Waiting on Author
Thanks for noticing that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
It seems to me that you are setting PGDATA to a path containing "..".
Thanks for pointing it out to me.
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing in the result was wrong. I think the patch is good in both feature
and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
Hi,
On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing in the result was wrong. I think the patch is good in both feature
and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
As discussed upthread, this is material for PG14, so I moved this to
the next commitfest, keeping the same status. I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though. I'd like to work on this for PG14 if I have
time.
Thanks!
Best regards,
Etsuro Fujita
On 3/30/20 1:15 PM, Etsuro Fujita wrote:
Hi,
On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing in the result was wrong. I think the patch is good in both feature
and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
As discussed upthread, this is material for PG14, so I moved this to
the next commitfest, keeping the same status. I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though. I'd like to work on this for PG14 if I have
time.
Hi,
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Hello, Andrey.
At Wed, 3 Jun 2020 15:00:06 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.
Sure. Thanks for noticing them, and sorry for the many typos.
An additional item in WaitEventIPC conflicted with this.
I found the following typos.
connection.c:
s/Rerturns/Returns/
postgres_fdw.c:
s/Retrive/Retrieve/
s/ForeginScanState/ForeignScanState/
s/manipuration/manipulation/
s/asyncstate/async state/
s/alrady/already/
nodeAppend.c:
s/Rery/Retry/
createplan.c:
s/chidlren/children/
resowner.c:
s/identier/identifier/ X 2
execnodes.h:
s/sutff/stuff/
plannodes.h:
s/asyncronous/asynchronous/
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.
I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in
attachment).
Something about the cost of an asynchronous plan:
In the simple query plan (see below) I see:
1. The startup cost of the local SeqScan is 0, and of a ForeignScan 100,
but the startup cost of the Append is 0.
2. The total cost of an Append node is the sum of its subplans. Maybe in
the case of asynchronous append we need to use some reduction factor?
explain select * from parts;
With Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
Async subplans: 3
-> Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
Without Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
-> Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Attachments:
changes.patch (text/x-patch; +3 -7)
Hello, Andrey.
At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using
move_to_next_waiter().
Done some minor cleanups.
I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in
attachment).
Thanks! The recent changes made them useless. Fixed.
Something about the cost of an asynchronous plan:
In the simple query plan (see below) I see:
1. The startup cost of the local SeqScan is 0, and of a ForeignScan 100,
but the startup cost of the Append is 0.
The result itself is right in that the Append doesn't wait for the foreign
scans in the first iteration and then fetches a tuple from the local
table. But the estimation is made just by accident. If you
defined a foreign table as the first partition, the cost of the Append
would be 100, which is rather wrong.
2. The total cost of an Append node is the sum of its subplans. Maybe in
the case of asynchronous append we need to use some reduction factor?
Yes. For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as long as any sync subpaths exist. If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I fixed cost_append so that it calculates the correct startup
cost. Now the function estimates as follows.
Append (Foreign(100), Foreign(100), Local(0)) => 0;
Append (Local(0), Foreign(100), Foreign(100)) => 0;
Append (Foreign(100), Foreign(100)) => 100;
explain select * from parts;
With Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
Async subplans: 3
-> Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
The SeqScan seems to be the first partition for the parent. It is the
first subnode at cost estimation. The result is right, but it comes
from wrong logic.
Without Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
-> Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
The startup cost of the Append is the cost of the first subnode, that is, 0.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/10/20 8:05 AM, Kyotaro Horiguchi wrote:
Hello, Andrey.
At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
2. The total cost of an Append node is the sum of its subplans. Maybe in
the case of asynchronous append we need to use some reduction factor?
Yes. For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as long as any sync subpaths exist. If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I mean that you could possibly change the computation of the total cost
of the async Append node. It may affect the planner's choice between a
ForeignScan (followed by executing the JOIN locally) and partitionwise
join strategies.
Have you also considered the possibility of a dynamic choice between
synchronous and asynchronous Append (during optimization)? This may be
useful for a query with a LIMIT clause.
--
Andrey Lepikhov
Postgres Professional
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial aggregates.
You can see an example in the attachment. I can't understand why: the
costs of the partitionwise join are less.
The initial script and explains of the query with and without the patch
are in the attachment.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
Thanks for testing, but..
At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial
aggregates. You can see an example in the attachment. I can't understand
why: the costs of the partitionwise join are less.
The initial script and explains of the query with and without the patch
are in the attachment.
I got more or less the same plan as the second one without the patch
(that is, on vanilla master/HEAD, but it used merge joins instead).
I'm not sure what prevented the join pushdown, but the difference between
the two is whether each partitionwise join is pushed down to the
remote side or not. That hardly seems related to the async execution
patch.
Could you tell me how you got the first plan?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/15/20 1:29 PM, Kyotaro Horiguchi wrote:
Thanks for testing, but..
At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial
aggregates. You can see an example in the attachment. I can't understand
why: the costs of the partitionwise join are less.
The initial script and explains of the query with and without the patch
are in the attachment.
I got more or less the same plan as the second one without the patch
(that is, on vanilla master/HEAD, but it used merge joins instead).
I'm not sure what prevented the join pushdown, but the difference between
the two is whether each partitionwise join is pushed down to the
remote side or not. That hardly seems related to the async execution
patch.
Could you tell me how you got the first plan?
1. Use a clean current vanilla master.
2. Start two instances with the script 'frgn2n.sh' from the attachment.
There I set the GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true
3. Execute query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;
That's all.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
Thanks.
My conclusion is that the async patch is not the cause of the behavior
change mentioned here.
At Mon, 15 Jun 2020 14:59:18 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
Could you tell me how you got the first plan?
1. Use a clean current vanilla master.
2. Start two instances with the script 'frgn2n.sh' from the attachment.
There I set the GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true
3. Execute the query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;
That's all.
With master/HEAD, I got the second (local join) plan at first for a while,
then got the first (remote join) plan. The cause of the plan change
turned out to be autovacuum on the remote node.
Before the vacuum the result of remote estimation was as follows.
Node2 (remote)
=# EXPLAIN SELECT r4.b FROM (public.part_1 r4 INNER JOIN public.second_1 r8 ON (((r4.a = r8.a)) AND ((r8.b < 100))));
QUERY PLAN
---------------------------------------------------------------------------
Merge Join (cost=2269.20..3689.70 rows=94449 width=4)
Merge Cond: (r8.a = r4.a)
-> Sort (cost=74.23..76.11 rows=753 width=4)
Sort Key: r8.a
-> Seq Scan on second_1 r8 (cost=0.00..38.25 rows=753 width=4)
Filter: (b < 100)
-> Sort (cost=2194.97..2257.68 rows=25086 width=8)
Sort Key: r4.a
-> Seq Scan on part_1 r4 (cost=0.00..361.86 rows=25086 width=8)
(9 rows)
After running a vacuum it changes as follows.
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=5.90..776.31 rows=9741 width=4)
Hash Cond: (r4.a = r8.a)
-> Seq Scan on part_1 r4 (cost=0.00..360.78 rows=24978 width=8)
-> Hash (cost=4.93..4.93 rows=78 width=4)
-> Seq Scan on second_1 r8 (cost=0.00..4.93 rows=78 width=4)
Filter: (b < 100)
(6 rows)
That changes the plan on the local side the way you saw. I saw
exactly the same behavior with the async execution patch.
regards.
FYI, the explain results for another plan changed as follows. It was
estimated to return 25839 rows, which is far fewer than 94449, so the
local join beat the remote join.
=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=2194.97..2257.68 rows=25086 width=8)
Sort Key: a
-> Seq Scan on part_1 (cost=0.00..361.86 rows=25086 width=8)
(3 rows)
=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=74.23..76.11 rows=753 width=4)
Sort Key: a
-> Seq Scan on second_1 (cost=0.00..38.25 rows=753 width=4)
Filter: (b < 100)
(4 rows)
They are changed to:
=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=2185.22..2247.66 rows=24978 width=8)
Sort Key: a
-> Seq Scan on part_1 (cost=0.00..360.78 rows=24978 width=8)
(3 rows)
horiguti=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
QUERY PLAN
---------------------------------------------------------------
Sort (cost=7.38..7.57 rows=78 width=4)
Sort Key: a
-> Seq Scan on second_1 (cost=0.00..4.93 rows=78 width=4)
Filter: (b < 100)
(4 rows)
They return 25056 rows, which is far more than 9741 rows, so the remote
join won.
Of course the number of returned rows is not the only factor in the cost
change, but it is the most significant one in this case.
--
Kyotaro Horiguchi
NTT Open Source Software Center