Asynchronous Append on postgres_fdw nodes.
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, an Append over multiple postgres_fdw nodes can run them
simultaneously, and that can greatly shorten the response time of the whole
query. For example, aggregations that can be pushed down to the remote
side would be accelerated by a factor of the number of remote servers. Even
short of such an extreme case, collecting tuples from multiple servers can
be accelerated by tens of percent [2].
I have suspended this work waiting for an asynchronous or push-up executor
to arrive, but the mood seems to be inclining toward doing this before that
comes [3].
The patchset consists of three parts.
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses a WaitEventSet, which needs to be released on
error. This patch makes it possible to register a WaitEventSet with a
resource owner to handle that case.
- v2-0002-infrastructure-for-asynchronous-execution.patch
It provides an abstraction layer for asynchronous behavior
(execAsync), and adds ExecAppendAsync, another version of ExecAppend
that handles "async-capable" subnodes asynchronously. It also
contains the planner part that makes the planner aware of "async-capable"
and "async-aware" path nodes.
- v2-0003-async-postgres_fdw.patch
The "async-capable" postgres_fdw. It accelerates the case of multiple
postgres_fdw nodes sharing a single connection, as well as
postgres_fdw nodes on dedicated connections.
regards.
[1]: /messages/by-id/2020012917585385831113@highgo.ca
[2]: /messages/by-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp
[3]: /messages/by-id/20191205181217.GA12895@momjian.us
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, an Append over multiple postgres_fdw nodes can run them
simultaneously, and that can greatly shorten the response time of the whole
query. For example, aggregations that can be pushed down to the remote
side would be accelerated by a factor of the number of remote servers. Even
short of such an extreme case, collecting tuples from multiple servers can
be accelerated by tens of percent [2].
I have suspended this work waiting for an asynchronous or push-up executor
to arrive, but the mood seems to be inclining toward doing this before that
comes [3].
The patchset consists of three parts.
Are these improvements targeted at PG13 or PG14? This seems to be a
pretty big change for the last CF of PG13.
Regards,
--
-David
david@pgmasters.net
At Wed, 4 Mar 2020 09:56:55 -0500, David Steele <david@pgmasters.net> wrote in
On 2/28/20 3:06 AM, Kyotaro Horiguchi wrote:
Hello, this is a follow-on of [1] and [2].
Currently the executor visits execution nodes one by one. Considering
sharding, an Append over multiple postgres_fdw nodes can run them
simultaneously, and that can greatly shorten the response time of the whole
query. For example, aggregations that can be pushed down to the remote
side would be accelerated by a factor of the number of remote servers. Even
short of such an extreme case, collecting tuples from multiple servers can
be accelerated by tens of percent [2].
I have suspended this work waiting for an asynchronous or push-up executor
to arrive, but the mood seems to be inclining toward doing this before that
comes [3].
The patchset consists of three parts.
Are these improvements targeted at PG13 or PG14? This seems to be a
pretty big change for the last CF of PG13.
It is targeted at PG14. As we have the target version in the CF app now,
I marked it as targeting PG14.
Thank you for the suggestion.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses a WaitEventSet, which needs to be released on
error. This patch makes it possible to register a WaitEventSet with a
resource owner to handle that case.
+1
- v2-0002-infrastructure-for-asynchronous-execution.patch
It provides an abstraction layer for asynchronous behavior
(execAsync), and adds ExecAppendAsync, another version of ExecAppend
that handles "async-capable" subnodes asynchronously. It also
contains the planner part that makes the planner aware of "async-capable"
and "async-aware" path nodes.
This patch adds an infrastructure for asynchronous execution. As a PoC
this makes only Append capable of handling asynchronously executable
subnodes.
What other nodes do you think could be async-aware? I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
Cool. No extra cost for people not using the new feature.
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState. The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it. Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
...
+ if (refindsize < n)
...
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
This seems a bit strange. Why not just put the pointer in the plan
state? I suppose you want to avoid allocating a new buffer for every
query. Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.
+ wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+ FreeWaitEventSet(wes);
BTW, just as an FYI, I am proposing [1] (https://commitfest.postgresql.org/27/2452/) to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed. I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that. In that patch
set I added pqSocketChangeCount() so that you can see when PQsocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
My goal is to generate the minimum number of system calls. I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
wait. But I suppose there is an argument that it's more complication
than it's worth.
Thank you for the comment.
At Thu, 5 Mar 2020 16:17:24 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
On Fri, Feb 28, 2020 at 9:08 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
- v2-0001-Allow-wait-event-set-to-be-regsitered-to-resoure.patch
The async feature uses a WaitEventSet, which needs to be released on
error. This patch makes it possible to register a WaitEventSet with a
resource owner to handle that case.
+1
- v2-0002-infrastructure-for-asynchronous-execution.patch
It provides an abstraction layer for asynchronous behavior
(execAsync), and adds ExecAppendAsync, another version of ExecAppend
that handles "async-capable" subnodes asynchronously. It also
contains the planner part that makes the planner aware of "async-capable"
and "async-aware" path nodes.
This patch adds an infrastructure for asynchronous execution. As a PoC
this makes only Append capable of handling asynchronously executable
subnodes.
What other nodes do you think could be async-aware? I suppose you
would teach joins to pass through the async support of their children,
and then you could make partition-wise join work like that.
An Append node is fed by many immediate async-capable child nodes,
so the Append node can pick whichever child node has fired.
Unfortunately joins are not wide but deep. That means a set of
async-capable nodes has multiple async-aware (and, for intermediate
nodes, async-capable at the same time) parent nodes. So if we want to
be async in that configuration, we need a "push-up" executor engine. In
my last trial, ignoring performance, I could turn almost all nodes into
push-up style, but a few nodes like WindowAgg or HashJoin have quite
low affinity with the push-up style since the call sites into child nodes
are many and scattered. I got through the low affinity by turning those
nodes into state machines, but I don't think that is good.
+ /* choose appropriate version of Exec function */
+ if (appendstate->as_nasyncplans == 0)
+ appendstate->ps.ExecProcNode = ExecAppend;
+ else
+ appendstate->ps.ExecProcNode = ExecAppendAsync;
Cool. No extra cost for people not using the new feature.
It creates some duplicate code, but I agree from the performance
perspective.
+ slot = ExecProcNode(subnode);
+ if (subnode->asyncstate == AS_AVAILABLE)
So, now when you execute a node, you get a result AND you get some
information that you access by reaching into the child node's
PlanState. The ExecProcNode() interface is extremely limiting, but
I'm not sure if this is the right way to extend it. Maybe
ExecAsyncProcNode() with a wide enough interface to do the job?
Sounds reasonable and seems easy to do.
+Bitmapset *
+ExecAsyncEventWait(PlanState **nodes, Bitmapset *waitnodes, long timeout)
+{
+ static int *refind = NULL;
+ static int refindsize = 0;
...
+ if (refindsize < n)
...
+ static ExecAsync_mcbarg mcb_arg =
+ { &refind, &refindsize };
+ static MemoryContextCallback mcb =
+ { ExecAsyncMemoryContextCallback, (void *)&mcb_arg, NULL };
...
+ MemoryContextRegisterResetCallback(TopTransactionContext, &mcb);
This seems a bit strange. Why not just put the pointer in the plan
state? I suppose you want to avoid allocating a new buffer for every
query. Perhaps you could fix that by having a small fixed-size buffer
in the PlanState to cover common cases and allocating a larger one in
a per-query memory context if that one is too small, or just not
worrying about it and allocating every time since you know the desired
size.
The most significant factor behind that shape is that ExecAsync is not a
kind of ExecNode, so ExecAsyncEventWait doesn't have direct access to an
EState other than through one of the given nodes. I'll consider trying to
use the given ExecNodes as an access path to the EState.
+ wes = CreateWaitEventSet(TopTransactionContext,
TopTransactionResourceOwner, n);
...
+ FreeWaitEventSet(wes);
BTW, just as an FYI, I am proposing [1] to add support for
RemoveWaitEvent(), so that you could have a single WaitEventSet for
the lifetime of the executor node, and then add and remove sockets
only as needed. I'm hoping to commit that for PG13, if there are no
objections or better ideas soon, because it's useful for some other
places where we currently create and destroy WaitEventSets frequently.
Yes! I have wanted that (but haven't done it myself..., and I didn't
understand the details from the title "Reducing WaitEventSet syscall
churn" :p)
One complication when working with long-lived WaitEventSet objects is
that libpq (or some other thing used by some other hypothetical
async-capable FDW) is free to close and reopen its socket whenever it
wants, so you need a way to know when it has done that. In that patch
set I added pqSocketChangeCount() so that you can see when pgSocket()
refers to a new socket (even if the file descriptor number is the same
by coincidence), but that imposes some book-keeping duties on the
caller, and now I'm wondering how that would look in your patch set.
As for postgres_fdw, an unexpected disconnection immediately leads to a
query ERROR.
My goal is to generate the minimum number of system calls. I think
it would be nice if a 1000-shard query only calls epoll_ctl() when a
child node needs to be added or removed from the set, not
epoll_create(), 1000 * epoll_ctl(), epoll_wait(), close() for every
wait. But I suppose there is an argument that it's more complication
than it's worth.
I'm not sure how much performance gain it gives, but reducing syscalls
is good in itself. I'll look at it.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
The following review has been posted through the commitfest application:
make installcheck-world: not tested
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
I have tested the feature, and it shows great performance for queries
whose result set is small compared with the amount of data scanned.
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I hit a strange issue when I exec 'make installcheck-world':
##########################################################
...
============== running regression test queries ==============
test adminpack ... FAILED 60 ms
======================
1 of 1 tests failed.
======================
The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs". A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################
And the content in 'contrib/adminpack/regression.out' is:
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
ERROR: absolute path not allowed
SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write
----------------
- 5
-(1 row)
-
+ERROR: reference to parent directory ("..") not allowed
SELECT pg_file_write(current_setting('data_directory') || '/../test_file4', 'test4', false);
ERROR: reference to parent directory ("..") not allowed
RESET ROLE;
@@ -149,7 +145,7 @@
SELECT pg_file_unlink('test_file4');
pg_file_unlink
----------------
- t
+ f
(1 row)
##########################################################
However the issue does not occur when I do a 'make check-world'.
And it doesn't occur when I test the 'make installcheck-world' without the patch.
The new status of this patch is: Waiting on Author
Hello. Thank you for testing.
At Tue, 10 Mar 2020 05:13:42 +0000, movead li <movead.li@highgo.ca> wrote in
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I hit a strange issue when I exec 'make installcheck-world':
I hadn't done that.. But it worked for me.
##########################################################
...
============== running regression test queries ==============
test adminpack ... FAILED 60 ms
======================
1 of 1 tests failed.
======================
The differences that caused some tests to fail can be viewed in the
file "/work/src/postgres_app_for/contrib/adminpack/regression.diffs". A copy of the test summary that you see
above is saved in the file "/work/src/postgres_app_for/contrib/adminpack/regression.out".
...
##########################################################
And the content in 'contrib/adminpack/regression.out' is:
I don't see that file. Maybe regression.diffs?
##########################################################
SELECT pg_file_write('/tmp/test_file0', 'test0', false);
ERROR: absolute path not allowed
SELECT pg_file_write(current_setting('data_directory') || '/test_file4', 'test4', false);
- pg_file_write
----------------
- 5
-(1 row)
-
+ERROR: reference to parent directory ("..") not allowed
It seems to me that you are setting PGDATA to a path containing "..".
However the issue does not occur when I do a 'make check-world'.
And it doesn't occur when I test the 'make installcheck-world' without the patch.
check-world doesn't use a path containing ".." as PGDATA.
The new status of this patch is: Waiting on Author
Thanks for noticing that.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
It seems to me that you are setting PGDATA to a path containing "..".
Thanks for pointing it out to me.
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing in the result was wrong. I think the patch is good in both feature
and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
Hi,
On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing in the result was wrong. I think the patch is good in both feature
and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
As discussed upthread, this is material for PG14, so I moved this to
the next commitfest, keeping the same status. I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though. I'd like to work on this for PG14 if I have
time.
Thanks!
Best regards,
Etsuro Fujita
On 3/30/20 1:15 PM, Etsuro Fujita wrote:
Hi,
On Wed, Mar 11, 2020 at 10:47 AM movead li <movead.li@highgo.ca> wrote:
I redid 'make installcheck-world' as Kyotaro Horiguchi pointed out, and
nothing in the result was wrong. I think the patch is good in both feature
and performance; here is the test result thread I made before:
/messages/by-id/CA+9bhCK7chd0qx+mny+U9xaOs2FDNJ7RaxG4=9rpgT6oAKBgWA@mail.gmail.com
The new status of this patch is: Ready for Committer
As discussed upthread, this is material for PG14, so I moved this to
the next commitfest, keeping the same status. I've not looked at the
patch in any detail yet, so I'm not sure that that is the right status
for the patch, though. I'd like to work on this for PG14 if I have
time.
Hi,
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Hello, Andrey.
At Wed, 3 Jun 2020 15:00:06 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
This patch no longer applies cleanly.
In addition, code comments contain spelling errors.
Sure. Thanks for noticing them, and sorry for the many typos.
An additional item in WaitEventIPC conflicted with this.
I found the following typos.
connection.c:
s/Rerturns/Returns/
postgres_fdw.c:
s/Retrive/Retrieve/
s/ForeginScanState/ForeignScanState/
s/manipuration/manipulation/
s/asyncstate/async state/
s/alrady/already/
nodeAppend.c:
s/Rery/Retry/
createplan.c:
s/chidlren/children/
resowner.c:
s/identier/identifier/ X 2
execnodes.h:
s/sutff/stuff/
plannodes.h:
s/asyncronous/asynchronous/
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using move_to_next_waiter().
Done some minor cleanups.
I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in
attachment).
Something about the cost of an asynchronous plan:
In the simple query plan (see below) I see:
1. The startup cost of the local SeqScan is 0, and of a ForeignScan 100,
but the startup cost of the Append is 0.
2. The total cost of an Append node is the sum of its subplans. Maybe in
the case of asynchronous append we need to use some reduction factor?
explain select * from parts;
With Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
Async subplans: 3
-> Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
Without Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
-> Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
The Russian Postgres Company
Attachments:
changes.patch (text/x-patch; +3 -7)
Hello, Andrey.
At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
Removed a useless variable PgFdwScanState.result_ready.
Removed duplicate code from remove_async_node() by using
move_to_next_waiter().
Done some minor cleanups.
I am reviewing your code.
A couple of variables are no longer needed (see changes.patch in
attachment).
Thanks! The recent changes made them useless. Fixed.
Something about the cost of an asynchronous plan:
In the simple query plan (see below) I see:
1. The startup cost of the local SeqScan is 0, and of a ForeignScan 100,
but the startup cost of the Append is 0.
The result itself is right in that the Append doesn't wait for the foreign
scans in the first iteration and then fetches a tuple from the local
table. But the estimation is made just by accident. If you
defined a foreign table as the first partition, the cost of the Append
would be 100, which is rather wrong.
2. The total cost of an Append node is the sum of its subplans. Maybe in
the case of asynchronous append we need to use some reduction factor?
Yes. For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as long as any sync subpaths exist. If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I fixed cost_append so that it calculates the correct startup
cost. Now the function estimates as follows.
Append (Foreign(100), Foreign(100), Local(0)) => 0;
Append (Local(0), Foreign(100), Foreign(100)) => 0;
Append (Foreign(100), Foreign(100)) => 100;
explain select * from parts;
With Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
Async subplans: 3
-> Async Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Async Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
The SeqScan seems to be the first partition for the parent. It is the
first subnode at cost estimation. The result is right, but it comes
from wrong logic.
Without Async Append:
=====================
Append (cost=0.00..2510.30 rows=106780 width=8)
-> Seq Scan on part_0 parts_1 (cost=0.00..1443.00 rows=100000 width=8)
-> Foreign Scan on part_1 parts_2 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_2 parts_3 (cost=100.00..177.80 rows=2260 width=8)
-> Foreign Scan on part_3 parts_4 (cost=100.00..177.80 rows=2260 width=8)
The startup cost of the Append is the cost of the first subnode, that is, 0.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/10/20 8:05 AM, Kyotaro Horiguchi wrote:
Hello, Andrey.
At Tue, 9 Jun 2020 14:20:42 +0500, Andrey Lepikhov <a.lepikhov@postgrespro.ru> wrote in
On 6/4/20 11:00 AM, Kyotaro Horiguchi wrote:
2. The total cost of an Append node is the sum of its subplans. Maybe in
the case of asynchronous append we need to use some reduction factor?
Yes. For the reason mentioned above, foreign subpaths don't affect
the startup cost of Append as long as any sync subpaths exist. If no
sync subpaths exist, the Append's startup cost is the minimum startup
cost among the async subpaths.
I mean that you could possibly change the computation of the total cost
of the async Append node. It may affect the planner's choice between a
ForeignScan (followed by executing the JOIN locally) and partitionwise
join strategies.
Have you also considered the possibility of a dynamic choice between
synchronous and asynchronous Append (during optimization)? This may be
useful for a query with a LIMIT clause.
--
Andrey Lepikhov
Postgres Professional
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial aggregates.
You can see an example in the attachment. I can't understand why: the
costs of the partitionwise join are less.
The initial script and explains of the query with and without the patch
are in the attachment.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
Thanks for testing, but..
At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial
aggregates. You can see an example in the attachment. I can't understand
why: the costs of the partitionwise join are less.
The initial script and explains of the query with and without the patch
are in the attachment.
I got more or less the same plan as the second one without the patch
(that is, on vanilla master/HEAD, but it used merge joins instead).
I'm not sure what prevented the join pushdown, but the difference between
the two is whether each partitionwise join is pushed down to the
remote side or not. That hardly seems related to the async execution
patch.
Could you tell me how you got the first plan?
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On 6/15/20 1:29 PM, Kyotaro Horiguchi wrote:
Thanks for testing, but..
At Mon, 15 Jun 2020 08:51:23 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
The patch has a problem with partitionwise aggregates.
Asynchronous Append does not allow the planner to use partial
aggregates. You can see an example in the attachment. I can't understand
why: the costs of the partitionwise join are less.
The initial script and explains of the query with and without the patch
are in the attachment.
I got more or less the same plan as the second one without the patch
(that is, on vanilla master/HEAD, but it used merge joins instead).
I'm not sure what prevented the join pushdown, but the difference between
the two is whether each partitionwise join is pushed down to the
remote side or not. That hardly seems related to the async execution
patch.
Could you tell me how you got the first plan?
1. Use a clean current vanilla master.
2. Start two instances with the script 'frgn2n.sh' from the attachment.
There I set the GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true
3. Execute query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;
That's all.
--
Andrey Lepikhov
Postgres Professional
https://postgrespro.com
Thanks.
My conclusion is that the async patch is not the cause of the behavior
change mentioned here.
At Mon, 15 Jun 2020 14:59:18 +0500, "Andrey V. Lepikhov" <a.lepikhov@postgrespro.ru> wrote in
Could you tell me how you got the first plan?
1. Use a clean current vanilla master.
2. Start two instances with the script 'frgn2n.sh' from the attachment.
There I set the GUCs:
enable_partitionwise_join = true
enable_partitionwise_aggregate = true
3. Execute the query:
explain analyze SELECT sum(parts.b)
FROM parts, second
WHERE parts.a = second.a AND second.b < 100;
That's all.
With master/HEAD, I got the second (local join) plan at first for a while,
then got the first (remote join) plan. The cause of the plan change
turned out to be autovacuum on the remote node.
Before the vacuum the result of remote estimation was as follows.
Node2 (remote)
=# EXPLAIN SELECT r4.b FROM (public.part_1 r4 INNER JOIN public.second_1 r8 ON (((r4.a = r8.a)) AND ((r8.b < 100))));
QUERY PLAN
---------------------------------------------------------------------------
Merge Join (cost=2269.20..3689.70 rows=94449 width=4)
Merge Cond: (r8.a = r4.a)
-> Sort (cost=74.23..76.11 rows=753 width=4)
Sort Key: r8.a
-> Seq Scan on second_1 r8 (cost=0.00..38.25 rows=753 width=4)
Filter: (b < 100)
-> Sort (cost=2194.97..2257.68 rows=25086 width=8)
Sort Key: r4.a
-> Seq Scan on part_1 r4 (cost=0.00..361.86 rows=25086 width=8)
(9 rows)
After running a vacuum it changes as follows.
QUERY PLAN
------------------------------------------------------------------------
Hash Join (cost=5.90..776.31 rows=9741 width=4)
Hash Cond: (r4.a = r8.a)
-> Seq Scan on part_1 r4 (cost=0.00..360.78 rows=24978 width=8)
-> Hash (cost=4.93..4.93 rows=78 width=4)
-> Seq Scan on second_1 r8 (cost=0.00..4.93 rows=78 width=4)
Filter: (b < 100)
(6 rows)
That changes the plan on the local side the way you saw. I saw
exactly the same behavior with the async execution patch.
regards.
FYI, the explain results for another plan changed as follows. It was
estimated to return 25839 rows, which is far fewer than 94449, so the
local join beat the remote join.
=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=2194.97..2257.68 rows=25086 width=8)
Sort Key: a
-> Seq Scan on part_1 (cost=0.00..361.86 rows=25086 width=8)
(3 rows)
=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=74.23..76.11 rows=753 width=4)
Sort Key: a
-> Seq Scan on second_1 (cost=0.00..38.25 rows=753 width=4)
Filter: (b < 100)
(4 rows)
They are changed to:
=# EXPLAIN SELECT a, b FROM public.part_1 ORDER BY a ASC NULLS LAST;
QUERY PLAN
------------------------------------------------------------------
Sort (cost=2185.22..2247.66 rows=24978 width=8)
Sort Key: a
-> Seq Scan on part_1 (cost=0.00..360.78 rows=24978 width=8)
(3 rows)
horiguti=# EXPLAIN SELECT a FROM public.second_1 WHERE ((b < 100)) ORDER BY a ASC NULLS LAST;
QUERY PLAN
---------------------------------------------------------------
Sort (cost=7.38..7.57 rows=78 width=4)
Sort Key: a
-> Seq Scan on second_1 (cost=0.00..4.93 rows=78 width=4)
Filter: (b < 100)
(4 rows)
They return 25056 rows, which is far more than 9741 rows, so the remote
join won.
Of course the number of returned rows is not the only factor in the cost
change, but it is the most significant one in this case.
--
Kyotaro Horiguchi
NTT Open Source Software Center