Append with naive multiplexing of FDWs
Hello,
A few years back[1] I experimented with a simple readiness API that
would allow Append to start emitting tuples from whichever Foreign
Scan has data available, when working with FDW-based sharding. I used
that primarily as a way to test Andres's new WaitEventSet stuff and my
kqueue implementation of that, but I didn't pursue it seriously
because I knew we wanted a more ambitious async executor rewrite and
many people had ideas about that, with schedulers capable of jumping
all over the tree etc.
Anyway, Stephen Frost pinged me off-list to ask about that patch, and
asked why we don't just do this naive thing until we have something
better. It's a very localised feature that works only between Append
and its immediate children. The patch makes it work for postgres_fdw,
but it should work for any FDW that can get its hands on a socket.
Here's a quick rebase of that old POC patch, along with a demo. Since
2016, Parallel Append landed, but I didn't have time to think about
how to integrate with that so I did a quick "sledgehammer" rebase that
disables itself if parallelism is in the picture.
=== demo ===
create table t (a text, b text);
create or replace function slow_data(name text) returns setof t as
$$
begin
perform pg_sleep(random());
return query select name, generate_series(1, 100)::text as i;
end;
$$
language plpgsql;
create view t1 as select * from slow_data('t1');
create view t2 as select * from slow_data('t2');
create view t3 as select * from slow_data('t3');
create extension postgres_fdw;
create server server1 foreign data wrapper postgres_fdw options
(dbname 'postgres');
create server server2 foreign data wrapper postgres_fdw options
(dbname 'postgres');
create server server3 foreign data wrapper postgres_fdw options
(dbname 'postgres');
create user mapping for current_user server server1;
create user mapping for current_user server server2;
create user mapping for current_user server server3;
create foreign table ft1 (a text, b text) server server1 options
(table_name 't1');
create foreign table ft2 (a text, b text) server server2 options
(table_name 't2');
create foreign table ft3 (a text, b text) server server3 options
(table_name 't3');
-- create three remote shards
create table pt (a text, b text) partition by list (a);
alter table pt attach partition ft1 for values in ('ft1');
alter table pt attach partition ft2 for values in ('ft2');
alter table pt attach partition ft3 for values in ('ft3');
-- see that tuples come back in the order that they're ready
select * from pt where b like '42';
[1]: /messages/by-id/CAEepm=1CuAWfxDk==jZ7pgCDCv52fiUnDSpUvmznmVmRKU5zpA@mail.gmail.com
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Multiplexing-Append-POC.patch (application/octet-stream, +447 -17)
On Wed, Sep 4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
> Here's a quick rebase of that old POC patch, along with a demo. Since
> 2016, Parallel Append landed, but I didn't have time to think about
> how to integrate with that so I did a quick "sledgehammer" rebase that
> disables itself if parallelism is in the picture.
Yes, sharding has been waiting on parallel FDW scans. Would this work
for parallel partition scans if the partitions were FDWs?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On Sat, Sep 28, 2019 at 4:20 AM Bruce Momjian <bruce@momjian.us> wrote:
> Yes, sharding has been waiting on parallel FDW scans. Would this work
> for parallel partition scans if the partitions were FDWs?
Yeah, this works for partitions that are FDWs (as shown), but only for
Append, not for Parallel Append. So you'd have parallelism in the
sense that your N remote shard servers are all doing stuff at the same
time, but it couldn't be in a parallel query on your 'home' server,
which is probably good for things that push down aggregation and bring
back just a few tuples from each shard, but bad for anything wanting
to ship back millions of tuples to chew on locally. Do you think
that'd be useful enough on its own?
The problem is that parallel safe non-partial plans (like postgres_fdw
scans) are exclusively 'claimed' by one process under Parallel Append,
so with the patch as posted, if you modify it to allow parallelism
then it'll probably give correct answers but nothing prevents a single
process from claiming and starting all the scans and then waiting for
them to be ready, while the other processes miss out on doing any work
at all. There's probably some kludgy solution involving not letting
any one worker start more than X, and some space cadet solution
involving passing sockets around and teaching libpq to hand over
connections at certain controlled phases of the protocol (due to lack
of threads), but nothing like that has jumped out as the right path so
far.
One idea that seems promising but requires a bunch more infrastructure
is to offload the libpq multiplexing to a background worker that owns
all the sockets, and have it push tuples into a multi-consumer shared
memory queue that regular executor processes could read from. I have
been wondering if that would be best done by each FDW implementation,
or if there is a way to make a generic infrastructure for converting
parallel-safe executor nodes into partial plans by the use of a
'Scatter' (opposite of Gather) node that can spread the output of any
node over many workers.
If you had that, you'd still want a way for Parallel Append to be
readiness-based, but it would probably look a bit different to this
patch because it'd need to use (vapourware) multiconsumer shm queue
readiness, not fd readiness. And another kind of fd-readiness
multiplexing would be going on inside the new (vapourware) worker that
handles all the libpq connections (and maybe other kinds of work for
other FDWs that are able to expose a socket).
On Sun, Nov 17, 2019 at 09:54:55PM +1300, Thomas Munro wrote:
> Yeah, this works for partitions that are FDWs (as shown), but only for
> Append, not for Parallel Append. So you'd have parallelism in the
> sense that your N remote shard servers are all doing stuff at the same
> time, but it couldn't be in a parallel query on your 'home' server,
> which is probably good for things that push down aggregation and bring
> back just a few tuples from each shard, but bad for anything wanting
> to ship back millions of tuples to chew on locally. Do you think
> that'd be useful enough on its own?
Yes, I think so. There are many data warehouse queries that want to
return only aggregate values, or filter for a small number of rows.
Even OLTP queries might return only a few rows from multiple partitions.
This would allow for a proof-of-concept implementation so we can see how
realistic this approach is.
> The problem is that parallel safe non-partial plans (like postgres_fdw
> scans) are exclusively 'claimed' by one process under Parallel Append,
> so with the patch as posted, if you modify it to allow parallelism
> then it'll probably give correct answers but nothing prevents a single
> process from claiming and starting all the scans and then waiting for
> them to be ready, while the other processes miss out on doing any work
> at all. There's probably some kludgy solution involving not letting
> any one worker start more than X, and some space cadet solution
> involving passing sockets around and teaching libpq to hand over
> connections at certain controlled phases of the protocol (due to lack
> of threads), but nothing like that has jumped out as the right path so
> far.
I am unclear how many queries can do any meaningful work until all
shards have given their full results.
Hello.
At Sat, 30 Nov 2019 14:26:11 -0500, Bruce Momjian <bruce@momjian.us> wrote in
> I am unclear how many queries can do any meaningful work until all
> shards have given their full results.
There's my pending (somewhat stale) patch, which allows local scans to
run while waiting for the remote servers.
/messages/by-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp
I (or we) wanted to introduce the asynchronous node mechanism as the
basis of an async-capable postgres_fdw. The reason it has stalled is
that I am waiting for the executor change to a push-up style, on which
the async-node mechanism would be built. If that isn't going to happen
soon, I'd like to continue that work.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Dec 5, 2019 at 4:26 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
> There's my pending (somewhat stale) patch, which allows local scans to
> run while waiting for the remote servers.
> /messages/by-id/20180515.202945.69332784.horiguchi.kyotaro@lab.ntt.co.jp
After rereading some threads to remind myself what happened here...
right, my little patch began life in March 2016[1] when I wanted a
test case to test Andres's work on WaitEventSets, and your patch set
started a couple of months later and is vastly more ambitious[2][3].
It wants to escape from the volcano give-me-one-tuple-or-give-me-EOF
model. And I totally agree that there are lots of reasons to want to
do that (including yielding to other parts of the plan instead of
waiting for I/O and locks, and some parallelism primitives enabling
new kinds of parallelism), and I'm hoping to help with some small
pieces of that if I can.
My patch set (rebased upthread) was extremely primitive, with no new
planner concepts, and added only a very simple new executor node
method: ExecReady(). Append used that to try to ask its children if
they'd like some time to warm up. By default, ExecReady() says "I
don't know what you're talking about, go away", but FDWs can provide
an implementation that says "yes, please call me again when this fd is
ready" or "yes, I am ready, please call ExecProc() now". It doesn't
deal with anything more complicated than that, and in particular it
doesn't work if there are extra planner nodes in between Append and
the foreign scan. (It also doesn't mix particularly well with
parallelism, as mentioned.)
The reason I reposted this unambitious work is that Stephen keeps
asking me why we don't consider the stupidly simple thing that would
help with simple foreign partition-based queries today, instead of
waiting for someone to redesign the entire executor, because that's
... really hard.
[1]: /messages/by-id/CAEepm=1CuAWfxDk==jZ7pgCDCv52fiUnDSpUvmznmVmRKU5zpA@mail.gmail.com
[2]: /messages/by-id/CA+Tgmobx8su_bYtAa3DgrqB+R7xZG6kHRj0ccMUUshKAQVftww@mail.gmail.com
[3]: /messages/by-id/CA+TgmoaXQEt4tZ03FtQhnzeDEMzBck+Lrni0UWHVVgOTnA6C1w@mail.gmail.com
On Thu, Dec 5, 2019 at 05:45:24PM +1300, Thomas Munro wrote:
> The reason I reposted this unambitious work is that Stephen keeps
> asking me why we don't consider the stupidly simple thing that would
> help with simple foreign partition-based queries today, instead of
> waiting for someone to redesign the entire executor, because that's
> ... really hard.
I agree with Stephen's request. We have been waiting for the executor
rewrite for a while, so let's just do something simple and see how it
performs.
On Thu, Dec 5, 2019 at 1:12 PM Bruce Momjian <bruce@momjian.us> wrote:
> I agree with Stephen's request. We have been waiting for the executor
> rewrite for a while, so let's just do something simple and see how it
> performs.
I'm sympathetic to the frustration here, and I think it would be great
if we could find a way forward that doesn't involve waiting for a full
rewrite of the executor. However, I seem to remember that when we
tested the various patches that various people had written for this
feature (I wrote one, too) they all had a noticeable performance
penalty in the case of a plain old Append that involved no FDWs and
nothing asynchronous. I don't think it's OK to have, say, a 2%
regression on every query that involves an Append, because especially
now that we have partitioning, that's a lot of queries.
I don't know whether this patch has that kind of problem. If it
doesn't, I would consider that a promising sign.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Dec 6, 2019 at 9:20 AM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know whether this patch has that kind of problem. If it
> doesn't, I would consider that a promising sign.
I'll look into that. If there is a measurable impact, I suspect it
can be avoided by, for example, installing a different ExecProcNode
function.
At Fri, 6 Dec 2019 10:03:44 +1300, Thomas Munro <thomas.munro@gmail.com> wrote in
> I'll look into that. If there is a measurable impact, I suspect it
> can be avoided by, for example, installing a different ExecProcNode
> function.
Replacing ExecProcNode perfectly isolates the additional processing in
ExecAppendAsync. Thus, for pure local appends, the patch can affect
performance only through the planner and executor initialization, and
I don't believe that can be large enough to be observable in a large
scan.
As mentioned upthread, the patch accelerates all remote cases when
fetch_size is >= 200. The problem was that local scans seemed slightly
slowed down. I dusted off the old patch (FWIW, attached) and will
re-run it on the current development environment (and re-check the
code).
regards.
On Thu, Dec 5, 2019 at 03:19:50PM -0500, Robert Haas wrote:
> I don't think it's OK to have, say, a 2% regression on every query
> that involves an Append, because especially now that we have
> partitioning, that's a lot of queries.
Certainly any overhead on normal queries would be unacceptable.
Hello.
I think I can say that this patch doesn't slow non-AsyncAppend,
non-postgres_fdw scans.
At Mon, 9 Dec 2019 12:18:44 -0500, Bruce Momjian <bruce@momjian.us> wrote in
> Certainly any overhead on normal queries would be unacceptable.
I took performance numbers on the current shape of the async execution
patch for the following scan cases.
t0 : single local table (parallel disabled)
pll : local partitioning (local Append, parallel disabled)
ft0 : single foreign table
pf0 : inheritance on 4 foreign tables, single connection
pf1 : inheritance on 4 foreign tables, 4 connections
ptf0 : partition on 4 foreign tables, single connection
ptf1 : partition on 4 foreign tables, 4 connections
The benchmarking system is configured as follows on a single
machine.
[ benchmark client ]
| |
(localhost:5433) (localhost:5432)
| |
+----+ | +------+ |
| V V V | V
| [master server] | [async server]
| V | V
+--fdw--+ +--fdw--+
The patch works roughly in the following steps:
1. The planner decides how many children of an Append can run
asynchronously (they are called async-capable).
2. At ExecInit time, if an Append has no async-capable children,
ExecAppend (exactly the same function as today) is set as
ExecProcNode; otherwise ExecAppendAsync is used.
If the infrastructure part of the patch caused any degradation, the
"t0" (scan on a single local table) and/or "pll" (scan on a local
partitioned table) tests would get slower.
3. postgres_fdw always runs the async-capable code path.
If the postgres_fdw part caused degradation, ft0 would reflect that.
The tables have two integer columns and the query computes sum(a) over
all tuples. With the default fetch_size = 100, each number is the run
time in ms, averaged over 14 runs.
        master  patched    gain
t0        7325     7130   +2.7%
pll       4558     4484   +1.7%
ft0       3670     3675   -0.1%
pf0       2322     1550  +33.3%
pf1       2367     1475  +37.7%
ptf0      2517     1624  +35.5%
ptf1      2343     1497  +36.2%
With a larger fetch_size (200), the gain mysteriously decreases for the
cases sharing a single connection (pf0, ptf0), but the others don't
seem to change much.
        master  patched    gain
t0        7212     7252   -0.6%
pll       4546     4397   +3.3%
ft0       3712     3731   -0.5%
pf0       2131     1570  +26.4%
pf1       1926     1189  +38.3%
ptf0      2001     1557  +22.2%
ptf1      1903     1193  +37.4%
FWIW, the test scripts are attached.
gentblr2.sql: table creation script.
testrun.sh : benchmarking script.
regards.
Hi Hackers,
Sharing the email below from Movead Li. I believe he wanted to share
the benchmarking results as a response to this email thread, but it
started a new thread. Here it is...
"
Hello
I have tested the patch with a partitioned table with several foreign
partitions living on separate data nodes. The initial test was done
with a partitioned table having 3 foreign partitions, across a variety
of scale factors. The second test used a fixed amount of data per data
node, but the number of data nodes was increased incrementally to see
the performance impact as more nodes are added to the cluster. The
third test is similar to the initial test but with much larger data
and 4 nodes.
A summary of the results is given below, and the test script is attached:
*Test ENV*
Parent node: 2 cores, 8 GB
Child nodes: 2 cores, 4 GB
*Test one:*
1.1 The partition structure is as below:
[ ptf:(a int, b int, c varchar)]
(Parent node)
| | |
[ptf1] [ptf2] [ptf3]
(Node1) (Node2) (Node3)
The table data is partitioned across the nodes; the test was done
using a simple select query and a count aggregate, as shown below.
The result is an average over multiple executions of each query, to
ensure reliable and consistent results.
①select * from ptf where b = 100;
②select count(*) from ptf;
1.2. Test Results
For ① result:
scale per node   master  patched  performance
2G                   7s       2s         350%
5G                 173s      63s         275%
10G                462s     156s         296%
20G                968s     327s         296%
30G               1472s     494s         297%
For ② result:
scale per node   master  patched  performance
2G                1079s     291s         370%
5G                2688s     741s         362%
10G               4473s    1493s         299%
The aggregate test takes too long, so it was run with smaller data
sizes.
1.3. summary
With the table partitioned over 3 nodes, the average performance gain
across a variety of scale factors is almost 300%.
*Test Two*
2.1 The partition structure is as below:
[ ptf:(a int, b int, c varchar)]
(Parent node)
| | |
[ptf1] ... [ptfN]
(Node1) (...) (NodeN)
①select * from ptf
②select * from ptf where b = 100;
This test was done with the same amount of data per node, but with the
table partitioned across N nodes. Each variation (master or patched)
was tested at least 3 times to get reliable and consistent results.
The purpose of the test is to see the impact on performance as the
number of data nodes increases.
2.2 The results
For ① result (scale per node = 2G):
node count   master  patched  performance
2              432s     180s         240%
3              636s     223s         285%
4              830s     283s         293%
5             1065s     361s         295%
For ② result (scale per node = 10G):
node count   master  patched  performance
2              281s     140s         201%
3              421s     140s         300%
4              562s     141s         398%
5              702s     141s         497%
6              833s     139s         599%
7              986s     141s         699%
8             1125s     140s         803%
*Test Three*
This test is similar to [test one] but with much larger data and
4 nodes.
For ① result:
scale per node   master  patched  performance
100G              6592s    1649s         399%
For ② result:
scale per node   master  patched  performance
100G             35383s   12363s         286%
The results show it works well with much larger data volumes.
*Summary*
The patch is pretty good; it works well when little data is returned
to the parent node. The patch doesn't provide parallel FDW scan: it
ensures that child nodes can send data to the parent in parallel, but
the parent can only sequentially process the data from the data nodes.
Provided there is no performance degradation for non-FDW Append
queries, I would recommend considering this patch as an interim
solution while we are waiting for parallel FDW scan.
"
On Thu, Dec 12, 2019 at 5:41 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
> I think I can say that this patch doesn't slow non-AsyncAppend,
> non-postgres_fdw scans.
On Tue, Jan 14, 2020 at 02:37:48PM +0500, Ahsan Hadi wrote:
> Provided there is no performance degradation for non-FDW Append
> queries, I would recommend considering this patch as an interim
> solution while we are waiting for parallel FDW scan.
Wow, these are very impressive results!
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
Thank you very much for the testing of the patch, Ahsan!
At Wed, 15 Jan 2020 15:41:04 -0500, Bruce Momjian <bruce@momjian.us> wrote in
On Tue, Jan 14, 2020 at 02:37:48PM +0500, Ahsan Hadi wrote:
Summary
The patch is pretty good; it works well when there is little data
coming back to the parent node. The patch doesn't provide parallel FDW
scan: it ensures that child nodes can send data to the parent in
parallel, but the parent can only process the data from the data nodes
sequentially.
"Parallel scan" at the moment means multiple workers fetch unique
blocks from *one* table in an arbitrated manner. In this sense
"parallel FDW scan" means multiple local workers fetch unique bundles
of tuples from *one* foreign table, which means it is running on a
single session. That doesn't offer an advantage.
If parallel query processing worked in a worker-per-table mode,
especially on partitioned tables, maybe the current FDW would work
without much modification. But I believe asynchronous append on
foreign tables in a single process is far more resource-efficient and
moderately faster than parallel append.
Provided there is no performance degradation for non-FDW append
queries, I would recommend considering this patch as an interim
solution while we are waiting for parallel FDW scan.

Wow, these are very impressive results!
Thanks.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Thu, Jan 16, 2020 at 9:41 AM Bruce Momjian <bruce@momjian.us> wrote:
On Tue, Jan 14, 2020 at 02:37:48PM +0500, Ahsan Hadi wrote:
Summary
The patch is pretty good; it works well when there is little data
coming back to the parent node. The patch doesn't provide parallel FDW
scan: it ensures that child nodes can send data to the parent in
parallel, but the parent can only process the data from the data nodes
sequentially.

Provided there is no performance degradation for non-FDW append
queries, I would recommend considering this patch as an interim
solution while we are waiting for parallel FDW scan.

Wow, these are very impressive results!
+1
Thanks Ahsan and Movead. Could you please confirm which patch set you tested?
Hello Kyotaro,
"Parallel scan" at the moment means multiple workers fetch unique
blocks from *one* table in an arbitrated manner. In this sense
"parallel FDW scan" means multiple local workers fetch unique bundles
of tuples from *one* foreign table, which means it is running on a
single session. That doesn't offer an advantage.
Maybe it's not "parallel FDW scan" but rather "parallel shards scan":
the local workers each pick a foreign partition to scan. I once drew a
picture of that, which you can see at the link below.
https://www.highgo.ca/2019/08/22/parallel-foreign-scan-of-postgresql/
I think "parallel shards scan" makes sense in this way.
If parallel query processing worked in a worker-per-table mode,
especially on partitioned tables, maybe the current FDW would work
without much modification. But I believe asynchronous append on
foreign tables in a single process is far more resource-efficient and
moderately faster than parallel append.
As the test results show, the current patch cannot gain more
performance when a huge number of tuples is returned. The "parallel
shards scan" method can still work well there, because "parallel" can
make full use of the CPUs while "asynchronous" can't.
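For what it's worth, here is my reading of the "parallel shards scan" idea as a sketch (plain Python threads standing in for parallel workers; all names are mine, not from the blog post): each local worker claims a whole foreign partition and scans it, so per-tuple work is spread across workers instead of funneling through a single backend.

```python
# Worker-per-shard scan: each worker pulls whole partitions off a work
# queue and scans them, collecting rows into a shared result queue.
import queue
import threading

def parallel_shards_scan(shards, scan_fn, n_workers=4):
    """Scan each shard in its own worker; return all rows (unordered)."""
    todo = queue.Queue()
    for shard in shards:
        todo.put(shard)
    rows = queue.Queue()

    def worker():
        while True:
            try:
                shard = todo.get_nowait()
            except queue.Empty:
                return                      # no shards left for this worker
            for row in scan_fn(shard):      # e.g. fetch over an FDW connection
                rows.put(row)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(rows.queue)
```

Unlike the asynchronous append, the per-tuple `scan_fn` work here runs concurrently across workers, which is the claimed advantage when many tuples come back.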
Highgo Software (Canada/China/Pakistan)
URL : http://www.highgo.ca/
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
Thanks!
At Wed, 29 Jan 2020 14:41:07 +0800, Movead Li <movead.li@highgo.ca> wrote in
"Parallel scan" at the moment means multiple workers fetch unique
blocks from *one* table in an arbitrated manner. In this sense
"parallel FDW scan" means multiple local workers fetch unique bundles
of tuples from *one* foreign table, which means it is running on a
single session. That doesn't offer an advantage.

Maybe it's not "parallel FDW scan" but rather "parallel shards scan":
the local workers each pick a foreign partition to scan. I once drew a
picture of that, which you can see at the link below.
https://www.highgo.ca/2019/08/22/parallel-foreign-scan-of-postgresql/
I think "parallel shards scan" makes sense in this way.
It is "asynchronous append on async-capable'd postgres-fdw scans". It
could be called as such in the sense that it is intended to be used
with sharding.
If parallel query processing worked in a worker-per-table mode,
especially on partitioned tables, maybe the current FDW would work
without much modification. But I believe asynchronous append on
foreign tables in a single process is far more resource-efficient and
moderately faster than parallel append.

As the test results show, the current patch cannot gain more
performance when a huge number of tuples is returned. The "parallel
shards scan" method can still work well there, because "parallel" can
make full use of the CPUs while "asynchronous" can't.
Did you look at my benchmarking results upthread? They show a
significant gain even when gathering a large number of tuples from
multiple servers, or even from a single server. That is because of its
asynchronous nature.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Hello,
It is "asynchronous append on async-capable'd postgres-fdw scans". It
could be called as such in the sense that it is intended to be used
with sharding.
Yes that's it.
Did you look at my benchmarking results upthread? They show a
significant gain even when gathering a large number of tuples from
multiple servers, or even from a single server. That is because of its
asynchronous nature.

I mean that it gains performance at first, but it hits a bottleneck as
the number of nodes increases.
For example:
With 2 nodes, it will gain 200% performance.
With 3 nodes, it will gain 300% performance.
However,
With 4 nodes, it gains only 300% performance.
With 5 nodes, it gains only 300% performance.
...
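A toy model (my own illustration, not taken from the thread's measurements) of why such a plateau could appear: the remote waits overlap across nodes, but every tuple still passes through the single local backend, so the speedup is capped at (remote + local) / local.

```python
# Hypothetical speedup model for asynchronous append over n foreign nodes.
# Assumed parameters: 'remote' is the per-node wait on the foreign server,
# 'local' is the per-node processing cost in the single local backend.
def async_speedup(n_nodes, remote, local):
    """Serial plan costs n*(remote+local); with async append the remote
    waits overlap while local processing stays serial: remote + n*local."""
    serial = n_nodes * (remote + local)
    overlapped = remote + n_nodes * local
    return serial / overlapped
```

With remote = 2 and local = 1, the speedup approaches 3x and then flattens, matching the "300% and no further" shape described above.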
----
Highgo Software (Canada/China/Pakistan)
URL : www.highgo.ca
EMAIL: mailto:movead(dot)li(at)highgo(dot)ca
On Sun, Dec 1, 2019 at 4:26 AM Bruce Momjian <bruce@momjian.us> wrote:
On Sun, Nov 17, 2019 at 09:54:55PM +1300, Thomas Munro wrote:
On Sat, Sep 28, 2019 at 4:20 AM Bruce Momjian <bruce@momjian.us> wrote:
On Wed, Sep 4, 2019 at 06:18:31PM +1200, Thomas Munro wrote:
A few years back[1] I experimented with a simple readiness API that
would allow Append to start emitting tuples from whichever Foreign
Scan has data available, when working with FDW-based sharding. I used
that primarily as a way to test Andres's new WaitEventSet stuff and my
kqueue implementation of that, but I didn't pursue it seriously
because I knew we wanted a more ambitious async executor rewrite and
many people had ideas about that, with schedulers capable of jumping
all over the tree etc.

Anyway, Stephen Frost pinged me off-list to ask about that patch, and
asked why we don't just do this naive thing until we have something
better. It's a very localised feature that works only between Append
and its immediate children. The patch makes it work for postgres_fdw,
but it should work for any FDW that can get its hands on a socket.

Here's a quick rebase of that old POC patch, along with a demo. Since
2016, Parallel Append landed, but I didn't have time to think about
how to integrate with that so I did a quick "sledgehammer" rebase that
disables itself if parallelism is in the picture.

Yes, sharding has been waiting on parallel FDW scans. Would this work
for parallel partition scans if the partitions were FDWs?

Yeah, this works for partitions that are FDWs (as shown), but only for
Append, not for Parallel Append. So you'd have parallelism in the
sense that your N remote shard servers are all doing stuff at the same
time, but it couldn't be in a parallel query on your 'home' server,
which is probably good for things that push down aggregation and bring
back just a few tuples from each shard, but bad for anything wanting
to ship back millions of tuples to chew on locally. Do you think
that'd be useful enough on its own?

Yes, I think so. There are many data warehouse queries that want to
return only aggregate values, or filter for a small number of rows.
Even OLTP queries might return only a few rows from multiple partitions.
This would allow for a proof-of-concept implementation so we can see how
realistic this approach is.
+1
Best regards,
Etsuro Fujita