[DESIGN] ParallelAppend
Hello,
I have recently been working on and investigating a ParallelAppend
feature for the next commit fest. Below is my design proposal.
1. Concept
----------
The concept is quite simple; anyone may have considered it more than once.
A ParallelAppend node kicks off background worker processes to execute
its child nodes in parallel / asynchronously.
It aims to improve the performance of scans over large partitioned
tables from the standpoint of overall throughput; latency of the
first few hundred rows is out of scope for this project.
From the standpoint of technology trends, it primarily tries to utilize
multi-core capability within a single system, but it also enables
scaling out to a distributed database environment using foreign-table
inheritance.
Its behavior is very similar to the Funnel node except for a few
points, so we can reuse the infrastructure we discussed at length
during the v9.5 development cycle.
2. Problems to be solved
-------------------------
Typical OLAP workloads involve many joins and scans over large
tables, which are often partitioned; their KPI is query response
time, while only a small number of sessions are active simultaneously.
So we need to run a single query as fast as possible, even if it
consumes more computing resources than a typical OLTP workload.
The current heap-scan implementation is painful when we look at its
behavior from this standpoint - how many rows we can read within a
certain time - because of its synchronous manner.
In the worst case, when a SeqScan node tries to fetch the next tuple,
heap_getnext() looks up a block in shared buffers, then ReadBuffer()
calls the storage manager to read the target block from the filesystem
if it is not already in a buffer. Next, the operating system puts the
calling process to sleep until the required I/O completes.
Most cases are resolved at an earlier stage than this worst case;
however, the best scenario we can expect is that the next tuple
already appears at the top of the message queue (with visibility
checks already done) with no fall down to the buffer manager
or deeper.
If we can run multiple scans in parallel / asynchronously, the
operating system can assign a CPU core to another process while one
waits for I/O, which eventually improves I/O density and enables
higher processing throughput.
An Append node is an ideal point to parallelize because:
- child nodes can reside in physically different locations via
  tablespaces, so further tuning is possible according to the system
  landscape.
- it can control, on a per-subplan basis, whether a subplan is
  actually executed on a background worker. If the subplans cover
  both large and small tables, ParallelAppend may kick off background
  workers to scan the large tables only, and scan the small tables
  by itself.
- as with the Funnel node, we don't need to care about enhancing
  individual node types. SeqScan, IndexScan, ForeignScan and others
  can run as usual, but actually in parallel.
3. Implementation
------------------
* Plan & Cost
ParallelAppend shall appear wherever Append can appear, except when
used as a dummy path. So, I'll enhance set_append_rel_pathlist() to
add both an AppendPath and a ParallelAppendPath, with a cost for each.
The cost estimation logic will need further discussion; however,
I expect logic along the following lines to estimate the cost of
ParallelAppend:
1. Sum startup_cost and run_cost for each child path node, but
   distinguish synchronous from asynchronous children. A path node
   whose total cost is less than
   (parallel_setup_cost + its total cost / parallel_append_degree
    + number of rows * cpu_tuple_comm_cost)
   makes no sense to run on a background worker.
2. parallel_setup_cost * (# of asynchronous nodes) is added to the
   sum of the startup_cost of the asynchronous nodes.
3. The sum of the run_cost of the asynchronous nodes is divided by
   parallel_append_degree, then cpu_tuple_comm_cost * (total # of
   rows from the asynchronous nodes) is added.
4. The synchronous and asynchronous costs are added together; that
   becomes the cost of ParallelAppend.
Obviously, this stands on the viewpoint that cost reflects the
response time of the underlying plan. So the cost of ParallelAppend
can be smaller than the sum of its underlying child nodes.
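As a sanity check of steps 1-4 above, here is a toy model in C. The
struct, the helper names, and the parameter values are my own
illustrative assumptions, not planner code; parallel_append_degree and
cpu_tuple_comm_cost are the parameters proposed above.

```c
#include <assert.h>

/* Stand-in for a child path's cost summary (hypothetical struct,
 * not the planner's Path). */
typedef struct ChildCost
{
    double startup_cost;
    double run_cost;
    double rows;
    int    is_async;    /* run it on a background worker? */
} ChildCost;

/* Proposed GUC-like parameters; the names follow the proposal,
 * the values are illustrative only. */
static const double parallel_setup_cost = 1000.0;
static const double cpu_tuple_comm_cost = 0.01;
static const int    parallel_append_degree = 4;

/* Step 1: a child is worth a worker only if its total cost exceeds
 * the setup + divided-run + tuple-transfer overhead. */
static int
worth_background_worker(const ChildCost *c)
{
    double total = c->startup_cost + c->run_cost;

    return total > (parallel_setup_cost
                    + total / parallel_append_degree
                    + c->rows * cpu_tuple_comm_cost);
}

/* Steps 2-4: combine the synchronous and asynchronous parts. */
static double
parallel_append_total_cost(const ChildCost *children, int n)
{
    double sync_cost = 0.0;
    double async_startup = 0.0;
    double async_run = 0.0;
    double async_rows = 0.0;
    int    n_async = 0;

    for (int i = 0; i < n; i++)
    {
        if (children[i].is_async)
        {
            async_startup += children[i].startup_cost;
            async_run += children[i].run_cost;
            async_rows += children[i].rows;
            n_async++;
        }
        else
            sync_cost += children[i].startup_cost + children[i].run_cost;
    }

    /* Step 2: worker setup is charged once per asynchronous node. */
    async_startup += parallel_setup_cost * n_async;

    /* Steps 3 and 4: run cost shrinks by the degree, tuple transfer
     * is added, then both parts are simply summed. */
    return sync_cost + async_startup
        + async_run / parallel_append_degree
        + async_rows * cpu_tuple_comm_cost;
}
```

With one large asynchronous child and one small synchronous one, the
combined cost comes out well below the plain Append sum, which is
exactly the "cost reflects response time" viewpoint described above.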
* Execution
Like the Funnel node, it kicks off background workers in the
ExecProcNode handler; thus, its startup may come later than in
Fujita-san's approach if the first call of ParallelAppend is late -
for example, when ParallelAppend is located under a HashJoin whose
inner Hash loads billions of rows.
Even though I expect ExecParallelAppend to use, at least, simple
round-robin scheduling like funnel_getnext(), we may give priority to
synchronous nodes over asynchronous ones just after background worker
startup.
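The round-robin dispatch mentioned above can be sketched with a toy
model; child_trynext and parallel_append_getnext are hypothetical
names, and each worker's tuple queue is simulated by a plain array
rather than a real shared-memory queue.

```c
#include <assert.h>

#define NCHILD 3

/* Hypothetical stand-in for a child's tuple queue: tuple ids ending
 * with a negative terminator. The real executor would poll a
 * shared-memory message queue without blocking. */
typedef struct ChildQueue
{
    const int *tuples;
    int        pos;
    int        done;
} ChildQueue;

/* Non-blocking fetch: a tuple id, or -1 once this child is drained. */
static int
child_trynext(ChildQueue *c)
{
    if (!c->done)
    {
        int t = c->tuples[c->pos];

        if (t >= 0)
        {
            c->pos++;
            return t;
        }
        c->done = 1;
    }
    return -1;
}

/* Round-robin dispatch in the spirit of funnel_getnext(): rotate a
 * cursor over the children, skip drained ones, -1 when all are done. */
static int
parallel_append_getnext(ChildQueue *children, int *cursor)
{
    for (int tried = 0; tried < NCHILD; tried++)
    {
        ChildQueue *c = &children[*cursor];
        int         t;

        *cursor = (*cursor + 1) % NCHILD;
        t = child_trynext(c);
        if (t >= 0)
            return t;
    }
    return -1;
}
```

Tuples from the three children come back interleaved rather than one
child at a time, which is the throughput-oriented behavior the
proposal aims for.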
4. Further challenges
----------------------
* Serialization of CustomScan via outfuncs.c/readfuncs.c
Because the methods field is basically a set of per-process pointers,
we need an infrastructure to reproduce the same method table on the
background worker process, identified by name.
(I am also trying to design this.)
* Duplication of parallelism
If Funnel+PartialSeqScan is located under ParallelAppend, directly
or indirectly, it eventually leads a background worker process to
launch further background workers. Is that an expected usage of the
current background workers?
* Join pushdown
Distribution of nested-loop and hash joins may benefit from parallel
processing, and from a reduction of hash size if the CHECK()
constraints of the individual partitioned tables identify rows that
obviously cannot be joined.
Also see the thread:
[idea] table partition + hash join: http://bit.ly/1S2xpHT
My colleague has already started to investigate and develop this
feature based on the existing Append, to reduce num_batches.
As an aside, my GpuJoin feature works most effectively if the entire
inner relations can be loaded into a hash table in GPU RAM, so such
features are very welcome.
* Sort break-down
If a merge join has a ParallelAppend node on its left or right input,
we may be able to compare its cost with MergeParallelAppend + Sort on
the partial relations.
* Aggregate Push Down
This is exactly what I want to do.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Hello, can I ask some questions?
I suppose we can take this as the analog of ParallelSeqScan. I
can see little distinction between Append(ParallelSeqScan) and
ParallelAppend(SeqScan). What difference is there between them?
If other nodes will have the same functionality as you mention at
the end of this proposal, it might be better for some part of
this feature to be implemented as part of the existing executor
itself, rather than as a dedicated additional node, just as my
asynchronous FDW execution patch partially does. (Although it
lacks the planner part and bgworker launching.) If that is the
case, it might be better to modify ExecProcNode so that it
supports both the in-process and inter-bgworker cases through a
single API.
What do you think about this?
regards,
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> ParallelAppend shall appear where Append can appear except for the
> usage for dummy. So, I'll enhance set_append_rel_pathlist() to add
> both of AppendPath and ParallelAppendPath with cost for each.
Is there a real need to have a new node like ParallelAppendPath?
Can't we have a Funnel node beneath the Append node, so that each
worker is responsible for a SeqScan on one inherited child
relation? Something like:
Append
 --> Funnel
      --> SeqScan on rel1
      --> SeqScan on rel2
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
> Hello, can I ask some questions?
> I suppose we can take this as the analog of ParallelSeqScan. I
> can see little distinction between Append(ParallelSeqScan) and
> ParallelAppend(SeqScan). What difference is there between them?
Append does not start to execute the second or later child until the
first child reaches the end of its scan.
On the other hand, ParallelAppend will kick off all the child nodes
(almost) simultaneously.
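The difference can be illustrated with a toy model of tuple arrival
order (illustrative code only, not executor internals; ParallelAppend's
interleaving is modeled as strict round-robin):

```c
#include <assert.h>

#define NCHILD 3

/* Toy model: each child yields a fixed number of tuples; we record
 * which child each returned tuple came from. */
static const int ntuples[NCHILD] = {2, 1, 2};

/* Plain Append: drain child 0 completely, then child 1, and so on. */
static int
append_order(int *out)
{
    int n = 0;

    for (int c = 0; c < NCHILD; c++)
        for (int i = 0; i < ntuples[c]; i++)
            out[n++] = c;
    return n;
}

/* ParallelAppend: every child is started together, so tuples arrive
 * interleaved (modeled here as strict round-robin). */
static int
parallel_append_order(int *out)
{
    int left[NCHILD];
    int n = 0;
    int remaining = 0;

    for (int c = 0; c < NCHILD; c++)
    {
        left[c] = ntuples[c];
        remaining += left[c];
    }
    for (int c = 0; remaining > 0; c = (c + 1) % NCHILD)
    {
        if (left[c] > 0)
        {
            out[n++] = c;
            left[c]--;
            remaining--;
        }
    }
    return n;
}
```

Plain Append never touches child 1 until child 0 is exhausted, while
the parallel variant yields tuples from all three children from the
very first sweep.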
> If other nodes will have the same functionality as you mention at
> the end of this proposal, it might be better for some part of
> this feature to be implemented as part of the existing executor
> itself, rather than as a dedicated additional node, just as my
> asynchronous FDW execution patch partially does. (Although it
> lacks the planner part and bgworker launching.) If that is the
> case, it might be better to modify ExecProcNode so that it
> supports both the in-process and inter-bgworker cases through a
> single API.
> What do you think about this?
Its downside is that we would need to adjust all the existing nodes to
follow the new executor capability. At the moment, we have 38 node
types derived from Plan. I think it is not an easy job to review a
patch that changes several dozen files.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Is there a real need to have new node like ParallelAppendPath?
> Can't we have Funnel node beneath AppendNode and then each
> worker will be responsible to have SeqScan on each inherited child
> relation. Something like
> Append
>  --> Funnel
>       --> SeqScan on rel1
>       --> SeqScan on rel2
If Funnel can handle both horizontal and vertical parallelism,
that is a great simplification. I don't insist on a new node.
Once Funnel gets the capability to have multiple child nodes, the
Append node above will probably be gone. I expect
set_append_rel_pathlist() to add two paths based on Append and Funnel;
the planner will then choose the cheaper one according to its cost.
We will also need to pay attention to other issues that arise when
Funnel kicks off background workers towards asymmetric relations.
If the number of rows of the individual child nodes varies widely, we
may want to assign 10 background workers to scan rel1 with
PartialSeqScan, while rel2 may have so few rows that its total_cost is
smaller than the cost to launch a worker.
In this case, Funnel has child nodes to be executed both
asynchronously and synchronously.
If the cheapest path of a child relation is a pair of Funnel and
PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a
Funnel node that performs like Append needs to pull up the underlying
Funnels and assign an equivalent number of workers, as follows:
Append
 --> Funnel
      --> PartialSeqScan on rel1 (num_workers = 4)
 --> Funnel
      --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3
shall be rewritten to
Funnel
 --> PartialSeqScan on rel1 (num_workers = 4)
 --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3 (num_workers = 1)
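The pull-up above amounts to flattening nested Funnels into a single
child list while keeping each child's worker count. A minimal sketch
under assumed, simplified plan structs (hypothetical names, not the
real planner API):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAXCHILDREN 8

/* Simplified stand-ins for plan nodes; the real planner works on
 * Path/Plan trees, so these types are illustrative only. */
typedef enum { NODE_FUNNEL, NODE_SCAN } NodeTag;

typedef struct Node
{
    NodeTag      tag;
    const char  *relname;       /* valid for scans */
    int          num_workers;
    struct Node *children[MAXCHILDREN];
    int          nchildren;
} Node;

/* Pull up any Funnel found among an Append's children: its scans
 * join the flattened child list with their worker counts intact,
 * while a plain scan is adopted with num_workers = 1 (it will be
 * executed by the backend itself). */
static void
flatten_into_funnel(Node *result, Node **append_children, int n)
{
    result->tag = NODE_FUNNEL;
    result->relname = NULL;
    result->num_workers = 0;
    result->nchildren = 0;

    for (int i = 0; i < n; i++)
    {
        Node *c = append_children[i];

        if (c->tag == NODE_FUNNEL)
        {
            for (int j = 0; j < c->nchildren; j++)
                result->children[result->nchildren++] = c->children[j];
        }
        else
        {
            c->num_workers = 1;
            result->children[result->nchildren++] = c;
        }
    }
}
```

Applied to the tree above, the two Funnels over rel1 and rel2 vanish
and their PartialSeqScans hang directly under one Funnel, with the
plain SeqScan on rel3 assigned a single worker.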
We also need to consider whether Funnel should have a capability
equivalent to MergeAppend, even though parallel sorting is
a fantastic challenge.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei Kaigai
Sent: Monday, July 27, 2015 11:07 PM
To: Amit Kapila
Cc: pgsql-hackers@postgresql.org; Robert Haas; Kyotaro HORIGUCHI
Subject: Re: [HACKERS] [DESIGN] ParallelAppend

On Sun, Jul 26, 2015 at 8:43 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

* Plan & Cost

ParallelAppend shall appear where Append can appear, except for the usage
for dummy. So, I'll enhance set_append_rel_pathlist() to add both an
AppendPath and a ParallelAppendPath, with a cost for each.

Is there a real need to have a new node like ParallelAppendPath?
Can't we have a Funnel node beneath the Append node, and then each
worker will be responsible for a SeqScan on each inherited child
relation? Something like:

Append
 ---> Funnel
   --> SeqScan rel1
   --> SeqScan rel2

If Funnel can handle both horizontal and vertical parallelism,
it is a great simplification. I don't insist on a new node.

Once Funnel gets the capability to have multiple child nodes, probably,
the Append node above will be gone. I expect set_append_rel_pathlist() to
add two paths, based on Append and Funnel, and then the planner will
choose the cheaper one according to its cost.
In the latest v16 patch, Funnel is declared as follows:

typedef struct Funnel
{
    Scan        scan;
    int         num_workers;
} Funnel;
If we try to add the Append capability here, I expect the structure will
be adjusted as follows, for example:

typedef struct Funnel
{
    Scan        scan;
    List       *funnel_plans;
    List       *funnel_num_workers;
} Funnel;
As the names suggest, funnel_plans saves the underlying Plan nodes
instead of the lefttree, and funnel_num_workers saves the number of
workers expected to be assigned to the individual child plans.
Even though create_parallelscan_paths() in v16 sets num_workers no
larger than parallel_seqscan_degree, the total number of concurrent
background workers may exceed this configuration if two or more
PartialSeqScan nodes are underlying. It is a separate configuration
from max_worker_processes, so it is not a problem as long as we have
another restriction.
However, how do we control the cap on the number of worker processes
per "appendable" Funnel node? For example, if a parent table has 200
child tables but max_worker_processes is configured to 50, it is
obviously impossible to launch all the background workers
simultaneously. One idea I have is to suspend the launch of some plans
until earlier ones are completed.
We will also need to pay attention to other issues that arise when
Funnel kicks off background workers towards asymmetric relations.

If the numbers of rows of the individual child nodes vary, we may want
to assign 10 background workers to scan rel1 with PartialSeqScan. On the
other hand, rel2 may have a very small number of rows, so its total_cost
may be smaller than the cost of launching a worker. In this case, Funnel
has child nodes to be executed both asynchronously and synchronously.

If the cheapest path of a child relation is a pair of Funnel and
PartialSeqScan, we have to avoid stacking Funnel nodes. Probably, a
Funnel node that performs like Append needs to pull up the underlying
Funnel and assign an equivalent number of workers, as follows:

Append
 --> Funnel
   --> PartialSeqScan on rel1 (num_workers = 4)
 --> Funnel
   --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3

shall be rewritten to

Funnel
 --> PartialSeqScan on rel1 (num_workers = 4)
 --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3 (num_workers = 1)

We also need to consider whether Funnel will have a capability
equivalent to MergeAppend, even though parallel sorting is
a fantastic challenge.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

However, how do we control the cap on the number of worker processes
per "appendable" Funnel node? For example, if a parent table has 200
child tables but max_worker_processes is configured to 50, it is
obviously impossible to launch all the background workers
simultaneously. One idea I have is to suspend the launch of some plans
until earlier ones are completed.
Okay, but I think in that idea you need to re-launch the workers again
for a new set of relation scans, which could turn out to be costly. How
about designing some way where workers, after completing their assigned
work, check for a new set of tasks (which in this case would be to scan
a new relation) and then execute the same? I think in this way we can
achieve dynamic allocation of work and achieve maximum parallelism with
the available set of workers. We have achieved this in ParallelSeqScan
by scanning at the block level: once a worker finishes a block, it
checks for a new block to scan.
We will need to pay attention to other issues when Funnel kicks off
background workers towards asymmetric relations. If the numbers of rows
of individual child nodes vary, we may want to assign 10 background
workers to scan rel1 with PartialSeqScan. On the other hand, rel2 may
have a very small number of rows, so its total_cost may be smaller than
the cost of launching a worker. In this case, Funnel has child nodes to
be executed both asynchronously and synchronously.
I think this might turn out to be slightly tricky; for example, how do
we know, for a given size of relation, how many workers are sufficient?
Another way to look at dividing the work in this case could be in terms
of chunks of blocks: once a worker finishes its current set of blocks,
it should be able to get a new set of blocks to scan. So let us assume
we decide on a chunk size of 32 and the total number of blocks in the
whole inheritance hierarchy is 3200; then the max workers we should
allocate to this scan are 100, and if we have a parallel_seqscan_degree
less than that, we can use that many workers and then let them scan
32 blocks at a time.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
KaiGai-san,
On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:
Append
 --> Funnel
   --> PartialSeqScan on rel1 (num_workers = 4)
 --> Funnel
   --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3

shall be rewritten to

Funnel
 --> PartialSeqScan on rel1 (num_workers = 4)
 --> PartialSeqScan on rel2 (num_workers = 8)
 --> SeqScan on rel3 (num_workers = 1)
In the rewritten plan, are respective scans (PartialSeq or Seq) on rel1,
rel2 and rel3 asynchronous w.r.t each other? Or does each one wait for the
earlier one to finish? I would think the answer is no because then it
would not be different from the former case, right? Because the original
premise seems that (partitions) rel1, rel2, rel3 may be on different
volumes so parallelism across volumes seems like a goal of parallelizing
Append.
From my understanding of parallel seqscan patch, each worker's
PartialSeqScan asks for a block to scan using a shared parallel heap scan
descriptor that effectively keeps track of division of work among
PartialSeqScans in terms of blocks. What if we invent a PartialAppend
which each worker would run in the case of a parallelized Append? It would use
some kind of shared descriptor to pick a relation (Append member) to scan.
The shared structure could be the list of subplans including the mutex for
concurrency. It doesn't sound as effective as the proposed
ParallelHeapScanDescData does for PartialSeqScan, but anything more
granular might be complicated. For example, consider a (current_relation,
current_block) pair. If there are more workers than subplans/partitions,
then multiple workers might start working on the same relation after a
round-robin assignment of relations (but of course, a later worker would
start scanning from a later block in the same relation). I imagine that
might help with parallelism across volumes if that's the case. MergeAppend
parallelization might involve a bit more complication but may be feasible
with a PartialMergeAppend with slightly different kind of coordination
among workers. What do you think of such an approach?
Thanks,
Amit
On 27 July 2015 at 21:09, Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello, can I ask some questions?

I suppose we can take this as the analog of ParallelSeqScan. I can
see not so much distinction between Append(ParallelSeqScan) and
ParallelAppend(SeqScan). What difference is there between them?

If other nodes will have the same functionality as you mention at
the last of this proposal, it might be better that some part of
this feature is implemented as a part of the existing executor
itself, rather than as a dedicated additional node, just as my
asynchronous FDW execution patch partially does. (Although it
lacks the planner part and bgworker launching..) If that is the
case, it might be better that ExecProcNode is modified so that it
supports both in-process and inter-bgworker cases by a single API.

What do you think about this?
I have to say that I really like the thought of us having parallel enabled
stuff in Postgres, but I also have to say that I don't think inventing all
these special parallel node types is a good idea. If we think about
everything that we can parallelise...
Perhaps.... sort, hash join, seqscan, hash, bitmap heap scan, nested loop.
I don't want to debate that, but perhaps there's more, perhaps less.
Are we really going to duplicate all of the code and add in the parallel
stuff as new node types?
My other concern here is that I seldom hear people talk about the planner's
architectural lack of ability to make a good choice about how many parallel
workers to choose. Surely to properly calculate costs you need to know the
exact number of parallel workers that will be available at execution time,
but you need to know this at planning time!? I can't see how this works,
apart from just being very conservative about parallel workers, which I
think is really bad, as many databases have busy times in the day, and also
quiet times, generally quiet time is when large batch stuff gets done, and
that's the time that parallel stuff is likely most useful. Remember queries
are not always planned just before they're executed. We could have a
PREPAREd query, or we could have better plan caching in the future, or if
we build some intelligence into the planner to choose a good number of
workers based on the current server load, then what's to say that the
server will be under this load at exec time? If we plan during a quiet
time, and exec in a busy time all hell may break loose.
I really do think that existing nodes should just be initialized in a
parallel mode, and each node type can have a function to state if it
supports parallelism or not.
I'd really like to hear more opinions in the ideas I discussed here:
/messages/by-id/CAApHDvp2STf0=pQfpq+e7WA4QdYmpFM5qu_YtUpE7R0jLnH82Q@mail.gmail.com
This design makes use of the Funnel node that Amit has already made and
allows more than 1 node to be executed in parallel at once.
It appears that parallel enabling the executor node by node is
fundamentally locked into just 1 node being executed in parallel, then
perhaps a Funnel node gathering up the parallel worker buffers and
streaming those back in serial mode. I believe by design, this does not
permit a whole plan branch to execute in parallel, and I really feel
like doing things this way is going to be very hard to undo and improve
later. I might be too stupid to figure it out, but how would parallel hash
join work if it can't gather tuples from the inner and outer nodes in
parallel?
Sorry for the rant, but I just feel like we're painting ourselves into a
corner by parallel enabling the executor node by node.
Apologies if I've completely misunderstood things.
Regards
David Rowley
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 28, 2015 at 12:59 PM, David Rowley <david.rowley@2ndquadrant.com> wrote:

Sorry for the rant, but I just feel like we're painting ourselves into
a corner by parallel enabling the executor node by node.
+1, well articulated.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

or shall we have a node like above and name it as FunnelAppend or
AppendFunnel?
It is better to have a smaller number of node types capable of kicking
off background workers, for simplification of path construction.
Let's assume the case below, where the planner considers a path to
append child scans on rel1, rel2 and rel3, but the cheapest path for
rel2 is Funnel+PartialSeqScan. We cannot put a Funnel here unless we
pull up the Funnel of rel2, can we?

(Append? or Funnel)
 --> SeqScan on rel1
 --> Funnel
   --> PartialSeqScan on rel2
 --> IndexScan on rel3

If we pull up the Funnel here, I think the plan shall be as follows:

Funnel
 --> SeqScan on rel1
 --> PartialSeqScan on rel2
 --> IndexScan on rel3

If all we have to pay attention to is the Funnel node, it makes the
code around path construction and the pull-up logic much simpler than
having multiple node types that can kick off background workers.
Even though create_parallelscan_paths() in v16 sets num_workers no
larger than parallel_seqscan_degree, the total number of concurrent
background workers may exceed this configuration if two or more
PartialSeqScan nodes are underlying. It is a separate configuration
from max_worker_processes, so it is not a problem as long as we have
another restriction.
However, how do we control the cap on the number of worker processes
per "appendable" Funnel node? For example, if a parent table has 200
child tables but max_worker_processes is configured to 50, it is
obviously impossible to launch all the background workers
simultaneously. One idea I have is to suspend the launch of some plans
until earlier ones are completed.

Okay, but I think in that idea you need to re-launch the workers again
for a new set of relation scans, which could turn out to be costly. How
about designing some way where workers, after completing their assigned
work, check for a new set of tasks (which in this case would be to scan
a new relation) and then execute the same? I think in this way we can
achieve dynamic allocation of work and achieve maximum parallelism with
the available set of workers. We have achieved this in ParallelSeqScan
by scanning at the block level: once a worker finishes a block, it
checks for a new block to scan.
It is possible to put multiple PlannedStmts on the TOC, isn't it?
If a background worker picks up an uncompleted PlannedStmt first
(based on round-robin, likely?), it may achieve the maximum
parallelism. Yep, it seems to me a good idea which I want to try.
If (num of workers) > (num of sub-plans), some of the sub-plans can
have multiple workers from the beginning; then, other workers also
help to execute the heavy plans later.
It may be better to put the PlannedStmts in order of total_cost, to
bias multi-worker execution from the beginning.

TODO: even if a heavy query occupies most of the available worker
slots, another session may want to use parallel execution later,
during execution of the primary query. We may need to have a
'scoreboard' in shared memory to know how many workers are potentially
needed and how many are overused by somebody. If someone has
overconsumed background workers, it should exit first, rather than
picking up the next PlannedStmt.
We will need to pay attention to other issues when Funnel kicks off
background workers towards asymmetric relations. If the numbers of rows
of individual child nodes vary, we may want to assign 10 background
workers to scan rel1 with PartialSeqScan. On the other hand, rel2 may
have a very small number of rows, so its total_cost may be smaller than
the cost of launching a worker. In this case, Funnel has child nodes to
be executed both asynchronously and synchronously.

I think this might turn out to be slightly tricky; for example, how do
we know, for a given size of relation, how many workers are sufficient?
I expected a comparison between the total_cost of the sub-plan and a
threshold that represents the cost to kick off background workers.
However, I'm now inclined towards the above approach (multiple
PlannedStmts on the TOC, then picked up by background workers in
round-robin).
Another way to look at dividing the work in this case could be in terms
of chunks of blocks: once a worker finishes its current set of blocks,
it should be able to get a new set of blocks to scan. So let us assume
we decide on a chunk size of 32 and the total number of blocks in the
whole inheritance hierarchy is 3200; then the max workers we should
allocate to this scan are 100, and if we have a parallel_seqscan_degree
less than that, we can use that many workers and then let them scan
32 blocks at a time.
If we use the above multi-PlannedStmt approach, the TOC also needs a
counter to track how many background workers are running on a particular
PlannedStmt; then, if enough workers are already running on the
PlannedStmt, the next available worker will skip it (even under
round-robin), or just exit. Anyway, I think an infrastructure may be
needed to avoid overly aggressive parallel execution.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
KaiGai-san,

On 2015-07-27 PM 11:07, Kouhei Kaigai wrote:

From my understanding of parallel seqscan patch, each worker's
PartialSeqScan asks for a block to scan using a shared parallel heap scan
descriptor that effectively keeps track of division of work among
PartialSeqScans in terms of blocks. What if we invent a PartialAppend
which each worker would run in case of a parallelized Append. It would use
some kind of shared descriptor to pick a relation (Append member) to scan.
The shared structure could be the list of subplans including the mutex for
concurrency. It doesn't sound as effective as proposed
ParallelHeapScanDescData does for PartialSeqScan but any more granular
might be complicated. For example, consider (current_relation,
current_block) pair. If there are more workers than subplans/partitions,
then multiple workers might start working on the same relation after a
round-robin assignment of relations (but of course, a later worker would
start scanning from a later block in the same relation). I imagine that
might help with parallelism across volumes if that's the case.
I initially thought that ParallelAppend kicks off a fixed number of
background workers towards the sub-plans, according to the cost
estimated at the planning stage. However, I'm now inclined to have a
background worker pick up an uncompleted PlannedStmt first. (For more
details, please see the reply to Amit Kapila.) It looks like a less
fine-grained distribution of the workers' jobs.
Once the number of workers gets larger than the number of volumes /
partitions, two or more workers begin to be assigned to the same
PartialSeqScan, and thus the fine-grained job distribution using the
shared parallel heap scan takes over.

MergeAppend
parallelization might involve a bit more complication but may be feasible
with a PartialMergeAppend with slightly different kind of coordination
among workers. What do you think of such an approach?
Do we need to have something special for ParallelMergeAppend?
If the individual child nodes are designed to return sorted results,
what we have to do seems to me to be the same.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
David Rowley <david.rowley@2ndquadrant.com> wrote:

My other concern here is that I seldom hear people talk about the planner's
architectural lack of ability to make a good choice about how many parallel workers
to choose. Surely to properly calculate costs you need to know the exact number
of parallel workers that will be available at execution time, but you need to
know this at planning time!? I can't see how this works, apart from just being
very conservative about parallel workers, which I think is really bad, as many
databases have busy times in the day, and also quiet times, generally quiet time
is when large batch stuff gets done, and that's the time that parallel stuff is
likely most useful. Remember queries are not always planned just before they're
executed. We could have a PREPAREd query, or we could have better plan caching
in the future, or if we build some intelligence into the planner to choose a good
number of workers based on the current server load, then what's to say that the
server will be under this load at exec time? If we plan during a quiet time, and
exec in a busy time all hell may break loose.
Even though it is not easy to estimate the available workers at
planning time, it might be possible to define a "target" number of
workers to run. If a Funnel cannot get enough workers to meet its
target, my preference is to tell the other workers (via a scoreboard?)
not to pick up their next PlannedStmt and to exit, whenever another
Funnel cannot launch enough workers.
I might be too stupid to figure it out, but how would parallel hash
join work if it can't gather tuples from the inner and outer nodes in
parallel?
Hash-Join and Nest-Loop should not have a PartialSeqScan on the inner
side, but the outer side can be a PartialSeqScan under the Funnel node.
In the case of Hash-Join, a SeqScan on the inner side loads all the
tuples (*1) into the hash table once; then the records coming from the
outer side are combined with the hash table. Even though the inner side
is read redundantly, the advantage of parallel join will win as long as
the inner side is small enough. This assumption holds for the usual
pairing of master tables (small) with a fact table (big).

(*1) Our colleague is now working on this feature. It enables dropping
unnecessary rows under partitioned tables, so we may not need to have
the entire hash table for each background worker.
/messages/by-id/9A28C8860F777E439AA12E8AEA7694F8010F672B@BPXM15GP.gisp.nec.co.jp
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
KaiGai-san,
On 2015-07-28 PM 09:58, Kouhei Kaigai wrote:
From my understanding of parallel seqscan patch, each worker's
PartialSeqScan asks for a block to scan using a shared parallel heap scan
descriptor that effectively keeps track of division of work among
PartialSeqScans in terms of blocks. What if we invent a PartialAppend
which each worker would run in case of a parallelized Append. It would use
some kind of shared descriptor to pick a relation (Append member) to scan.
The shared structure could be the list of subplans including the mutex for
concurrency. It doesn't sound as effective as proposed
ParallelHeapScanDescData does for PartialSeqScan but any more granular
might be complicated. For example, consider (current_relation,
current_block) pair. If there are more workers than subplans/partitions,
then multiple workers might start working on the same relation after a
round-robin assignment of relations (but of course, a later worker would
start scanning from a later block in the same relation). I imagine that
might help with parallelism across volumes if that's the case.
I initially thought ParallelAppend kicks off a fixed number of background
workers towards the sub-plans, according to the cost estimated at the
planning stage. However, I'm now inclined to have each background worker
pick up an uncompleted PlannedStmt first. (For more details, please see
the reply to Amit Kapila.) This gives a coarser-grained distribution of
the workers' jobs.
Once the number of workers gets larger than the number of volumes /
partitions, more than two workers begin to be assigned to the same
PartialSeqScan, and from there fine-grained job distribution takes over
via the shared parallel heap scan.
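To make the worker-side pickup concrete, here is a minimal, illustrative C
sketch of such a shared descriptor with round-robin pickup. All names here
(ParallelAppendDesc, pa_next_subplan) are hypothetical, not from the patch;
the real structure would live in dynamic shared memory under a lock, which
is omitted for brevity:

```c
#include <assert.h>

/* Hypothetical shared descriptor for an appendable Funnel.  In a real
 * implementation this would live in dynamic shared memory and be
 * protected by a lock; locking is omitted here for brevity. */
typedef struct ParallelAppendDesc
{
    int  nplans;      /* number of sub-plans (Append members) */
    int  next_plan;   /* round-robin cursor */
    int  remaining;   /* sub-plans not yet completed */
    int *completed;   /* per-sub-plan completion flags */
} ParallelAppendDesc;

/* Pick the next uncompleted sub-plan in round-robin order; returns -1
 * when every sub-plan is finished. */
static int
pa_next_subplan(ParallelAppendDesc *desc)
{
    int result;

    if (desc->remaining <= 0)
        return -1;
    while (desc->completed[desc->next_plan % desc->nplans])
        desc->next_plan++;
    result = desc->next_plan % desc->nplans;
    desc->next_plan++;          /* next caller starts at the following plan */
    return result;
}

/* Mark a sub-plan as done so no further workers are assigned to it. */
static void
pa_mark_completed(ParallelAppendDesc *desc, int plan)
{
    if (!desc->completed[plan])
    {
        desc->completed[plan] = 1;
        desc->remaining--;
    }
}
```

With three sub-plans, successive workers receive plans 0, 1, 2, then wrap
around to 0 again, which is exactly the coarse-then-fine behavior described
above.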
I like your idea of using round-robin assignment of partial/non-partial
sub-plans to workers. Do you think there are two considerations of cost
here: sub-plans themselves could have parallel paths to consider and (I
think) your proposal introduces a new consideration - a plain old
synchronous Append path vs. parallel asynchronous Append with Funnel
(below/above?) it. I guess the asynchronous version would always be
cheaper. So, even if we end up with non-parallel sub-plans do we still add
a Funnel to make Append asynchronous? Am I missing something?
MergeAppend
parallelization might involve a bit more complication but may be feasible
with a PartialMergeAppend with slightly different kind of coordination
among workers. What do you think of such an approach?
Do we need to have something special in ParallelMergeAppend?
If individual child nodes are designed to return sorted results,
what we have to do seems to me to be the same.
Sorry, I was wrongly worried because I did not really know that
MergeAppend uses a binaryheap to store tuples before returning.
Thanks,
Amit
On 2015-07-28 PM 09:58, Kouhei Kaigai wrote:
I like your idea of using round-robin assignment of partial/non-partial
sub-plans to workers. Do you think there are two considerations of cost
here: sub-plans themselves could have parallel paths to consider and (I
think) your proposal introduces a new consideration - a plain old
synchronous Append path vs. parallel asynchronous Append with Funnel
(below/above?) it. I guess the asynchronous version would always be
cheaper. So, even if we end up with non-parallel sub-plans do we still add
a Funnel to make Append asynchronous? Am I missing something?
I expect Funnel itself will get Append capability and run sub-plans in
background workers, to simplify path construction. So, if a Funnel with
multiple sub-plans has a cheaper cost than Append, it will replace the
AppendPath with a FunnelPath.
Regarding the cost estimation, I don't think the parallel version is always
cheaper than the traditional Append, because of the cost to launch background
workers. It increases the startup cost to process the relation; thus, if the
upper node prefers a small startup cost (like Limit), the traditional Append
still has advantages.
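That startup-cost trade-off can be sketched with toy cost arithmetic. The
functions and numbers below are purely illustrative, assuming a hypothetical
per-worker parallel_setup_cost; they only mimic the shape of the comparison,
not the actual costing code:

```c
#include <assert.h>

/* Illustrative-only path costs; not the actual planner structures. */
typedef struct { double startup_cost; double total_cost; } SketchPath;

/* Plain Append: children start immediately, run one after another. */
static SketchPath
cost_append_sketch(double subplan_total, int nplans)
{
    SketchPath p;
    p.startup_cost = 0.0;
    p.total_cost = subplan_total * nplans;
    return p;
}

/* Appendable Funnel: pays to launch workers first, but the children run
 * concurrently, so the run cost roughly divides by the worker count. */
static SketchPath
cost_funnel_sketch(double subplan_total, int nplans, int nworkers,
                   double parallel_setup_cost)
{
    SketchPath p;
    p.startup_cost = parallel_setup_cost * nworkers;
    p.total_cost = p.startup_cost + (subplan_total * nplans) / nworkers;
    return p;
}
```

A node like Limit compares startup costs and so prefers the Append, while a
full scan compares total costs and prefers the Funnel, matching the point
made above.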
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On 2015-07-29 AM 11:02, Kouhei Kaigai wrote:
...
Regarding the cost estimation, I don't think the parallel version is always
cheaper than the traditional Append, because of the cost to launch background
workers. It increases the startup cost to process the relation; thus, if the
upper node prefers a small startup cost (like Limit), the traditional Append
still has advantages.
Right, I almost forgot about the start-up cost.
Thanks,
Amit
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com>
wrote:
-----Original Message-----
From: pgsql-hackers-owner@postgresql.org
[mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Kouhei
Kaigai
Sent: Monday, July 27, 2015 11:07 PM
To: Amit Kapila
Is there a real need to have a new node like ParallelAppendPath?
Can't we have Funnel node beneath AppendNode and then each
worker will be responsible to have SeqScan on each inherited child
relation. Something like:
Append
---> Funnel
--> SeqScan rel1
--> SeqScan rel2
If Funnel can handle both horizontal and vertical parallelism,
it is a great simplification. I don't insist on a new node.
Once Funnel gets the capability to have multiple child nodes, the
Append node above will probably be gone. I expect set_append_rel_pathlist()
to add two paths based on Append and Funnel, then the planner will choose
the cheaper one according to its cost.
In the latest v16 patch, Funnel is declared as follows:

typedef struct Funnel
{
    Scan    scan;
    int     num_workers;
} Funnel;

If we try to add Append capability here, I expect the structure will be
adjusted as follows, for example:

typedef struct Funnel
{
    Scan    scan;
    List   *funnel_plans;
    List   *funnel_num_workers;
} Funnel;

As the names suggest, funnel_plans saves the underlying Plan nodes instead
of the lefttree. Also, funnel_num_workers saves the number of workers
expected to be assigned to each individual child plan.
Or shall we have a node like the above and name it FunnelAppend or
AppendFunnel?
It is better to have a smaller number of node types capable of kicking off
background workers, for simplification of path construction.
Let's assume the case below. When the planner considers a path to append
child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is
Funnel+PartialSeqScan, we cannot put a Funnel here unless we pull
up the Funnel of rel2, can we?

(Append? or Funnel)
--> SeqScan on rel1
--> Funnel
--> PartialSeqScan on rel2
--> IndexScan on rel3
I am not sure, but what problem do you see in putting Funnel node
for one of the relation scans and not for the others?
If we pull Funnel here, I think the plan shall be as follows:
Funnel
--> SeqScan on rel1
--> PartialSeqScan on rel2
--> IndexScan on rel3
So if we go this route, then Funnel should have capability
to execute non-parallel part of plan as well, like in this
case it should be able to execute non-parallel IndexScan on
rel3 as well and then it might need to distinguish between
parallel and non-parallel part of plans. I think this could
make Funnel node complex.
If the Funnel node is all we have to pay attention to, it makes the code
around path construction and pull-up logic much simpler than having
multiple node types that can kick off background workers.
Okay, but I think pulling up the Funnel node makes sense only when all
nodes beneath it need to be executed in parallel.
Even though create_parallelscan_paths() in v16 sets num_workers no
larger than parallel_seqscan_degree, the total number of concurrent
background workers may exceed this configuration if two or more
PartialSeqScan nodes are underlying.
It is a different configuration from max_worker_processes, so it is
not a problem as long as we have another restriction.
However, how do we control the cap on the number of worker processes per
"appendable" Funnel node? For example, suppose a parent table has 200
child tables but max_worker_processes is configured to 50.
It is obviously impossible to launch all the background workers
simultaneously. One idea I have is to suspend the launch of some plans
until earlier ones are completed.
Okay, but I think with that idea you need to re-launch the workers again
for each new set of relation scans, which could turn out to be costly. How
about designing some way where workers, after completing their assigned
work, check for a new set of tasks (which in this case would be to scan a
new relation) and then execute the same? I think in this way we can achieve
dynamic allocation of work and achieve maximum parallelism with the
available set of workers. We have achieved this in ParallelSeqScan by
scanning at the block level: once a worker finishes a block, it checks for
a new block to scan.
It is possible to put multiple PlannedStmts on the TOC, isn't it?
Yes, I don't see any problem in doing that way. So here for
each different (child) relation, you want to create a separate
PlannedStmt or do you have something else in mind?
If a background worker picks up an uncompleted PlannedStmt first
(likely based on round-robin?), it may achieve the maximum
parallelism.
I think this can work well for the cases when there is an insufficient
number of workers to execute the different planned statements.
Yep, it seems to me a good idea which I want to try.
If (num of workers) > (num of sub-plans), some of the sub-plans can
have multiple workers from the beginning; then, other workers
also help to execute the heavy plans later.
It may be better to put the PlannedStmts in order of total_cost, to
bias multi-worker execution toward the heavy plans from the beginning.
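The cost-ordering idea can be sketched as follows; SubPlanEntry and the
comparator name are hypothetical, assuming each sub-plan's total_cost is
known when the PlannedStmts are published:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical per-sub-plan entry as it might be published to the TOC. */
typedef struct SubPlanEntry
{
    int     plan_id;
    double  total_cost;
} SubPlanEntry;

/* qsort comparator: descending total_cost, so that the heaviest sub-plans
 * are picked up (and accumulate extra workers) first. */
static int
subplan_cost_cmp(const void *a, const void *b)
{
    const SubPlanEntry *pa = (const SubPlanEntry *) a;
    const SubPlanEntry *pb = (const SubPlanEntry *) b;

    if (pa->total_cost > pb->total_cost) return -1;
    if (pa->total_cost < pb->total_cost) return 1;
    return 0;
}
```

Sorting once before publishing keeps the per-worker pickup loop trivially
simple; the traversal overhead Amit mentions below is paid only once.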
Yeah, that might be better, but I think for doing so you might
need to traverse each child plan and compare their costs while
constructing multiple planned statements, which might incur some
overhead when the number of plans is large. However, OTOH this cost
should be much smaller compared to starting up workers, so
it should probably be okay.
TODO: Even if a heavy query occupies most of the available worker slots,
another session may want to use parallel execution later, during the
execution of the primary query. We may need to have a 'scoreboard'
on shared memory to know how many workers are potentially needed
and how many are overused by somebody. If someone has overconsumed
background workers, it should exit first, rather than picking up
the next PlannedStmt.
Actually, distribution of workers among parallel queries is a very
tricky problem and I think we have to keep on working on it till
we get some good solution for it.
Another way to look at dividing the work in this case could be in terms of
chunks of blocks: once a worker finishes its current set of blocks, it
should be able to get a new set of blocks to scan. So let us assume we
decide the chunk size as 32, and the total number of blocks in the whole
inheritance hierarchy is 3200; then the max workers we should allocate to
this scan are 100, and if we have a parallel_seqscan_degree less than that,
we can use that many workers and then let them scan 32 blocks at a time.
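The chunk-of-blocks idea can be sketched with a single shared cursor; the
names (ChunkScanState, next_chunk) are hypothetical, and a C11 atomic stands
in for whatever shared-memory mechanism the real patch would use:

```c
#include <assert.h>
#include <stdatomic.h>

#define CHUNK_SIZE 32      /* blocks handed out per request, as in the example */

/* Hypothetical shared scan state: one atomic cursor over the blocks of the
 * whole inheritance hierarchy.  Real code would live in shared memory. */
typedef struct ChunkScanState
{
    atomic_int next_block;   /* next unassigned block number */
    int        nblocks;      /* total blocks across all children */
} ChunkScanState;

/* Grab the next chunk: returns its first block and sets *len_out to the
 * chunk length (possibly short at the tail), or -1 when exhausted. */
static int
next_chunk(ChunkScanState *st, int *len_out)
{
    int start = atomic_fetch_add(&st->next_block, CHUNK_SIZE);

    if (start >= st->nblocks)
        return -1;
    *len_out = (start + CHUNK_SIZE <= st->nblocks)
        ? CHUNK_SIZE : st->nblocks - start;
    return start;
}
```

With 3200 blocks and a chunk size of 32 this hands out exactly 100 chunks,
matching the arithmetic above; any number of workers can drain the cursor
without re-launching.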
If we use the above multi-PlannedStmt approach, the TOC also needs to have
a counter to track how many background workers are running on a particular
PlannedStmt; then, if enough workers are already running on a PlannedStmt,
should the next available worker skip this PlannedStmt (even under
round-robin) or just exit?
I think for a particular PlannedStmt, the number of workers must have
been decided before the start of execution, so if that many workers are
available to work on that particular PlannedStmt, then the next/new
worker should work on the next PlannedStmt.
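A minimal sketch of that pickup rule, with a per-statement worker counter
and cap; StmtSlot and pick_planned_stmt are hypothetical names, and the
locking a real shared-memory implementation would need is omitted:

```c
#include <assert.h>

/* Hypothetical per-PlannedStmt bookkeeping on the TOC segment.  A
 * non-parallel-aware sub-plan would simply have max_workers = 1. */
typedef struct StmtSlot
{
    int nworkers_active;   /* workers currently executing this stmt */
    int max_workers;       /* planner-decided cap for this stmt */
    int completed;
} StmtSlot;

/* Return the index of the next statement this worker should join, or -1
 * to tell the worker to exit because everything is saturated or done.
 * A real implementation would take a lock around this; omitted here. */
static int
pick_planned_stmt(StmtSlot *slots, int nstmts)
{
    int i;

    for (i = 0; i < nstmts; i++)
    {
        if (!slots[i].completed &&
            slots[i].nworkers_active < slots[i].max_workers)
        {
            slots[i].nworkers_active++;
            return i;
        }
    }
    return -1;
}
```

Under this rule a saturated statement is simply skipped rather than making
the worker exit, which keeps workers busy while still honoring each
statement's cap.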
Anyway, I think an infrastructure may be needed to avoid too aggressive
parallel execution.
Yes, I think we need some infrastructure for workers if we have
to follow the design discussed above.
So I think we have three main parts to work for this patch.
1. Allocation of work among workers which needs some different
mechanism than ParallelSeqScan Patch.
2. Execution of work by workers and Funnel node and then pass
the results back to upper node. I think this needs some more
work in addition to ParallelSeqScan patch.
3. Generation of parallel plan for Append node needs somewhat
different mechanism as we might want to have some additional
logic for transformation of nodes.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On Tue, Jul 28, 2015 at 7:59 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
Is there a real need to have a new node like ParallelAppendPath?
Can't we have Funnel node beneath AppendNode and then each
worker will be responsible to have SeqScan on each inherited child
relation. Something like:
Append
---> Funnel
--> SeqScan rel1
--> SeqScan rel2
If Funnel can handle both horizontal and vertical parallelism,
it is a great simplification. I don't insist on a new node.
Once Funnel gets the capability to have multiple child nodes, the
Append node above will probably be gone. I expect set_append_rel_pathlist()
to add two paths based on Append and Funnel, then the planner will choose
the cheaper one according to its cost.
In the latest v16 patch, Funnel is declared as follows:

typedef struct Funnel
{
    Scan    scan;
    int     num_workers;
} Funnel;

If we try to add Append capability here, I expect the structure will be
adjusted as follows, for example:

typedef struct Funnel
{
    Scan    scan;
    List   *funnel_plans;
    List   *funnel_num_workers;
} Funnel;

As the names suggest, funnel_plans saves the underlying Plan nodes instead
of the lefttree. Also, funnel_num_workers saves the number of workers
expected to be assigned to each individual child plan.
Or shall we have a node like the above and name it FunnelAppend or
AppendFunnel?
It is better to have a smaller number of node types capable of kicking off
background workers, for simplification of path construction.
Let's assume the case below. When the planner considers a path to append
child scans on rel1, rel2 and rel3 but the cheapest path of rel2 is
Funnel+PartialSeqScan, we cannot put a Funnel here unless we pull
up the Funnel of rel2, can we?

(Append? or Funnel)
--> SeqScan on rel1
--> Funnel
--> PartialSeqScan on rel2
--> IndexScan on rel3
I am not sure, but what problem do you see in putting the Funnel node
for one of the relation scans and not for the others?
At this moment, I'm not certain whether a background worker can or ought
to launch other background workers.
If a sub-Funnel node is executed by 10 processes and each also launches
10 processes, will 100 processes run in total?
If we pull Funnel here, I think the plan shall be as follows:
Funnel
--> SeqScan on rel1
--> PartialSeqScan on rel2
--> IndexScan on rel3
So if we go this route, then Funnel should have capability
to execute non-parallel part of plan as well, like in this
case it should be able to execute non-parallel IndexScan on
rel3 as well and then it might need to distinguish between
parallel and non-parallel part of plans. I think this could
make Funnel node complex.
That is different from what I plan now. In the above example, the
Funnel node has two non-parallel-aware nodes (rel1 and rel3)
and one parallel-aware node, so three PlannedStmts, one for each,
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 or rel3; however, rel2
can be executed by multiple workers simultaneously.
(Note: if the number of workers is less than three in this case,
the PlannedStmt for rel3 shall not be picked up until some other
worker completes running the plan on rel1 or rel2.)
From the standpoint of the Funnel, it just kicks background
workers with:
- multiple PlannedStmt nodes
- maximum number of workers for each plan
in addition to the current form.
Then, it continues to fetch records from the shm_mq.
Probably, it does not change the current form so much.
If the Funnel node is all we have to pay attention to, it makes the code
around path construction and pull-up logic much simpler than having
multiple node types that can kick off background workers.
Okay, but I think pulling up the Funnel node makes sense only when all
nodes beneath it need to be executed in parallel.
I think this decision should be based on the cost, which includes the
additional startup_cost to launch a background worker, as long as a
non-parallel node is also capable of running on the worker side.
Even though create_parallelscan_paths() in v16 sets num_workers no
larger than parallel_seqscan_degree, the total number of concurrent
background workers may exceed this configuration if two or more
PartialSeqScan nodes are underlying.
It is a different configuration from max_worker_processes, so it is
not a problem as long as we have another restriction.
However, how do we control the cap on the number of worker processes per
"appendable" Funnel node? For example, suppose a parent table has 200
child tables but max_worker_processes is configured to 50.
It is obviously impossible to launch all the background workers
simultaneously. One idea I have is to suspend the launch of some plans
until earlier ones are completed.
Okay, but I think with that idea you need to re-launch the workers again
for each new set of relation scans, which could turn out to be costly. How
about designing some way where workers, after completing their assigned
work, check for a new set of tasks (which in this case would be to scan a
new relation) and then execute the same? I think in this way we can achieve
dynamic allocation of work and achieve maximum parallelism with the
available set of workers. We have achieved this in ParallelSeqScan by
scanning at the block level: once a worker finishes a block, it checks for
a new block to scan.
It is possible to put multiple PlannedStmts on the TOC, isn't it?
Yes, I don't see any problem in doing that way. So here for
each different (child) relation, you want to create a separate
PlannedStmt or do you have something else in mind?
I plan to create a separate PlannedStmt for each sub-plan, then
a background worker will focus on a particular PlannedStmt until
it completes the current focused one.
If a background worker picks up an uncompleted PlannedStmt first
(likely based on round-robin?), it may achieve the maximum parallelism.
I think this can work well for the cases when there is an insufficient
number of workers to execute the different planned statements.
Yep, it is the biggest reason why I like this design better than what
I initially proposed: a fixed number of workers for each sub-plan.
Yep, it seems to me a good idea which I want to try.
If (num of workers) > (num of sub-plans), some of the sub-plans can
have multiple workers from the beginning; then, other workers
also help to execute the heavy plans later.
It may be better to put the PlannedStmts in order of total_cost, to
bias multi-worker execution toward the heavy plans from the beginning.
Yeah, that might be better, but I think for doing so you might
need to traverse each child plan and compare their costs while
constructing multiple planned statements, which might incur some
overhead when the number of plans is large. However, OTOH this cost
should be much smaller compared to starting up workers, so
it should probably be okay.
Yep. If we have to execute thousands of child plans, their execution
cost is relatively large, not only the planning cost. :-)
TODO: Even if a heavy query occupies most of the available worker slots,
another session may want to use parallel execution later, during the
execution of the primary query. We may need to have a 'scoreboard'
on shared memory to know how many workers are potentially needed
and how many are overused by somebody. If someone has overconsumed
background workers, it should exit first, rather than picking up
the next PlannedStmt.
Actually, distribution of workers among parallel queries is a very
tricky problem and I think we have to keep on working on it till
we get some good solution for it.
I agree. Even if the initial version adopts a simple solution, we can
improve the logic based on our experience.
Another way to look at dividing the work in this case could be in terms of
chunks of blocks: once a worker finishes its current set of blocks, it
should be able to get a new set of blocks to scan. So let us assume we
decide the chunk size as 32, and the total number of blocks in the whole
inheritance hierarchy is 3200; then the max workers we should allocate to
this scan are 100, and if we have a parallel_seqscan_degree less than that,
we can use that many workers and then let them scan 32 blocks at a time.
If we use the above multi-PlannedStmt approach, the TOC also needs to have
a counter to track how many background workers are running on a particular
PlannedStmt; then, if enough workers are already running on a PlannedStmt,
should the next available worker skip this PlannedStmt (even under
round-robin) or just exit?
I think for a particular PlannedStmt, the number of workers must have
been decided before start of execution, so if those many workers are
available to work on that particular PlannedStmt, then next/new
worker should work on next PlannedStmt.
My concern about a pre-determined number of workers is that it depends on
the run-time circumstances of concurrent sessions. Even if the planner
wants to assign 10 workers to a particular sub-plan, only 4 workers may be
available at run time because of consumption by other sessions.
So, I expect only the maximum number of workers is a meaningful
configuration.
Anyway, I think an infrastructure may be needed to avoid too aggressive
parallel execution.
Yes, I think we need some infrastructure for workers if we have
to follow the design discussed above.
So I think we have three main parts to work on for this patch.
1. Allocation of work among workers which needs some different
mechanism than ParallelSeqScan Patch.
Yes, I expect to extend the format of the TOC to store multiple PlannedStmt
nodes and state information for each node, like PartialSeqScanState.
2. Execution of work by workers and Funnel node and then pass
the results back to upper node. I think this needs some more
work in addition to ParallelSeqScan patch.
I expect we can utilize existing infrastructure here. It just picks
up the records coming from the underlying workers, then raises them to
the upper node.
3. Generation of a parallel plan for the Append node needs a somewhat
different mechanism, as we might want to have some additional
logic for transformation of nodes.
I expect set_append_rel_pathlist() is the best location to add a
FunnelPath in addition to an AppendPath. If its cost is more attractive
than the AppendPath, the planner will pick it up.
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com>
wrote:
I am not sure, but what problem do you see in putting the Funnel node
for one of the relation scans and not for the others?
At this moment, I'm not certain whether a background worker can or ought
to launch other background workers.
If a sub-Funnel node is executed by 10 processes and each also launches
10 processes, will 100 processes run in total?
Yes, that could be more work than currently, but what I had in mind
is not that way; rather, I was thinking that the master backend will only
kick off workers for Funnel nodes in the plan.
If we pull Funnel here, I think the plan shall be as follows:
Funnel
--> SeqScan on rel1
--> PartialSeqScan on rel2
--> IndexScan on rel3
So if we go this route, then Funnel should have capability
to execute non-parallel part of plan as well, like in this
case it should be able to execute non-parallel IndexScan on
rel3 as well and then it might need to distinguish between
parallel and non-parallel part of plans. I think this could
make Funnel node complex.
That is different from what I plan now. In the above example, the
Funnel node has two non-parallel-aware nodes (rel1 and rel3)
and one parallel-aware node, so three PlannedStmts, one for each,
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 or rel3; however, rel2
can be executed by multiple workers simultaneously.
Okay, now I get your point, but I think the cost of executing a
non-parallel node in an additional worker is not small, considering
the communication cost and setting up an additional worker for
each sub-plan (assume the case where, out of 100 child nodes,
only a few (2 or 3) nodes actually need multiple workers).
I think for a particular PlannedStmt, the number of workers must have
been decided before the start of execution, so if that many workers are
available to work on that particular PlannedStmt, then the next/new
worker should work on the next PlannedStmt.
My concern about a pre-determined number of workers is that it depends on
the run-time circumstances of concurrent sessions. Even if the planner
wants to assign 10 workers to a particular sub-plan, only 4 workers may be
available at run time because of consumption by other sessions.
So, I expect only the maximum number of workers is a meaningful
configuration.
In that case, there is a possibility that many of the workers are just
working on one or two of the nodes and the execution of other nodes might
get starved. I understand this is a tricky problem of allocating the number
of workers for different nodes; however, we should try to develop an
algorithm where there is some degree of fairness in the allocation of
workers for different nodes.
2. Execution of work by workers and Funnel node and then pass
the results back to upper node. I think this needs some more
work in addition to the ParallelSeqScan patch.
I expect we can utilize existing infrastructure here. It just picks
up the records coming from the underlying workers, then raises them to
the upper node.
Sure, but you still need some work, at least in the area of making
workers understand different node types; I am guessing you need
to modify readfuncs.c to support any new plan node needed for this
work.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
On Tue, Jul 28, 2015 at 6:08 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
I am not sure, but what problem do you see in putting the Funnel node
for one of the relation scans and not for the others?
At this moment, I'm not certain whether a background worker can or ought
to launch other background workers.
If a sub-Funnel node is executed by 10 processes and each also launches
10 processes, will 100 processes run in total?
Yes, that could be more work than currently, but what I had in mind
is not that way; rather, I was thinking that the master backend will only
kick off workers for Funnel nodes in the plan.
I agree; it is a fair enough approach, which is why I mentioned pulling
up the Funnel node.
If we pull Funnel here, I think the plan shall be as follows:
Funnel
--> SeqScan on rel1
--> PartialSeqScan on rel2
--> IndexScan on rel3
So if we go this route, then Funnel should have capability
to execute non-parallel part of plan as well, like in this
case it should be able to execute non-parallel IndexScan on
rel3 as well and then it might need to distinguish between
parallel and non-parallel part of plans. I think this could
make Funnel node complex.
That is different from what I plan now. In the above example, the
Funnel node has two non-parallel-aware nodes (rel1 and rel3)
and one parallel-aware node, so three PlannedStmts, one for each,
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 or rel3; however, rel2
can be executed by multiple workers simultaneously.
Okay, now I get your point, but I think the cost of executing a
non-parallel node in an additional worker is not small, considering
the communication cost and setting up an additional worker for
each sub-plan (assume the case where, out of 100 child nodes,
only a few (2 or 3) nodes actually need multiple workers).
It is a competition between the traditional Append that takes Funnel
children and the new appendable Funnel that takes parallel and
non-parallel children. Probably, the key factors are cpu_tuple_comm_cost,
parallel_setup_cost and the selectivity of the sub-plans.
Both cases have advantages and disadvantages depending on the query,
so we can never determine which is better without path consideration.
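That competition can be sketched with toy cost arithmetic; the function
names are hypothetical and the parameters merely mimic the
cpu_tuple_comm_cost and parallel_setup_cost factors named above, not the
actual costing code:

```c
#include <assert.h>

/* Every tuple surviving the sub-plans must cross the shm_mq, so the
 * appendable Funnel pays a per-tuple transfer cost proportional to the
 * selectivity, on top of the worker-setup cost. */
static double
funnel_total_cost_sketch(double scan_cost, double ntuples, double selectivity,
                         int nworkers, double parallel_setup_cost,
                         double cpu_tuple_comm_cost)
{
    return parallel_setup_cost * nworkers
        + scan_cost / nworkers
        + ntuples * selectivity * cpu_tuple_comm_cost;
}

/* The traditional Append: no worker setup, no tuple transfer. */
static double
append_total_cost_sketch(double scan_cost)
{
    return scan_cost;
}
```

With selective sub-plans few tuples cross the queue and the Funnel wins;
when every tuple must be transferred, the comm cost can hand the win back
to the plain Append, which is why only path-level comparison can decide.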
I think for a particular PlannedStmt, the number of workers must have
been decided before the start of execution, so if that many workers are
available to work on that particular PlannedStmt, then the next/new
worker should work on the next PlannedStmt.
My concern about a pre-determined number of workers is that it depends on
the run-time circumstances of concurrent sessions. Even if the planner
wants to assign 10 workers to a particular sub-plan, only 4 workers may be
available at run time because of consumption by other sessions.
So, I expect only the maximum number of workers is a meaningful
configuration.
In that case, there is a possibility that many of the workers are just
working on one or two of the nodes and the execution of other nodes might
get starved. I understand this is a tricky problem of allocating the number
of workers for different nodes; however, we should try to develop an
algorithm where there is some degree of fairness in the allocation of
workers for different nodes.
I'd like to agree; however, I also want to keep the first version as
simple as possible. We can develop alternative logic to assign a
suitable number of workers later.
2. Execution of work by workers and the Funnel node, and then passing
the results back to the upper node. I think this needs some more
work in addition to the ParallelSeqScan patch.
I expect we can utilize existing infrastructure here. It just picks
up the records coming from the underlying workers, then raises them to
the upper node.
Sure, but you still need some work, at least in the area of making
workers understand different node types; I am guessing you need
to modify readfuncs.c to support any new plan node needed for this
work.
Yes, it was not a creative work. :-)
https://github.com/kaigai/sepgsql/blob/fappend/src/backend/nodes/readfuncs.c#L1479
Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>