Re: [DESIGN] ParallelAppend

Started by Kouhei Kaigaiover 10 years ago6 messages
#1Kouhei Kaigai
kaigai@ak.jp.nec.com

On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

On Sat, Aug 1, 2015 at 6:39 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

If we pull Funnel here, I think the plan shall be as follows:
Funnel
--> SeqScan on rel1
--> PartialSeqScan on rel2
--> IndexScan on rel3

So if we go this route, then Funnel should have capability
to execute non-parallel part of plan as well, like in this
case it should be able to execute non-parallel IndexScan on
rel3 as well and then it might need to distinguish between
parallel and non-parallel part of plans. I think this could
make Funnel node complex.

It is difference from what I plan now. In the above example,
Funnel node has two non-parallel aware node (rel1 and rel3)
and one parallel aware node, then three PlannedStmt for each
shall be put on the TOC segment. Even though the background
workers pick up a PlannedStmt from the three, only one worker
can pick up the PlannedStmt for rel1 and rel3, however, rel2
can be executed by multiple workers simultaneously.

Okay, now I got your point, but I think the cost of execution
of non-parallel node by additional worker is not small considering
the communication cost and setting up an addional worker for
each sub-plan (assume the case where out of 100-child nodes
only few (2 or 3) nodes actually need multiple workers).

It is a competition between traditional Append that takes Funnel
children and the new appendable Funnel that takes parallel and
non-parallel children. Probably, key factors are cpu_tuple_comm_cost,
parallel_setup_cost and degree of selectivity of sub-plans.
Both cases has advantage and disadvantage depending on the query,
so we can never determine which is better without path consideration.

Sure, that is what we should do, however the tricky part would be when
the path for doing local scan is extremely cheaper than path for parallel
scan for one of the child nodes. For such cases, pulling up Funnel-node
can incur more cost. I think some of the other possible ways to make this
work could be to extend Funnel so that it is capable of executing both parallel
and non-parallel nodes, have a new Funnel like node which has such a
capability.

I think it is job of (more intelligent) planner but not in the first
version. If subplans of Append are mixture of nodes which has or does
not have worth of parallel execution, we will be able to arrange the
original form:

  Append
   + Scan on rel1 (large)
   + Scan on rel2 (large)
   + Scan on rel3 (middle)
   + Scan on rel4 (tiny)
   + Scan on rel5 (tiny)

to Funnel aware form, but partially:

Append
+ Funnel
| + Scan on rel1 (large)
| + Scan on rel2 (large)
| + Scan on rel3 (large)
+ Scan on rel4 (tiny)
+ Scan on rel5 (tiny)

It does not require special functionalities of Append/Funnel more
than what we have discussed, as long as planner is enough intelligent.
One downside of this approach is, plan tree tends to become more
complicated, thus makes logic to pushdown joins also becomes complicated.

Here is one other issue I found. Existing code assumes a TOC segment has
only one contents per node type, so it uses pre-defined key (like
PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
multiple PlannedStmt or PartialSeqScan node on a TOC segment.
My idea is enhancement of Plan node to have an unique identifier within
a particular plan trees. Once a unique identifier is assigned, we can
put individual information on the TOC segment, even if multiple
PartialSeqScan nodes are packed.
Did we have a discussion about this topic in the past?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#1)

On Thu, Aug 13, 2015 at 5:26 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

On Fri, Aug 7, 2015 at 2:15 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com>

wrote:

Sure, that is what we should do, however the tricky part would be when
the path for doing local scan is extremely cheaper than path for parallel
scan for one of the child nodes. For such cases, pulling up Funnel-node
can incur more cost. I think some of the other possible ways to make

this

work could be to extend Funnel so that it is capable of executing both

parallel

and non-parallel nodes, have a new Funnel like node which has such a
capability.

I think it is job of (more intelligent) planner but not in the first
version. If subplans of Append are mixture of nodes which has or does
not have worth of parallel execution, we will be able to arrange the
original form:

Append
+ Scan on rel1 (large)
+ Scan on rel2 (large)
+ Scan on rel3 (middle)
+ Scan on rel4 (tiny)
+ Scan on rel5 (tiny)

to Funnel aware form, but partially:

Append
+ Funnel
| + Scan on rel1 (large)
| + Scan on rel2 (large)
| + Scan on rel3 (large)
+ Scan on rel4 (tiny)
+ Scan on rel5 (tiny)

This is exactly what I have in mind.

Here is one other issue I found. Existing code assumes a TOC segment has
only one contents per node type, so it uses pre-defined key (like
PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
multiple PlannedStmt or PartialSeqScan node on a TOC segment.

We have few keys in parallel-seq-scan patch
(PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
which multiple structures are shared between master and worker backends.

Check if something similar can work for your use case.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#3Robert Haas
robertmhaas@gmail.com
In reply to: Amit Kapila (#2)

On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here is one other issue I found. Existing code assumes a TOC segment has
only one contents per node type, so it uses pre-defined key (like
PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
multiple PlannedStmt or PartialSeqScan node on a TOC segment.

We have few keys in parallel-seq-scan patch
(PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
which multiple structures are shared between master and worker backends.

Check if something similar can work for your use case.

I think you are possibly missing the point. I think KaiGai's correct,
and I pointed out the same problem to you before. The parallel key
for the Partial Seq Scan needs to be allocated on the fly and carried
in the node, or we'll never be able to push multiple things below the
funnel. I'm not quite sure what you're trying to explain with this
response.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Amit Kapila
amit.kapila16@gmail.com
In reply to: Robert Haas (#3)

On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com>

wrote:

Here is one other issue I found. Existing code assumes a TOC segment

has

only one contents per node type, so it uses pre-defined key (like
PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
multiple PlannedStmt or PartialSeqScan node on a TOC segment.

We have few keys in parallel-seq-scan patch
(PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
which multiple structures are shared between master and worker backends.

Check if something similar can work for your use case.

I think you are possibly missing the point.

It could be possible, but let me summarize what I thought would be required
for above use case. For Parallel Append, we need to push multiple
planned statements in contrast to one planned statement as is done for
current patch and then one or more parallel workers needs to work on each
planned statement. So if we know in advance how many planned statements
are we passing down (which we should), then using ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements or some other similar
way), workers can find the the planned statement on which they need to work
and similarly information for PartialSeqScan (which currently is parallel
heap
scan descriptor information).

I think KaiGai's correct,
and I pointed out the same problem to you before. The parallel key
for the Partial Seq Scan needs to be allocated on the fly and carried
in the node, or we'll never be able to push multiple things below the
funnel.

Okay, immediately I don't see what is the best way to achieve this but
let us discuss this separately on Parallel Seq Scan thread and let me
know if you have something specific in your mind. I will also give this
a more thought.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#5Kouhei Kaigai
kaigai@ak.jp.nec.com
In reply to: Amit Kapila (#4)

On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 18, 2015 at 11:27 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:

Here is one other issue I found. Existing code assumes a TOC segment has
only one contents per node type, so it uses pre-defined key (like
PARALLEL_KEY_SCAN) per node type, however, it is problematic if we put
multiple PlannedStmt or PartialSeqScan node on a TOC segment.

We have few keys in parallel-seq-scan patch
(PARALLEL_KEY_TUPLE_QUEUE and PARALLEL_KEY_INST_INFO) for
which multiple structures are shared between master and worker backends.

Check if something similar can work for your use case.

I think you are possibly missing the point.

It could be possible, but let me summarize what I thought would be required
for above use case. For Parallel Append, we need to push multiple
planned statements in contrast to one planned statement as is done for
current patch and then one or more parallel workers needs to work on each
planned statement. So if we know in advance how many planned statements
are we passing down (which we should), then using ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements or some other similar
way), workers can find the the planned statement on which they need to work
and similarly information for PartialSeqScan (which currently is parallel heap
scan descriptor information).

My problem is that we have no identifier to point a particular element on
the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can
have multiple items.
Please assume a situation when ExecPartialSeqScan() has to lookup
a particular item on TOC but multiple PartialSeqScan nodes can exist.

Currently, it does:
pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

However, ExecPartialSeqScan() cannot know which is the index of mine,
or it is not reasonable to pay attention on other node in this level.
Even if PARALLEL_KEY_SCAN has multiple items, PartialSeqScan node also
needs to have identifier.

I think KaiGai's correct,
and I pointed out the same problem to you before. The parallel key
for the Partial Seq Scan needs to be allocated on the fly and carried
in the node, or we'll never be able to push multiple things below the
funnel.

Okay, immediately I don't see what is the best way to achieve this but
let us discuss this separately on Parallel Seq Scan thread and let me
know if you have something specific in your mind. I will also give this
a more thought.

I want to have 'node_id' in the Plan node, then unique identifier is
assigned on the field prior to serialization. It is a property of the
Plan node, so we can reproduce this identifier on the background worker
side using stringToNode(), then ExecPartialSeqScan can pull out a proper
field from the TOC segment by this node_id.
Probably, we can co-exist this structure without big changes.

1. Define PARALLEL_KEY_DYNAMIC_LEAST as a least value that is larger
than any static TOC key (like PARALLEL_KEY_TUPLE_QUEUE).
2. Run plan-tree node walker on InitializeParallelWorkers, just before
nodeToString(), to assign node_id larger than the above label and
with increasing for each node.
3. Use node_id instead of the static PARALLEL_KEY_SCAN on
ExecPartialSeqScan

Even though we need some more trivial fixes are needed, it seems to
me the above approach shall work.
Also, please note that I don't assume only PartialSeqScan want to
have its field on TOC segment, but some CustomScan node also wants
to have its own shared field when co-working under Funnel node.

On the other hand, I think it is too aggressive to complete the
initial work of this patch by the starting day of the next commit
fest, so I think the target is middle of October.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Amit Kapila
amit.kapila16@gmail.com
In reply to: Kouhei Kaigai (#5)

On Tue, Aug 25, 2015 at 6:19 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:

On Fri, Aug 21, 2015 at 7:40 PM, Robert Haas <robertmhaas@gmail.com>

wrote:

It could be possible, but let me summarize what I thought would be

required

for above use case. For Parallel Append, we need to push multiple
planned statements in contrast to one planned statement as is done for
current patch and then one or more parallel workers needs to work on

each

planned statement. So if we know in advance how many planned statements
are we passing down (which we should), then using ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements or some other similar
way), workers can find the the planned statement on which they need to

work

and similarly information for PartialSeqScan (which currently is

parallel heap

scan descriptor information).

My problem is that we have no identifier to point a particular element on
the TOC segment even if PARALLEL_KEY_PLANNEDSTMT or PARALLEL_KEY_SCAN can
have multiple items.
Please assume a situation when ExecPartialSeqScan() has to lookup
a particular item on TOC but multiple PartialSeqScan nodes can exist.

Currently, it does:
pscan = shm_toc_lookup(node->ss.ps.toc, PARALLEL_KEY_SCAN);

However, ExecPartialSeqScan() cannot know which is the index of mine,
or it is not reasonable to pay attention on other node in this level.
Even if PARALLEL_KEY_SCAN has multiple items, PartialSeqScan node also
needs to have identifier.

Yes that's right and I think we can find out the same. Basically we need to
know the planned statement number on which current worker is working and
that anyway we have to do before the worker can start the work. One way is
as I have explained above that use ParallelWorkerNumber
(ParallelWorkerNumber % num_planned_statements) to find or might need
some sophisticated way to find that out, but definitely we need to know that
before start of execution by worker and once we know that we can use it
find the PARALLEL_KEY_SCAN or whatever key for this worker (as the
the position of PARALLEL_KEY_SCAN will be same as of planned stmt
for a worker).

I think KaiGai's correct,
and I pointed out the same problem to you before. The parallel key
for the Partial Seq Scan needs to be allocated on the fly and carried
in the node, or we'll never be able to push multiple things below the
funnel.

Okay, immediately I don't see what is the best way to achieve this but
let us discuss this separately on Parallel Seq Scan thread and let me
know if you have something specific in your mind. I will also give this
a more thought.

I want to have 'node_id' in the Plan node, then unique identifier is
assigned on the field prior to serialization. It is a property of the
Plan node, so we can reproduce this identifier on the background worker
side using stringToNode(), then ExecPartialSeqScan can pull out a proper
field from the TOC segment by this node_id.

Okay, this can also work, but why to introduce identifier in plan node, if
it
can work without it.

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com