Parallel Append implementation

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Amit Khandekar (#1)

Re: Parallel Append implementation

On Fri, Dec 23, 2016 at 10:51 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Currently an Append plan node does not execute its subplans in
parallel. There is no distribution of workers across its subplans. The
second subplan starts running only after the first subplan finishes,
although the individual subplans may be running parallel scans.

Secondly, we create a partial Append path for an appendrel, but we do
that only if all of its member subpaths are partial paths. If one or
more of the subplans is a non-parallel path, there will be only a
non-parallel Append. So whatever node is sitting on top of Append is
not going to do a parallel plan; for example, a select count(*) won't
divide it into partial aggregates if the underlying Append is not
partial.

The attached patch removes both of the above restrictions. There has
already been a mail thread [1] that discusses an approach suggested by
Robert Haas for implementing this feature. This patch uses this same
approach.

The first goal requires some kind of synchronization which will allow workers
to be distributed across the subplans. The second goal requires some kind of
synchronization to prevent multiple workers from executing non-parallel
subplans. The patch uses different mechanisms to achieve the goals. If
we create two different patches addressing each goal, those may be
easier to handle.

We may want to think about a third goal: preventing too many workers
from executing the same plan. As per comment in get_parallel_divisor()
we do not see any benefit if more than 4 workers execute the same
node. So, an append node can distribute more than 4 worker nodes
equally among the available subplans. It might be better to do that as
a separate patch.

Attached is pgbench_create_partition.sql (derived from the one
included in the above thread) that distributes pgbench_accounts table
data into 3 partitions pgbench_account_[1-3]. The below queries use
this schema.

Consider a query such as :
select count(*) from pgbench_accounts;

Now suppose, these two partitions do not allow parallel scan :
alter table pgbench_accounts_1 set (parallel_workers=0);
alter table pgbench_accounts_2 set (parallel_workers=0);

On HEAD, due to some of the partitions having non-parallel scans, the
whole Append would be a sequential scan :

Aggregate
-> Append
-> Index Only Scan using pgbench_accounts_pkey on pgbench_accounts
-> Seq Scan on pgbench_accounts_1
-> Seq Scan on pgbench_accounts_2
-> Seq Scan on pgbench_accounts_3

Whereas, with the patch, the Append looks like this :

Finalize Aggregate
-> Gather
Workers Planned: 6
-> Partial Aggregate
-> Parallel Append
-> Parallel Seq Scan on pgbench_accounts
-> Seq Scan on pgbench_accounts_1
-> Seq Scan on pgbench_accounts_2
-> Parallel Seq Scan on pgbench_accounts_3

Above, Parallel Append is generated, and it executes all these
subplans in parallel, with 1 worker executing each of the sequential
scans, and multiple workers executing each of the parallel subplans.

======= Implementation details ========

------- Adding parallel-awareness -------

In a given worker, this Append plan node will be executing just like
the usual partial Append node. It will run a subplan until completion.
The subplan may or may not be a partial parallel-aware plan like
parallelScan. After the subplan is done, Append will choose the next
subplan. It is here where it will be different than the current
partial Append plan: it is parallel-aware. The Append nodes in the
workers will be aware that there are other Append nodes running in
parallel. The partial Append will have to coordinate with other Append
nodes while choosing the next subplan.

------- Distribution of workers --------

The coordination info is stored in a shared array, each element of
which describes the per-subplan info. This info contains the number of
workers currently executing the subplan, and the maximum number of
workers that should be executing it at the same time. For non-partial
sublans, max workers would always be 1. For choosing the next subplan,
the Append executor will sequentially iterate over the array to find a
subplan having the least number of workers currently being executed,
AND which is not already being executed by the maximum number of
workers assigned for the subplan. Once it gets one, it increments
current_workers, and releases the Spinlock, so that other workers can
choose their next subplan if they are waiting.

This way, workers would be fairly distributed across subplans.

The shared array needs to be initialized and made available to
workers. For this, we can do exactly what sequential scan does for
being parallel-aware : Using function ExecAppendInitializeDSM()
similar to ExecSeqScanInitializeDSM() in the backend to allocate the
array. Similarly, for workers, have ExecAppendInitializeWorker() to
retrieve the shared array.

-------- Generating Partial Append plan having non-partial subplans --------

In set_append_rel_pathlist(), while generating a partial path for
Append, also include the non-partial child subpaths, besides the
partial subpaths. This way, it can contain a mix of partial and
non-partial children paths. But for a given child, its path would be
either the cheapest partial path or the cheapest non-partial path.

For a non-partial child path, it will only be included if it is
parallel-safe. If there is no parallel-safe path, a partial Append
path would not be generated. This behaviour also automatically
prevents paths that have a Gather node beneath.

Finally when it comes to create a partial append path using these
child paths, we also need to store a bitmap set indicating which of
the child paths are non-partial paths. For this, have a new BitmapSet
field : Append->partial_subplans. At execution time, this will be used
to set the maximum workers for a non-partial subpath to 1.

We may be able to eliminate this field. Please check comment 6 below.

-------- Costing -------

For calculating per-worker parallel Append path cost, it first
calculates a total of child subplan costs considering all of their
workers, and then divides it by the Append node's parallel_divisor,
similar to how parallel scan uses this "parallel_divisor".

For startup cost, it is assumed that Append would start returning
tuples when the child node having the lowest startup cost is done
setting up. So Append startup cost is equal to startup cost of the
child with minimum startup cost.

-------- Scope --------

There are two different code paths where Append path is generated.
1. One is where append rel is generated : Inheritance table, and UNION
ALL clause.
2. Second codepath is in prepunion.c. This gets executed for UNION
(without ALL) and INTERSECT/EXCEPT [ALL]. The patch does not support
Parallel Append in this scenario. It can be later taken up as
extension, once this patch is reviewed.

Here are some review comments

1. struct ParallelAppendDescData is being used at other places. The declaration
style doesn't seem to be very common in the code or in the directory where the
file is located.
+struct ParallelAppendDescData
+{
+    slock_t        pa_mutex;        /* mutual exclusion to choose
next subplan */
+    parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
+};
Defining it like
typdef struct ParallelAppendDescData
{
    slock_t        pa_mutex;        /* mutual exclusion to choose next
subplan */
    parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
};
will make its use handy. Instead of struct ParallelAppendDescData, you will
need to use just ParallelAppendDescData. May be we want to rename
parallel_append_info as ParallelAppendInfo and change the style to match other
declarations.

2. The comment below refers to the constant which it describes, which looks
odd. May be it should be worded as "A special value of
AppendState::as_whichplan, to indicate no plans left to be executed.". Also
using INVALID for "no plans left ..." seems to be a misnomer.
/*
* For Parallel Append, AppendState::as_whichplan can have PA_INVALID_PLAN
* value, which indicates there are no plans left to be executed.
*/
#define PA_INVALID_PLAN -1

3. The sentence "We have got NULL", looks odd. Probably we don't need it as
it's clear from the code above that this code deals with the case when the
current subplan didn't return any row.
/*
* We have got NULL. There might be other workers still processing the
* last chunk of rows for this same node, but there's no point for new
* workers to run this node, so mark this node as finished.
*/
4. In the same comment, I guess, the word "node" refers to "subnode" and not
the node pointed by variable "node". May be you want to use word "subplan"
here.

4. set_finished()'s prologue has different indentation compared to other
functions in the file.

5. Multilevel comment starts with an empty line.
+ /* Keep track of the node with the least workers so far. For the very

6. By looking at parallel_worker field of a path, we can say whether it's
partial or not. We probably do not require to maintain a bitmap for that at in
the Append path. The bitmap can be constructed, if required, at the time of
creating the partial append plan. The reason to take this small step is 1. we
want to minimize our work at the time of creating paths, 2. while freeing a
path in add_path, we don't free the internal structures, in this case the
Bitmap, which will waste memory if the path is not chosen while planning.

7. If we consider 6, we don't need concat_append_subpaths(), but still here are
some comments about that function. Instead of accepting two separate arguments
childpaths and child_partial_subpaths_set, which need to be in sync, we can
just pass the path which contains both of those. In the same following code may
be optimized by adding a utility function to Bitmapset, which advances
all members
by given offset and using that function with bms_union() to merge the
bitmapset e.g.
bms_union(*partial_subpaths_set,
bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
if (partial_subpaths_set)
{
for (i = 0; i < list_length(childpaths); i++)
{
/*
* The child paths themselves may or may not be partial paths. The
* bitmapset numbers of these paths will need to be set considering
* that these are to be appended onto the partial_subpaths_set.
*/
if (!child_partial_subpaths_set ||
bms_is_member(i, child_partial_subpaths_set))
{
*partial_subpaths_set = bms_add_member(*partial_subpaths_set,
append_subpath_len + i);
}
}
}

8.
-            parallel_workers = Max(parallel_workers, path->parallel_workers);
+            /*
+             * partial_subpaths can have non-partial subpaths so
+             * path->parallel_workers can be 0. For such paths, allocate one
+             * worker.
+             */
+            parallel_workers +=
+                (path->parallel_workers > 0 ? path->parallel_workers : 1);

This looks odd. Earlier code was choosing maximum of all parallel workers,
whereas new code adds them all. E.g. if parallel_workers for subpaths is 3, 4,
3, without your change, it will pick up 4. But with your change it will pick
10. I think, you intend to write this as
parallel_workers = Max(parallel_workers, path->parallel_workers ?
path->parallel_workers : 1);

If you do that probably you don't need since parallel_workers are never set
more than max_parallel_workers_per_gather.
+        /* In no case use more than max_parallel_workers_per_gather. */
+        parallel_workers = Min(parallel_workers,
+                               max_parallel_workers_per_gather);
+

9. Shouldn't this funciton return double?
int
get_parallel_divisor(int parallel_workers)

9. In get_parallel_divisor(), if parallel_worker is 0 i.e. it's a partial path
the return value will be 2, which isn't true. This function is being called for
all the subpaths to get the original number of rows and costs of partial paths.
I think we don't need to call this function on subpaths which are not partial
paths or make it work parallel_workers = 0.

10. We should probably move the parallel_safe calculation out of cost_append().
+            path->parallel_safe = path->parallel_safe &&
+                                  subpath->parallel_safe;

11. This check shouldn't be part of cost_append().
+            /* All child paths must have same parameterization */
+            Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));

12. cost_append() essentially adds costs of all the subpaths and then divides
by parallel_divisor. This might work if all the subpaths are partial paths. But
for the subpaths which are not partial, a single worker will incur the whole
cost of that subpath. Hence just dividing all the total cost doesn't seem the
right thing to do. We should apply different logic for costing non-partial
subpaths and partial subpaths.

13. No braces required for single line block
+    /* Increment worker count for the chosen node, if at all we found one. */
+    if (min_whichplan != PA_INVALID_PLAN)
+    {
+        padesc->pa_info[min_whichplan].pa_num_workers++;
+    }

14. exec_append_scan_first() is a one-liner function, should we just inline it?

15. This patch replaces exec_append_initialize_next() with
exec_append_scan_first(). The earlier function was handling backward and
forward scans separately, but the later function doesn't do that. Why?

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#2)

Re: Parallel Append implementation

Thanks Ashutosh for the feedback.

On 6 January 2017 at 17:04, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Dec 23, 2016 at 10:51 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Currently an Append plan node does not execute its subplans in
parallel. There is no distribution of workers across its subplans. The
second subplan starts running only after the first subplan finishes,
although the individual subplans may be running parallel scans.

Secondly, we create a partial Append path for an appendrel, but we do
that only if all of its member subpaths are partial paths. If one or
more of the subplans is a non-parallel path, there will be only a
non-parallel Append. So whatever node is sitting on top of Append is
not going to do a parallel plan; for example, a select count(*) won't
divide it into partial aggregates if the underlying Append is not
partial.

The attached patch removes both of the above restrictions. There has
already been a mail thread [1] that discusses an approach suggested by
Robert Haas for implementing this feature. This patch uses this same
approach.

The first goal requires some kind of synchronization which will allow workers
to be distributed across the subplans. The second goal requires some kind of
synchronization to prevent multiple workers from executing non-parallel
subplans. The patch uses different mechanisms to achieve the goals. If
we create two different patches addressing each goal, those may be
easier to handle.

Goal A : Allow non-partial subpaths in Partial Append.
Goal B : Distribute workers across the Append subplans.
Both of these require some kind of synchronization while choosing the
next subplan. So, goal B is achieved by doing all the synchronization
stuff. And implementation of goal A requires that goal B is
implemented. So there is a dependency between these two goals. While
implementing goal B, we should keep in mind that it should also work
for goal A; it does not make sense later changing the synchronization
logic in goal A.

I am ok with splitting the patch into 2 patches :
a) changes required for goal A
b) changes required for goal B.
But I think we should split it only when we are ready to commit them
(commit for B, immediately followed by commit for A). Until then, we
should consider both of these together because they are interconnected
as explained above.

We may want to think about a third goal: preventing too many workers
from executing the same plan. As per comment in get_parallel_divisor()
we do not see any benefit if more than 4 workers execute the same
node. So, an append node can distribute more than 4 worker nodes
equally among the available subplans. It might be better to do that as
a separate patch.

I think that comment is for calculating leader contribution. It does
not say that 4 workers is too many workers in general.

But yes, I agree, and I have it in mind as the next improvement.
Basically, it does not make sense to give more than 3 workers to a
subplan when parallel_workers for that subplan are 3. For e.g., if
gather max workers is 10, and we have 2 Append subplans s1 and s2 with
parallel workers 3 and 5 respectively. Then, with the current patch,
it will distribute 4 workers to each of these workers. What we should
do is : once both of the subplans get 3 workers each, we should give
the 7th and 8th worker to s2.

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.

Here are some review comments

I will handle the other comments, but first, just a quick response to
some important ones :

6. By looking at parallel_worker field of a path, we can say whether it's
partial or not. We probably do not require to maintain a bitmap for that at in
the Append path. The bitmap can be constructed, if required, at the time of
creating the partial append plan. The reason to take this small step is 1. we
want to minimize our work at the time of creating paths, 2. while freeing a
path in add_path, we don't free the internal structures, in this case the
Bitmap, which will waste memory if the path is not chosen while planning.

Let me try keeping the per-subplan max_worker info in Append path
itself, like I mentioned above. If that works, the bitmap will be
replaced by max_worker field. In case of non-partial subpath,
max_worker will be 1. (this is the same info kept in AppendState node
in the patch, but now we might need to keep it in Append path node as
well).

7. If we consider 6, we don't need concat_append_subpaths(), but still here are
some comments about that function. Instead of accepting two separate arguments
childpaths and child_partial_subpaths_set, which need to be in sync, we can
just pass the path which contains both of those. In the same following code may
be optimized by adding a utility function to Bitmapset, which advances
all members
by given offset and using that function with bms_union() to merge the
bitmapset e.g.
bms_union(*partial_subpaths_set,
bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
if (partial_subpaths_set)
{
for (i = 0; i < list_length(childpaths); i++)
{
/*
* The child paths themselves may or may not be partial paths. The
* bitmapset numbers of these paths will need to be set considering
* that these are to be appended onto the partial_subpaths_set.
*/
if (!child_partial_subpaths_set ||
bms_is_member(i, child_partial_subpaths_set))
{
*partial_subpaths_set = bms_add_member(*partial_subpaths_set,
append_subpath_len + i);
}
}
}

Again, for the reason mentioned above, we will defer this point for now.

8.
-            parallel_workers = Max(parallel_workers, path->parallel_workers);
+            /*
+             * partial_subpaths can have non-partial subpaths so
+             * path->parallel_workers can be 0. For such paths, allocate one
+             * worker.
+             */
+            parallel_workers +=
+                (path->parallel_workers > 0 ? path->parallel_workers : 1);
This looks odd. Earlier code was choosing maximum of all parallel workers,
whereas new code adds them all. E.g. if parallel_workers for subpaths is 3, 4,
3, without your change, it will pick up 4. But with your change it will pick
10. I think, you intend to write this as
parallel_workers = Max(parallel_workers, path->parallel_workers ?
path->parallel_workers : 1);

The intention is to add all workers, because a parallel-aware Append
is going to need them in order to make the subplans run with their
full capacity in parallel. So with subpaths with 3, 4, and 3 workers,
the Append path will need 10 workers. If it allocates 4 workers, its
not sufficient : Each of them would get only 1 worker, or max 2. In
the existing code, 4 is correct, because all the workers are going to
execute the same subplan node at a time.

9. Shouldn't this funciton return double?
int
get_parallel_divisor(int parallel_workers)

Yes, right, I will do that.

9. In get_parallel_divisor(), if parallel_worker is 0 i.e. it's a partial path
the return value will be 2, which isn't true. This function is being called for
all the subpaths to get the original number of rows and costs of partial paths.
I think we don't need to call this function on subpaths which are not partial
paths or make it work parallel_workers = 0.

I didn't understand this. I checked again get_parallel_divisor()
function code. I think it will return 1 if parallel_workers is 0. But
I may be missing something.

12. cost_append() essentially adds costs of all the subpaths and then divides
by parallel_divisor. This might work if all the subpaths are partial paths. But
for the subpaths which are not partial, a single worker will incur the whole
cost of that subpath. Hence just dividing all the total cost doesn't seem the
right thing to do. We should apply different logic for costing non-partial
subpaths and partial subpaths.

WIth the current partial path costing infrastructure, it is assumed
that a partial path node should return the average per-worker cost.
Hence, I thought it would be best to do it in a similar way for
Append. But let me think if we can do something. With the current
parallelism costing infrastructure, I am not sure though.

13. No braces required for single line block
+    /* Increment worker count for the chosen node, if at all we found one. */
+    if (min_whichplan != PA_INVALID_PLAN)
+    {
+        padesc->pa_info[min_whichplan].pa_num_workers++;
+    }
14. exec_append_scan_first() is a one-liner function, should we just inline it?

15. This patch replaces exec_append_initialize_next() with
exec_append_scan_first(). The earlier function was handling backward and
forward scans separately, but the later function doesn't do that. Why?

I will come to these and some other ones later.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Amit Khandekar (#3)

Re: Parallel Append implementation

On Mon, Jan 16, 2017 at 9:49 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Thanks Ashutosh for the feedback.

On 6 January 2017 at 17:04, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

On Fri, Dec 23, 2016 at 10:51 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Currently an Append plan node does not execute its subplans in
parallel. There is no distribution of workers across its subplans. The
second subplan starts running only after the first subplan finishes,
although the individual subplans may be running parallel scans.

Secondly, we create a partial Append path for an appendrel, but we do
that only if all of its member subpaths are partial paths. If one or
more of the subplans is a non-parallel path, there will be only a
non-parallel Append. So whatever node is sitting on top of Append is
not going to do a parallel plan; for example, a select count(*) won't
divide it into partial aggregates if the underlying Append is not
partial.

The attached patch removes both of the above restrictions. There has
already been a mail thread [1] that discusses an approach suggested by
Robert Haas for implementing this feature. This patch uses this same
approach.

The first goal requires some kind of synchronization which will allow workers
to be distributed across the subplans. The second goal requires some kind of
synchronization to prevent multiple workers from executing non-parallel
subplans. The patch uses different mechanisms to achieve the goals. If
we create two different patches addressing each goal, those may be
easier to handle.

Goal A : Allow non-partial subpaths in Partial Append.
Goal B : Distribute workers across the Append subplans.
Both of these require some kind of synchronization while choosing the
next subplan. So, goal B is achieved by doing all the synchronization
stuff. And implementation of goal A requires that goal B is
implemented. So there is a dependency between these two goals. While
implementing goal B, we should keep in mind that it should also work
for goal A; it does not make sense later changing the synchronization
logic in goal A.

I am ok with splitting the patch into 2 patches :
a) changes required for goal A
b) changes required for goal B.
But I think we should split it only when we are ready to commit them
(commit for B, immediately followed by commit for A). Until then, we
should consider both of these together because they are interconnected
as explained above.

For B, we need to know, how much gain that brings and in which cases.
If that gain is not worth the complexity added, we may have to defer
Goal B. Goal A would certainly be useful since it will improve
performance of the targetted cases. The synchronization required for
Goal A is simpler than that of B and thus if we choose to implement
only A, we can live with a simpler synchronization.

BTW, Right now, the patch does not consider non-partial paths for a
child which has partial paths. Do we know, for sure, that a path
containing partial paths for a child, which has it, is always going to
be cheaper than the one which includes non-partial path. If not,
should we build another paths which contains non-partial paths for all
child relations. This sounds like a 0/1 knapsack problem.

Here are some review comments

I will handle the other comments, but first, just a quick response to
some important ones :

6. By looking at parallel_worker field of a path, we can say whether it's
partial or not. We probably do not require to maintain a bitmap for that at in
the Append path. The bitmap can be constructed, if required, at the time of
creating the partial append plan. The reason to take this small step is 1. we
want to minimize our work at the time of creating paths, 2. while freeing a
path in add_path, we don't free the internal structures, in this case the
Bitmap, which will waste memory if the path is not chosen while planning.

Let me try keeping the per-subplan max_worker info in Append path
itself, like I mentioned above. If that works, the bitmap will be
replaced by max_worker field. In case of non-partial subpath,
max_worker will be 1. (this is the same info kept in AppendState node
in the patch, but now we might need to keep it in Append path node as
well).

It will be better if we can fetch that information from each subpath
when creating the plan. As I have explained before, a path is minimal
structure, which should be easily disposable, when throwing away the
path.

7. If we consider 6, we don't need concat_append_subpaths(), but still here are
some comments about that function. Instead of accepting two separate arguments
childpaths and child_partial_subpaths_set, which need to be in sync, we can
just pass the path which contains both of those. In the same following code may
be optimized by adding a utility function to Bitmapset, which advances
all members
by given offset and using that function with bms_union() to merge the
bitmapset e.g.
bms_union(*partial_subpaths_set,
bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
if (partial_subpaths_set)
{
for (i = 0; i < list_length(childpaths); i++)
{
/*
* The child paths themselves may or may not be partial paths. The
* bitmapset numbers of these paths will need to be set considering
* that these are to be appended onto the partial_subpaths_set.
*/
if (!child_partial_subpaths_set ||
bms_is_member(i, child_partial_subpaths_set))
{
*partial_subpaths_set = bms_add_member(*partial_subpaths_set,
append_subpath_len + i);
}
}
}

Again, for the reason mentioned above, we will defer this point for now.

Ok.

8.
-            parallel_workers = Max(parallel_workers, path->parallel_workers);
+            /*
+             * partial_subpaths can have non-partial subpaths so
+             * path->parallel_workers can be 0. For such paths, allocate one
+             * worker.
+             */
+            parallel_workers +=
+                (path->parallel_workers > 0 ? path->parallel_workers : 1);
This looks odd. Earlier code was choosing maximum of all parallel workers,
whereas new code adds them all. E.g. if parallel_workers for subpaths is 3, 4,
3, without your change, it will pick up 4. But with your change it will pick
10. I think, you intend to write this as
parallel_workers = Max(parallel_workers, path->parallel_workers ?
path->parallel_workers : 1);
The intention is to add all workers, because a parallel-aware Append
is going to need them in order to make the subplans run with their
full capacity in parallel. So with subpaths with 3, 4, and 3 workers,
the Append path will need 10 workers. If it allocates 4 workers, its
not sufficient : Each of them would get only 1 worker, or max 2. In
the existing code, 4 is correct, because all the workers are going to
execute the same subplan node at a time.

Ok, makes sense if we take up Goal B.

9. In get_parallel_divisor(), if parallel_worker is 0 i.e. it's a partial path
the return value will be 2, which isn't true. This function is being called for
all the subpaths to get the original number of rows and costs of partial paths.
I think we don't need to call this function on subpaths which are not partial
paths or make it work parallel_workers = 0.

I didn't understand this. I checked again get_parallel_divisor()
function code. I think it will return 1 if parallel_workers is 0. But
I may be missing something.

Sorry, I also don't understand why I had that comment. For some
reason, I thought we are sending 1 when parallel_workers = 0 to
get_parallel_divisor(). But I don't understand why I thought so.
Anyway, I will provide better explanation next time I bounce against
this.

12. cost_append() essentially adds costs of all the subpaths and then divides
by parallel_divisor. This might work if all the subpaths are partial paths. But
for the subpaths which are not partial, a single worker will incur the whole
cost of that subpath. Hence just dividing all the total cost doesn't seem the
right thing to do. We should apply different logic for costing non-partial
subpaths and partial subpaths.

WIth the current partial path costing infrastructure, it is assumed
that a partial path node should return the average per-worker cost.
Hence, I thought it would be best to do it in a similar way for
Append. But let me think if we can do something. With the current
parallelism costing infrastructure, I am not sure though.

The current parallel mechanism is in sync with that costing. Each
worker is supposed to take the same burden, hence the same (average)
cost. But it will change when a single worker has to scan an entire
child relation and different child relations have different sizes.

Thanks for working on the comments.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Amit Langote

Langote_Amit_f8@lab.ntt.co.jp

over 9 years ago

In reply to: Amit Khandekar (#1)

Re: Parallel Append implementation

Hi Amit,

On 2016/12/23 14:21, Amit Khandekar wrote:

Currently an Append plan node does not execute its subplans in
parallel. There is no distribution of workers across its subplans. The
second subplan starts running only after the first subplan finishes,
although the individual subplans may be running parallel scans.

Secondly, we create a partial Append path for an appendrel, but we do
that only if all of its member subpaths are partial paths. If one or
more of the subplans is a non-parallel path, there will be only a
non-parallel Append. So whatever node is sitting on top of Append is
not going to do a parallel plan; for example, a select count(*) won't
divide it into partial aggregates if the underlying Append is not
partial.

The attached patch removes both of the above restrictions. There has
already been a mail thread [1] that discusses an approach suggested by
Robert Haas for implementing this feature. This patch uses this same
approach.

I was looking at the executor portion of this patch and I noticed that in
exec_append_initialize_next():

if (appendstate->as_padesc)
return parallel_append_next(appendstate);

/*
* Not parallel-aware. Fine, just go on to the next subplan in the
* appropriate direction.
*/
if (ScanDirectionIsForward(appendstate->ps.state->es_direction))
appendstate->as_whichplan++;
else
appendstate->as_whichplan--;

which seems to mean that executing Append in parallel mode disregards the
scan direction. I am not immediately sure what implications that has, so
I checked what heap scan does when executing in parallel mode, and found
this in heapgettup():

else if (backward)
{
/* backward parallel scan not supported */
Assert(scan->rs_parallel == NULL);

Perhaps, AppendState.as_padesc would not have been set if scan direction
is backward, because parallel mode would be disabled for the whole query
in that case (PlannerGlobal.parallelModeOK = false). Maybe add an
Assert() similar to one in heapgettup().

Thanks,
Amit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Amit Langote (#5)

Re: Parallel Append implementation

On Tue, Jan 17, 2017 at 2:40 PM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

Hi Amit,

On 2016/12/23 14:21, Amit Khandekar wrote:

Currently an Append plan node does not execute its subplans in
parallel. There is no distribution of workers across its subplans. The
second subplan starts running only after the first subplan finishes,
although the individual subplans may be running parallel scans.

Secondly, we create a partial Append path for an appendrel, but we do
that only if all of its member subpaths are partial paths. If one or
more of the subplans is a non-parallel path, there will be only a
non-parallel Append. So whatever node is sitting on top of Append is
not going to do a parallel plan; for example, a select count(*) won't
divide it into partial aggregates if the underlying Append is not
partial.

The attached patch removes both of the above restrictions. There has
already been a mail thread [1] that discusses an approach suggested by
Robert Haas for implementing this feature. This patch uses this same
approach.

I was looking at the executor portion of this patch and I noticed that in
exec_append_initialize_next():

if (appendstate->as_padesc)
return parallel_append_next(appendstate);

/*
* Not parallel-aware. Fine, just go on to the next subplan in the
* appropriate direction.
*/
if (ScanDirectionIsForward(appendstate->ps.state->es_direction))
appendstate->as_whichplan++;
else
appendstate->as_whichplan--;

which seems to mean that executing Append in parallel mode disregards the
scan direction. I am not immediately sure what implications that has, so
I checked what heap scan does when executing in parallel mode, and found
this in heapgettup():

else if (backward)
{
/* backward parallel scan not supported */
Assert(scan->rs_parallel == NULL);

Perhaps, AppendState.as_padesc would not have been set if scan direction
is backward, because parallel mode would be disabled for the whole query
in that case (PlannerGlobal.parallelModeOK = false). Maybe add an
Assert() similar to one in heapgettup().

There have been some reviews, but the patch has not been updated in
two weeks. Marking as "returned with feedback".
--
Michael

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#4)

Re: Parallel Append implementation

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

We may want to think about a third goal: preventing too many workers
from executing the same plan. As per comment in get_parallel_divisor()
we do not see any benefit if more than 4 workers execute the same
node. So, an append node can distribute more than 4 worker nodes
equally among the available subplans. It might be better to do that as
a separate patch.

I think that comment is for calculating leader contribution. It does
not say that 4 workers is too many workers in general.

But yes, I agree, and I have it in mind as the next improvement.
Basically, it does not make sense to give more than 3 workers to a
subplan when parallel_workers for that subplan are 3. For e.g., if
gather max workers is 10, and we have 2 Append subplans s1 and s2 with
parallel workers 3 and 5 respectively. Then, with the current patch,
it will distribute 4 workers to each of these workers. What we should
do is : once both of the subplans get 3 workers each, we should give
the 7th and 8th worker to s2.

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.

Attached is v2 patch, which implements above. Now Append plan node
stores a list of per-subplan max worker count, rather than the
Bitmapset. But still Bitmapset turned out to be necessary for
AppendPath. More details are in the subsequent comments.

Goal A : Allow non-partial subpaths in Partial Append.
Goal B : Distribute workers across the Append subplans.
Both of these require some kind of synchronization while choosing the
next subplan. So, goal B is achieved by doing all the synchronization
stuff. And implementation of goal A requires that goal B is
implemented. So there is a dependency between these two goals. While
implementing goal B, we should keep in mind that it should also work
for goal A; it does not make sense later changing the synchronization
logic in goal A.

I am ok with splitting the patch into 2 patches :
a) changes required for goal A
b) changes required for goal B.
But I think we should split it only when we are ready to commit them
(commit for B, immediately followed by commit for A). Until then, we
should consider both of these together because they are interconnected
as explained above.

For B, we need to know, how much gain that brings and in which cases.
If that gain is not worth the complexity added, we may have to defer
Goal B. Goal A would certainly be useful since it will improve
performance of the targetted cases. The synchronization required for
Goal A is simpler than that of B and thus if we choose to implement
only A, we can live with a simpler synchronization.

For Goal A , the logic for a worker synchronously choosing a subplan will be :
Go the next subplan. If that subplan has not already assigned max
workers, choose this subplan, otherwise, go the next subplan, and so
on.
For Goal B , the logic will be :
Among the subplans which are yet to achieve max workers, choose the
subplan with the minimum number of workers currently assigned.

I don't think there is a significant difference between the complexity
of the above two algorithms. So I think here the complexity does not
look like a factor based on which we can choose the particular logic.
We should choose the logic which has more potential for benefits. The
logic for goal B will work for goal A as well. And secondly, if the
subplans are using their own different system resources, the resource
contention might be less. One case is : all subplans using different
disks. Second case is : some of the subplans may be using a foreign
scan, so it would start using foreign server resources sooner. These
benefits apply when the Gather max workers count is not sufficient for
running all the subplans in their full capacity. If they are
sufficient, then the workers will be distributed over the subplans
using both the logics. Just the order of assignments of workers to
subplans will be different.

Also, I don't see a disadvantage if we follow the logic of Goal B.

BTW, Right now, the patch does not consider non-partial paths for a
child which has partial paths. Do we know, for sure, that a path
containing partial paths for a child, which has it, is always going to
be cheaper than the one which includes non-partial path. If not,
should we build another paths which contains non-partial paths for all
child relations. This sounds like a 0/1 knapsack problem.

I didn't quite get this. We do create a non-partial Append path using
non-partial child paths anyways.

Here are some review comments

I will handle the other comments, but first, just a quick response to
some important ones :

6. By looking at parallel_worker field of a path, we can say whether it's
partial or not. We probably do not require to maintain a bitmap for that at in
the Append path. The bitmap can be constructed, if required, at the time of
creating the partial append plan. The reason to take this small step is 1. we
want to minimize our work at the time of creating paths, 2. while freeing a
path in add_path, we don't free the internal structures, in this case the
Bitmap, which will waste memory if the path is not chosen while planning.

Let me try keeping the per-subplan max_worker info in Append path
itself, like I mentioned above. If that works, the bitmap will be
replaced by max_worker field. In case of non-partial subpath,
max_worker will be 1. (this is the same info kept in AppendState node
in the patch, but now we might need to keep it in Append path node as
well).

It will be better if we can fetch that information from each subpath
when creating the plan. As I have explained before, a path is minimal
structure, which should be easily disposable, when throwing away the
path.

Now in the v2 patch, we store per-subplan worker count. But still, we
cannot use the path->parallel_workers to determine whether it's a
partial path. This is because even for a non-partial path, it seems
the parallel_workers can be non-zero. For e.g., in
create_subqueryscan_path(), it sets path->parallel_workers to
subpath->parallel_workers. But this path is added as a non-partial
path. So we need a separate info as to which of the subpaths in Append
path are partial subpaths. So in the v2 patch, I continued to use
Bitmapset in AppendPath. But in Append plan node, number of workers is
calculated using this bitmapset. Check the new function
get_append_num_workers().

7. If we consider 6, we don't need concat_append_subpaths(),

As explained above, I have kept the BitmapSet for AppendPath.

but still here are
some comments about that function. Instead of accepting two separate arguments
childpaths and child_partial_subpaths_set, which need to be in sync, we can
just pass the path which contains both of those. In the same following code may
be optimized by adding a utility function to Bitmapset, which advances
all members
by given offset and using that function with bms_union() to merge the
bitmapset e.g.
bms_union(*partial_subpaths_set,
bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
if (partial_subpaths_set)

I will get back on this after more thought.

12. cost_append() essentially adds costs of all the subpaths and then divides
by parallel_divisor. This might work if all the subpaths are partial paths. But
for the subpaths which are not partial, a single worker will incur the whole
cost of that subpath. Hence just dividing all the total cost doesn't seem the
right thing to do. We should apply different logic for costing non-partial
subpaths and partial subpaths.

WIth the current partial path costing infrastructure, it is assumed
that a partial path node should return the average per-worker cost.
Hence, I thought it would be best to do it in a similar way for
Append. But let me think if we can do something. With the current
parallelism costing infrastructure, I am not sure though.

The current parallel mechanism is in sync with that costing. Each
worker is supposed to take the same burden, hence the same (average)
cost. But it will change when a single worker has to scan an entire
child relation and different child relations have different sizes.

I gave more thought on this. Considering each subplan has different
number of workers, I think it makes sense to calculate average
per-worker cost even in parallel Append. In case of non-partial
subplan, a single worker will execute it, but it will next choose
another subplan. So on average each worker is going to process the
same number of rows, and also the same amount of CPU. And that amount
of CPU cost and rows cost should be calculated by taking the total
count and dividing it by number of workers (parallel_divsor actually).

Here are some review comments
1. struct ParallelAppendDescData is being used at other places. The declaration
style doesn't seem to be very common in the code or in the directory where the
file is located.
+struct ParallelAppendDescData
+{
+    slock_t        pa_mutex;        /* mutual exclusion to choose
next subplan */
+    parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
+};
Defining it like
typdef struct ParallelAppendDescData
{
slock_t        pa_mutex;        /* mutual exclusion to choose next
subplan */
parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
};
will make its use handy. Instead of struct ParallelAppendDescData, you will
need to use just ParallelAppendDescData. May be we want to rename
parallel_append_info as ParallelAppendInfo and change the style to match other
declarations.
2. The comment below refers to the constant which it describes, which looks
odd. May be it should be worded as "A special value of
AppendState::as_whichplan, to indicate no plans left to be executed.". Also
using INVALID for "no plans left ..." seems to be a misnomer.
/*
* For Parallel Append, AppendState::as_whichplan can have PA_INVALID_PLAN
* value, which indicates there are no plans left to be executed.
*/
#define PA_INVALID_PLAN -1

3. The sentence "We have got NULL", looks odd. Probably we don't need it as
it's clear from the code above that this code deals with the case when the
current subplan didn't return any row.
/*
* We have got NULL. There might be other workers still processing the
* last chunk of rows for this same node, but there's no point for new
* workers to run this node, so mark this node as finished.
*/
4. In the same comment, I guess, the word "node" refers to "subnode" and not
the node pointed by variable "node". May be you want to use word "subplan"
here.

4. set_finished()'s prologue has different indentation compared to other
functions in the file.

5. Multilevel comment starts with an empty line.
+ /* Keep track of the node with the least workers so far. For the very

Done 1. to 5. above, as per your suggestions.

9. Shouldn't this funciton return double?
int
get_parallel_divisor(int parallel_workers)

v2 patch is rebased on latest master branch, which already contains
this function returning double.

10. We should probably move the parallel_safe calculation out of cost_append().
+            path->parallel_safe = path->parallel_safe &&
+                                  subpath->parallel_safe;

11. This check shouldn't be part of cost_append().
+            /* All child paths must have same parameterization */
+            Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));

Yet to handle the above ones.

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#7)

Re: Parallel Append implementation

v2 patch was not rebased over the latest master branch commits. Please
refer to the attached ParallelAppend_v3.patch, instead.

On 6 February 2017 at 11:06, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

We may want to think about a third goal: preventing too many workers
from executing the same plan. As per comment in get_parallel_divisor()
we do not see any benefit if more than 4 workers execute the same
node. So, an append node can distribute more than 4 worker nodes
equally among the available subplans. It might be better to do that as
a separate patch.

I think that comment is for calculating leader contribution. It does
not say that 4 workers is too many workers in general.

But yes, I agree, and I have it in mind as the next improvement.
Basically, it does not make sense to give more than 3 workers to a
subplan when parallel_workers for that subplan are 3. For e.g., if
gather max workers is 10, and we have 2 Append subplans s1 and s2 with
parallel workers 3 and 5 respectively. Then, with the current patch,
it will distribute 4 workers to each of these workers. What we should
do is : once both of the subplans get 3 workers each, we should give
the 7th and 8th worker to s2.

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.

Attached is v2 patch, which implements above. Now Append plan node
stores a list of per-subplan max worker count, rather than the
Bitmapset. But still Bitmapset turned out to be necessary for
AppendPath. More details are in the subsequent comments.

Goal A : Allow non-partial subpaths in Partial Append.
Goal B : Distribute workers across the Append subplans.
Both of these require some kind of synchronization while choosing the
next subplan. So, goal B is achieved by doing all the synchronization
stuff. And implementation of goal A requires that goal B is
implemented. So there is a dependency between these two goals. While
implementing goal B, we should keep in mind that it should also work
for goal A; it does not make sense later changing the synchronization
logic in goal A.

I am ok with splitting the patch into 2 patches :
a) changes required for goal A
b) changes required for goal B.
But I think we should split it only when we are ready to commit them
(commit for B, immediately followed by commit for A). Until then, we
should consider both of these together because they are interconnected
as explained above.

For B, we need to know, how much gain that brings and in which cases.
If that gain is not worth the complexity added, we may have to defer
Goal B. Goal A would certainly be useful since it will improve
performance of the targetted cases. The synchronization required for
Goal A is simpler than that of B and thus if we choose to implement
only A, we can live with a simpler synchronization.

For Goal A , the logic for a worker synchronously choosing a subplan will be :
Go the next subplan. If that subplan has not already assigned max
workers, choose this subplan, otherwise, go the next subplan, and so
on.
For Goal B , the logic will be :
Among the subplans which are yet to achieve max workers, choose the
subplan with the minimum number of workers currently assigned.

I don't think there is a significant difference between the complexity
of the above two algorithms. So I think here the complexity does not
look like a factor based on which we can choose the particular logic.
We should choose the logic which has more potential for benefits. The
logic for goal B will work for goal A as well. And secondly, if the
subplans are using their own different system resources, the resource
contention might be less. One case is : all subplans using different
disks. Second case is : some of the subplans may be using a foreign
scan, so it would start using foreign server resources sooner. These
benefits apply when the Gather max workers count is not sufficient for
running all the subplans in their full capacity. If they are
sufficient, then the workers will be distributed over the subplans
using both the logics. Just the order of assignments of workers to
subplans will be different.

Also, I don't see a disadvantage if we follow the logic of Goal B.

BTW, Right now, the patch does not consider non-partial paths for a
child which has partial paths. Do we know, for sure, that a path
containing partial paths for a child, which has it, is always going to
be cheaper than the one which includes non-partial path. If not,
should we build another paths which contains non-partial paths for all
child relations. This sounds like a 0/1 knapsack problem.

I didn't quite get this. We do create a non-partial Append path using
non-partial child paths anyways.

Here are some review comments

I will handle the other comments, but first, just a quick response to
some important ones :

6. By looking at parallel_worker field of a path, we can say whether it's
partial or not. We probably do not require to maintain a bitmap for that at in
the Append path. The bitmap can be constructed, if required, at the time of
creating the partial append plan. The reason to take this small step is 1. we
want to minimize our work at the time of creating paths, 2. while freeing a
path in add_path, we don't free the internal structures, in this case the
Bitmap, which will waste memory if the path is not chosen while planning.

Let me try keeping the per-subplan max_worker info in Append path
itself, like I mentioned above. If that works, the bitmap will be
replaced by max_worker field. In case of non-partial subpath,
max_worker will be 1. (this is the same info kept in AppendState node
in the patch, but now we might need to keep it in Append path node as
well).

It will be better if we can fetch that information from each subpath
when creating the plan. As I have explained before, a path is minimal
structure, which should be easily disposable, when throwing away the
path.

Now in the v2 patch, we store per-subplan worker count. But still, we
cannot use the path->parallel_workers to determine whether it's a
partial path. This is because even for a non-partial path, it seems
the parallel_workers can be non-zero. For e.g., in
create_subqueryscan_path(), it sets path->parallel_workers to
subpath->parallel_workers. But this path is added as a non-partial
path. So we need a separate info as to which of the subpaths in Append
path are partial subpaths. So in the v2 patch, I continued to use
Bitmapset in AppendPath. But in Append plan node, number of workers is
calculated using this bitmapset. Check the new function
get_append_num_workers().

7. If we consider 6, we don't need concat_append_subpaths(),

As explained above, I have kept the BitmapSet for AppendPath.

but still here are
some comments about that function. Instead of accepting two separate arguments
childpaths and child_partial_subpaths_set, which need to be in sync, we can
just pass the path which contains both of those. In the same following code may
be optimized by adding a utility function to Bitmapset, which advances
all members
by given offset and using that function with bms_union() to merge the
bitmapset e.g.
bms_union(*partial_subpaths_set,
bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
if (partial_subpaths_set)

I will get back on this after more thought.

12. cost_append() essentially adds costs of all the subpaths and then divides
by parallel_divisor. This might work if all the subpaths are partial paths. But
for the subpaths which are not partial, a single worker will incur the whole
cost of that subpath. Hence just dividing all the total cost doesn't seem the
right thing to do. We should apply different logic for costing non-partial
subpaths and partial subpaths.

WIth the current partial path costing infrastructure, it is assumed
that a partial path node should return the average per-worker cost.
Hence, I thought it would be best to do it in a similar way for
Append. But let me think if we can do something. With the current
parallelism costing infrastructure, I am not sure though.

The current parallel mechanism is in sync with that costing. Each
worker is supposed to take the same burden, hence the same (average)
cost. But it will change when a single worker has to scan an entire
child relation and different child relations have different sizes.

I gave more thought on this. Considering each subplan has different
number of workers, I think it makes sense to calculate average
per-worker cost even in parallel Append. In case of non-partial
subplan, a single worker will execute it, but it will next choose
another subplan. So on average each worker is going to process the
same number of rows, and also the same amount of CPU. And that amount
of CPU cost and rows cost should be calculated by taking the total
count and dividing it by number of workers (parallel_divsor actually).
Here are some review comments
1. struct ParallelAppendDescData is being used at other places. The declaration
style doesn't seem to be very common in the code or in the directory where the
file is located.
+struct ParallelAppendDescData
+{
+    slock_t        pa_mutex;        /* mutual exclusion to choose
next subplan */
+    parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
+};
Defining it like
typdef struct ParallelAppendDescData
{
slock_t        pa_mutex;        /* mutual exclusion to choose next
subplan */
parallel_append_info pa_info[FLEXIBLE_ARRAY_MEMBER];
};
will make its use handy. Instead of struct ParallelAppendDescData, you will
need to use just ParallelAppendDescData. May be we want to rename
parallel_append_info as ParallelAppendInfo and change the style to match other
declarations.
2. The comment below refers to the constant which it describes, which looks
odd. May be it should be worded as "A special value of
AppendState::as_whichplan, to indicate no plans left to be executed.". Also
using INVALID for "no plans left ..." seems to be a misnomer.
/*
* For Parallel Append, AppendState::as_whichplan can have PA_INVALID_PLAN
* value, which indicates there are no plans left to be executed.
*/
#define PA_INVALID_PLAN -1

3. The sentence "We have got NULL", looks odd. Probably we don't need it as
it's clear from the code above that this code deals with the case when the
current subplan didn't return any row.
/*
* We have got NULL. There might be other workers still processing the
* last chunk of rows for this same node, but there's no point for new
* workers to run this node, so mark this node as finished.
*/
4. In the same comment, I guess, the word "node" refers to "subnode" and not
the node pointed by variable "node". May be you want to use word "subplan"
here.

4. set_finished()'s prologue has different indentation compared to other
functions in the file.

5. Multilevel comment starts with an empty line.
+ /* Keep track of the node with the least workers so far. For the very
Done 1. to 5. above, as per your suggestions.

9. Shouldn't this funciton return double?
int
get_parallel_divisor(int parallel_workers)

v2 patch is rebased on latest master branch, which already contains
this function returning double.
10. We should probably move the parallel_safe calculation out of cost_append().
+            path->parallel_safe = path->parallel_safe &&
+                                  subpath->parallel_safe;
11. This check shouldn't be part of cost_append().
+            /* All child paths must have same parameterization */
+            Assert(bms_equal(PATH_REQ_OUTER(subpath), required_outer));
Yet to handle the above ones.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Amit Khandekar (#7)

Re: Parallel Append implementation

On Mon, Feb 6, 2017 at 11:06 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Ashutosh Bapat <ashutosh.bapat@enterprisedb.com> wrote:

We may want to think about a third goal: preventing too many workers
from executing the same plan. As per comment in get_parallel_divisor()
we do not see any benefit if more than 4 workers execute the same
node. So, an append node can distribute more than 4 worker nodes
equally among the available subplans. It might be better to do that as
a separate patch.

I think that comment is for calculating leader contribution. It does
not say that 4 workers is too many workers in general.

But yes, I agree, and I have it in mind as the next improvement.
Basically, it does not make sense to give more than 3 workers to a
subplan when parallel_workers for that subplan are 3. For e.g., if
gather max workers is 10, and we have 2 Append subplans s1 and s2 with
parallel workers 3 and 5 respectively. Then, with the current patch,
it will distribute 4 workers to each of these workers. What we should
do is : once both of the subplans get 3 workers each, we should give
the 7th and 8th worker to s2.

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.

Attached is v2 patch, which implements above. Now Append plan node
stores a list of per-subplan max worker count, rather than the
Bitmapset. But still Bitmapset turned out to be necessary for
AppendPath. More details are in the subsequent comments.

Goal A : Allow non-partial subpaths in Partial Append.
Goal B : Distribute workers across the Append subplans.
Both of these require some kind of synchronization while choosing the
next subplan. So, goal B is achieved by doing all the synchronization
stuff. And implementation of goal A requires that goal B is
implemented. So there is a dependency between these two goals. While
implementing goal B, we should keep in mind that it should also work
for goal A; it does not make sense later changing the synchronization
logic in goal A.

I am ok with splitting the patch into 2 patches :
a) changes required for goal A
b) changes required for goal B.
But I think we should split it only when we are ready to commit them
(commit for B, immediately followed by commit for A). Until then, we
should consider both of these together because they are interconnected
as explained above.

For B, we need to know, how much gain that brings and in which cases.
If that gain is not worth the complexity added, we may have to defer
Goal B. Goal A would certainly be useful since it will improve
performance of the targetted cases. The synchronization required for
Goal A is simpler than that of B and thus if we choose to implement
only A, we can live with a simpler synchronization.

For Goal A , the logic for a worker synchronously choosing a subplan will be :
Go the next subplan. If that subplan has not already assigned max
workers, choose this subplan, otherwise, go the next subplan, and so
on.

Right, at a given time, we have to remember only the next plan to
assign worker to. That's simpler than remembering the number of
workers for each subplan and updating those concurrently. That's why I
am saying synchronization for A is simpler than that of B.

For Goal B , the logic will be :
Among the subplans which are yet to achieve max workers, choose the
subplan with the minimum number of workers currently assigned.

I don't think there is a significant difference between the complexity
of the above two algorithms. So I think here the complexity does not
look like a factor based on which we can choose the particular logic.
We should choose the logic which has more potential for benefits. The
logic for goal B will work for goal A as well. And secondly, if the
subplans are using their own different system resources, the resource
contention might be less. One case is : all subplans using different
disks. Second case is : some of the subplans may be using a foreign
scan, so it would start using foreign server resources sooner. These
benefits apply when the Gather max workers count is not sufficient for
running all the subplans in their full capacity. If they are
sufficient, then the workers will be distributed over the subplans
using both the logics. Just the order of assignments of workers to
subplans will be different.

Also, I don't see a disadvantage if we follow the logic of Goal B.

Do we have any performance measurements where we see that Goal B
performs better than Goal A, in such a situation? Do we have any
performance measurement comparing these two approaches in other
situations. If implementation for Goal B beats that of Goal A always,
we can certainly implement it directly. But it may not. Also,
separating patches for Goal A and Goal B might make reviews easier.

BTW, Right now, the patch does not consider non-partial paths for a
child which has partial paths. Do we know, for sure, that a path
containing partial paths for a child, which has it, is always going to
be cheaper than the one which includes non-partial path. If not,
should we build another paths which contains non-partial paths for all
child relations. This sounds like a 0/1 knapsack problem.

I didn't quite get this. We do create a non-partial Append path using
non-partial child paths anyways.

Let's say a given child-relation has both partial and non-partial
paths, your approach would always pick up a partial path. But now that
parallel append can handle non-partial paths as well, it may happen
that picking up non-partial path instead of partial one when both are
available gives an overall better performance. Have we ruled out that
possibility.

Here are some review comments

I will handle the other comments, but first, just a quick response to
some important ones :

6. By looking at parallel_worker field of a path, we can say whether it's
partial or not. We probably do not require to maintain a bitmap for that at in
the Append path. The bitmap can be constructed, if required, at the time of
creating the partial append plan. The reason to take this small step is 1. we
want to minimize our work at the time of creating paths, 2. while freeing a
path in add_path, we don't free the internal structures, in this case the
Bitmap, which will waste memory if the path is not chosen while planning.

Let me try keeping the per-subplan max_worker info in Append path
itself, like I mentioned above. If that works, the bitmap will be
replaced by max_worker field. In case of non-partial subpath,
max_worker will be 1. (this is the same info kept in AppendState node
in the patch, but now we might need to keep it in Append path node as
well).

It will be better if we can fetch that information from each subpath
when creating the plan. As I have explained before, a path is minimal
structure, which should be easily disposable, when throwing away the
path.

Now in the v2 patch, we store per-subplan worker count. But still, we
cannot use the path->parallel_workers to determine whether it's a
partial path. This is because even for a non-partial path, it seems
the parallel_workers can be non-zero. For e.g., in
create_subqueryscan_path(), it sets path->parallel_workers to
subpath->parallel_workers. But this path is added as a non-partial
path. So we need a separate info as to which of the subpaths in Append
path are partial subpaths. So in the v2 patch, I continued to use
Bitmapset in AppendPath. But in Append plan node, number of workers is
calculated using this bitmapset. Check the new function
get_append_num_workers().

If the subpath from childrel->partial_pathlist, then we set the
corresponding bit in the bitmap. Now we can infer that for any path if
that path is found in path->parent->partial_pathlist. Since the code
always chooses the first partial path, the search in partial_pathlist
should not affect performance. So, we can avoid maintaining a bitmap
in the path and keep accumulating it when collapsing append paths.

7. If we consider 6, we don't need concat_append_subpaths(),

As explained above, I have kept the BitmapSet for AppendPath.

but still here are
some comments about that function. Instead of accepting two separate arguments
childpaths and child_partial_subpaths_set, which need to be in sync, we can
just pass the path which contains both of those. In the same following code may
be optimized by adding a utility function to Bitmapset, which advances
all members
by given offset and using that function with bms_union() to merge the
bitmapset e.g.
bms_union(*partial_subpaths_set,
bms_advance_members(bms_copy(child_partial_subpaths_set), append_subpath_len));
if (partial_subpaths_set)

I will get back on this after more thought.

Another possibility, you could use a loop like offset_relid_set(),
using bms_next_member(). That way we could combine the for loop and
bms_is_member() call into a loop over bms_next_member().

12. cost_append() essentially adds costs of all the subpaths and then divides
by parallel_divisor. This might work if all the subpaths are partial paths. But
for the subpaths which are not partial, a single worker will incur the whole
cost of that subpath. Hence just dividing all the total cost doesn't seem the
right thing to do. We should apply different logic for costing non-partial
subpaths and partial subpaths.

WIth the current partial path costing infrastructure, it is assumed
that a partial path node should return the average per-worker cost.
Hence, I thought it would be best to do it in a similar way for
Append. But let me think if we can do something. With the current
parallelism costing infrastructure, I am not sure though.

The current parallel mechanism is in sync with that costing. Each
worker is supposed to take the same burden, hence the same (average)
cost. But it will change when a single worker has to scan an entire
child relation and different child relations have different sizes.

I gave more thought on this. Considering each subplan has different
number of workers, I think it makes sense to calculate average
per-worker cost even in parallel Append. In case of non-partial
subplan, a single worker will execute it, but it will next choose
another subplan. So on average each worker is going to process the
same number of rows, and also the same amount of CPU. And that amount
of CPU cost and rows cost should be calculated by taking the total
count and dividing it by number of workers (parallel_divsor actually).

That's not entirely true. Consider N child relations with chosen paths
with costs C1, C2, ... CN which are very very different. If there are
N workers, the total cost should correspond to the highest of the
costs of subpaths, since no worker will execute more than one plan.
The unfortunate worker which executes the costliest path would take
the longest time. The cost of parallel append should reflect that. The
patch does not make any attempt to distribute workers based on the
actual load, so such skews should be considered into costing. I don't
think we can do anything to the condition I explained.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#7)

Re: Parallel Append implementation

On Mon, Feb 6, 2017 at 12:36 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.

Attached is v2 patch, which implements above. Now Append plan node
stores a list of per-subplan max worker count, rather than the
Bitmapset. But still Bitmapset turned out to be necessary for
AppendPath. More details are in the subsequent comments.

Keep in mind that, for a non-partial path, the cap of 1 worker for
that subplan is a hard limit. Anything more will break the world.
But for a partial plan, the limit -- whether 1 or otherwise -- is a
soft limit. It may not help much to route more workers to that node,
and conceivably it could even hurt, but it shouldn't yield any
incorrect result. I'm not sure it's a good idea to conflate those two
things. For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high. Had those two tables been
a single unpartitioned table, I would have picked 4 or 5 workers, not
7. On the other hand, if I pick parallel_workers of 4 or 5 for the
Parallel Append, and I finish with the larger table first, I think I
might as well throw all 4 of those workers at the smaller table even
though it would normally have only used 3 workers. Having the extra
1-2 workers exist does not seem better.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11

robertmhaas@gmail.com

over 9 years ago

In reply to: Robert Haas (#10)

Re: Parallel Append implementation

On Tue, Feb 14, 2017 at 12:05 PM, Robert Haas <robertmhaas@gmail.com> wrote:

Having the extra
1-2 workers exist does not seem better.

Err, exit, not exist.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Robert Haas (#10)

Re: Parallel Append implementation

On 14 February 2017 at 22:35, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Feb 6, 2017 at 12:36 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

Now that I think of that, I think for implementing above, we need to
keep track of per-subplan max_workers in the Append path; and with
that, the bitmap will be redundant. Instead, it can be replaced with
max_workers. Let me check if it is easy to do that. We don't want to
have the bitmap if we are sure it would be replaced by some other data
structure.

Attached is v2 patch, which implements above. Now Append plan node
stores a list of per-subplan max worker count, rather than the
Bitmapset. But still Bitmapset turned out to be necessary for
AppendPath. More details are in the subsequent comments.

Keep in mind that, for a non-partial path, the cap of 1 worker for
that subplan is a hard limit. Anything more will break the world.
But for a partial plan, the limit -- whether 1 or otherwise -- is a
soft limit. It may not help much to route more workers to that node,
and conceivably it could even hurt, but it shouldn't yield any
incorrect result. I'm not sure it's a good idea to conflate those two
things.

Yes, the logic that I used in the patch assumes that
"Path->parallel_workers field not only suggests how many workers to
allocate, but also prevents allocation of too many workers for that
path". For seqscan path, this field is calculated based on the
relation pages count. I believe the theory is that, too many workers
might even slow down the parallel scan. And the same theory would be
applied for calculating other types of low-level paths like index
scan.

The only reason I combined the soft limit and the hard limit is
because it is not necessary to have two different fields. But of
course this is again under the assumption that allocating more than
parallel_workers would never improve the speed, in fact it can even
slow it down.

Do we have such a case currently where the actual number of workers
launched turns out to be *more* than Path->parallel_workers ?

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high. Had those two tables been
a single unpartitioned table, I would have picked 4 or 5 workers, not
7. On the other hand, if I pick parallel_workers of 4 or 5 for the
Parallel Append, and I finish with the larger table first, I think I
might as well throw all 4 of those workers at the smaller table even
though it would normally have only used 3 workers.

Having the extra 1-2 workers exit does not seem better.

It is here, where I didn't understand exactly why would we want to
assign these extra workers to a subplan which tells use that it is
already being run by 'parallel_workers' number of workers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#12)

Re: Parallel Append implementation

On 14 February 2017 at 22:35, Robert Haas <robertmhaas@gmail.com> wrote:

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high.

In the patch, in such case, 7 workers are indeed selected for Parallel
Append path, so that both the subplans are able to execute in parallel
with their full worker capacity. Are you suggesting that we should not
?

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#12)

Re: Parallel Append implementation

On Wed, Feb 15, 2017 at 2:33 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

The only reason I combined the soft limit and the hard limit is
because it is not necessary to have two different fields. But of
course this is again under the assumption that allocating more than
parallel_workers would never improve the speed, in fact it can even
slow it down.

That could be true in extreme cases, but in general I think it's probably false.

Do we have such a case currently where the actual number of workers
launched turns out to be *more* than Path->parallel_workers ?

No.

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high. Had those two tables been
a single unpartitioned table, I would have picked 4 or 5 workers, not
7. On the other hand, if I pick parallel_workers of 4 or 5 for the
Parallel Append, and I finish with the larger table first, I think I
might as well throw all 4 of those workers at the smaller table even
though it would normally have only used 3 workers.

Having the extra 1-2 workers exit does not seem better.

It is here, where I didn't understand exactly why would we want to
assign these extra workers to a subplan which tells use that it is
already being run by 'parallel_workers' number of workers.

The decision to use fewer workers for a smaller scan isn't really
because we think that using more workers will cause a regression.
It's because we think it may not help very much, and because it's not
worth firing up a ton of workers for a relatively small scan given
that workers are a limited resource. I think once we've got a bunch
of workers started, we might as well try to use them.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#13)

Re: Parallel Append implementation

On Wed, Feb 15, 2017 at 4:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 14 February 2017 at 22:35, Robert Haas <robertmhaas@gmail.com> wrote:

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high.

In the patch, in such case, 7 workers are indeed selected for Parallel
Append path, so that both the subplans are able to execute in parallel
with their full worker capacity. Are you suggesting that we should not
?

Absolutely. I think that's going to be way too many workers. Imagine
that there are 100 child tables and each one is big enough to qualify
for 2 or 3 workers. No matter what value the user has selected for
max_parallel_workers_per_gather, they should not get a scan involving
200 workers.

What I was thinking about is something like this:

1. First, take the maximum parallel_workers value from among all the children.

2. Second, compute log2(num_children)+1 and round up. So, for 1
child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
for 9-16 children, 5, and so on.

3. Use as the number of parallel workers for the children the maximum
of the value computed in step 1 and the value computed in step 2.

With this approach, a plan with 100 children qualifies for 8 parallel
workers (unless one of the children individually qualifies for some
larger number, or unless max_parallel_workers_per_gather is set to a
smaller value). That seems fairly reasonable to me.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Robert Haas (#15)

Re: Parallel Append implementation

On Wed, Feb 15, 2017 at 6:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 15, 2017 at 4:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 14 February 2017 at 22:35, Robert Haas <robertmhaas@gmail.com> wrote:

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high.

In the patch, in such case, 7 workers are indeed selected for Parallel
Append path, so that both the subplans are able to execute in parallel
with their full worker capacity. Are you suggesting that we should not
?

Absolutely. I think that's going to be way too many workers. Imagine
that there are 100 child tables and each one is big enough to qualify
for 2 or 3 workers. No matter what value the user has selected for
max_parallel_workers_per_gather, they should not get a scan involving
200 workers.

If the user is ready throw 200 workers and if the subplans can use
them to speed up the query 200 times (obviously I am exaggerating),
why not to use those? When the user set
max_parallel_workers_per_gather to that high a number, he meant it to
be used by a gather, and that's what we should be doing.

What I was thinking about is something like this:

1. First, take the maximum parallel_workers value from among all the children.

2. Second, compute log2(num_children)+1 and round up. So, for 1
child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
for 9-16 children, 5, and so on.

Can you please explain the rationale behind this maths?

3. Use as the number of parallel workers for the children the maximum
of the value computed in step 1 and the value computed in step 2.

With this approach, a plan with 100 children qualifies for 8 parallel
workers (unless one of the children individually qualifies for some
larger number, or unless max_parallel_workers_per_gather is set to a
smaller value). That seems fairly reasonable to me.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Robert Haas (#15)

Re: Parallel Append implementation

On 15 February 2017 at 18:40, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 15, 2017 at 4:43 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

On 14 February 2017 at 22:35, Robert Haas <robertmhaas@gmail.com> wrote:

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high.

In the patch, in such case, 7 workers are indeed selected for Parallel
Append path, so that both the subplans are able to execute in parallel
with their full worker capacity. Are you suggesting that we should not
?

Absolutely. I think that's going to be way too many workers. Imagine
that there are 100 child tables and each one is big enough to qualify
for 2 or 3 workers. No matter what value the user has selected for
max_parallel_workers_per_gather, they should not get a scan involving
200 workers.

What I was thinking about is something like this:

1. First, take the maximum parallel_workers value from among all the children.

2. Second, compute log2(num_children)+1 and round up. So, for 1
child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
for 9-16 children, 5, and so on.

3. Use as the number of parallel workers for the children the maximum
of the value computed in step 1 and the value computed in step 2.

Ah, now that I closely look at compute_parallel_worker(), I see what
you are getting at.

For plain unpartitioned table, parallel_workers is calculated as
roughly equal to log(num_pages) (actually it is log3). So if the table
size is n, the workers will be log(n). So if it is partitioned into p
partitions of size n/p each, still the number of workers should be
log(n). Whereas, in the patch, it is calculated as (total of all the
child workers) i.e. n * log(n/p) for this case. But log(n) != p *
log(x/p). For e.g. log(1000) is much less than log(300) + log(300) +
log(300).

That means, the way it is calculated in the patch turns out to be much
larger than if it were calculated using log(total of sizes of all
children). So I think for the step 2 above, log(total_rel_size)
formula seems to be appropriate. What do you think ? For
compute_parallel_worker(), it is actually log3 by the way.

BTW this formula is just an extension of how parallel_workers is
calculated for an unpartitioned table.

For example, suppose that I have a scan of two children, one
of which has parallel_workers of 4, and the other of which has
parallel_workers of 3. If I pick parallel_workers of 7 for the
Parallel Append, that's probably too high. Had those two tables been
a single unpartitioned table, I would have picked 4 or 5 workers, not
7. On the other hand, if I pick parallel_workers of 4 or 5 for the
Parallel Append, and I finish with the larger table first, I think I
might as well throw all 4 of those workers at the smaller table even
though it would normally have only used 3 workers.

Having the extra 1-2 workers exit does not seem better.

It is here, where I didn't understand exactly why would we want to
assign these extra workers to a subplan which tells use that it is
already being run by 'parallel_workers' number of workers.

The decision to use fewer workers for a smaller scan isn't really
because we think that using more workers will cause a regression.
It's because we think it may not help very much, and because it's not
worth firing up a ton of workers for a relatively small scan given
that workers are a limited resource. I think once we've got a bunch
of workers started, we might as well try to use them.

One possible side-effect I see due to this is : Other sessions might
not get a fair share of workers due to this. But again, there might be
counter argument that, because Append is now focussing all the workers
on a last subplan, it may finish faster, and release *all* of its
workers earlier.

BTW, there is going to be some logic change in the choose-next-subplan
algorithm if we consider giving extra workers to subplans.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#18

robertmhaas@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#16)

Re: Parallel Append implementation

On Wed, Feb 15, 2017 at 11:15 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

If the user is ready throw 200 workers and if the subplans can use
them to speed up the query 200 times (obviously I am exaggerating),
why not to use those? When the user set
max_parallel_workers_per_gather to that high a number, he meant it to
be used by a gather, and that's what we should be doing.

The reason is because of what Amit Khandekar wrote in his email -- you
get a result with a partitioned table that is wildly inconsistent with
the result you get for an unpartitioned table. You could equally well
argue that if the user sets max_parallel_workers_per_gather to 200,
and there's a parallel sequential scan of an 8MB table to be
performed, we ought to use all 200 workers for that. But the planner
in fact estimates a much lesser number of workers, because using 200
workers for that task wastes a lot of resources for no real
performance benefit. If you partition that 8MB table into 100 tables
that are each 80kB, that shouldn't radically increase the number of
workers that get used.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#17)

Re: Parallel Append implementation

On Thu, Feb 16, 2017 at 1:34 AM, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:

What I was thinking about is something like this:

1. First, take the maximum parallel_workers value from among all the children.

2. Second, compute log2(num_children)+1 and round up. So, for 1
child, 1; for 2 children, 2; for 3-4 children, 3; for 5-8 children, 4;
for 9-16 children, 5, and so on.

3. Use as the number of parallel workers for the children the maximum
of the value computed in step 1 and the value computed in step 2.

Ah, now that I closely look at compute_parallel_worker(), I see what
you are getting at.

For plain unpartitioned table, parallel_workers is calculated as
roughly equal to log(num_pages) (actually it is log3). So if the table
size is n, the workers will be log(n). So if it is partitioned into p
partitions of size n/p each, still the number of workers should be
log(n). Whereas, in the patch, it is calculated as (total of all the
child workers) i.e. n * log(n/p) for this case. But log(n) != p *
log(x/p). For e.g. log(1000) is much less than log(300) + log(300) +
log(300).

That means, the way it is calculated in the patch turns out to be much
larger than if it were calculated using log(total of sizes of all
children). So I think for the step 2 above, log(total_rel_size)
formula seems to be appropriate. What do you think ? For
compute_parallel_worker(), it is actually log3 by the way.

BTW this formula is just an extension of how parallel_workers is
calculated for an unpartitioned table.

log(total_rel_size) would be a reasonable way to estimate workers when
we're scanning an inheritance hierarchy, but I'm hoping Parallel
Append is also going to apply to UNION ALL queries, where there's no
concept of the total rel size. For that we need something else, which
is why the algorithm that I proposed upthread doesn't rely on it.

The decision to use fewer workers for a smaller scan isn't really
because we think that using more workers will cause a regression.
It's because we think it may not help very much, and because it's not
worth firing up a ton of workers for a relatively small scan given
that workers are a limited resource. I think once we've got a bunch
of workers started, we might as well try to use them.

One possible side-effect I see due to this is : Other sessions might
not get a fair share of workers due to this. But again, there might be
counter argument that, because Append is now focussing all the workers
on a last subplan, it may finish faster, and release *all* of its
workers earlier.

Right. I think in general it's pretty clear that there are possible
fairness problems with parallel query. The first process that comes
along seizes however many workers it thinks it should use, and
everybody else can use whatever (if anything) is left. In the long
run, I think it would be cool to have a system where workers can leave
one parallel query in progress and join a different one (or exit and
spawn a new worker to join a different one), automatically rebalancing
as the number of parallel queries in flight fluctuates. But that's
clearly way beyond anything we can do right now. I think we should
assume that any parallel workers our process has obtained are ours to
use for the duration of the query, and use them as best we can. Note
that even if the Parallel Append tells one of the workers that there
are no more tuples and it should go away, some higher level of the
query plan could make a different choice anyway; there might be
another Append elsewhere in the plan tree.

BTW, there is going to be some logic change in the choose-next-subplan
algorithm if we consider giving extra workers to subplans.

I'm not sure that it's going to be useful to make this logic very
complicated. I think the most important thing is to give 1 worker to
each plan before we give a second worker to any plan. In general I
think it's sufficient to assign a worker that becomes available to the
subplan with the fewest number of workers (or one of them, if there's
a tie) without worrying too much about the target number of workers
for that subplan.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Robert Haas (#18)

Re: Parallel Append implementation

On Thu, Feb 16, 2017 at 8:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Feb 15, 2017 at 11:15 PM, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:

If the user is ready throw 200 workers and if the subplans can use
them to speed up the query 200 times (obviously I am exaggerating),
why not to use those? When the user set
max_parallel_workers_per_gather to that high a number, he meant it to
be used by a gather, and that's what we should be doing.

The reason is because of what Amit Khandekar wrote in his email -- you
get a result with a partitioned table that is wildly inconsistent with
the result you get for an unpartitioned table. You could equally well
argue that if the user sets max_parallel_workers_per_gather to 200,
and there's a parallel sequential scan of an 8MB table to be
performed, we ought to use all 200 workers for that. But the planner
in fact estimates a much lesser number of workers, because using 200
workers for that task wastes a lot of resources for no real
performance benefit. If you partition that 8MB table into 100 tables
that are each 80kB, that shouldn't radically increase the number of
workers that get used.

That's true for a partitioned table, but not necessarily for every
append relation. Amit's patch is generic for all append relations. If
the child plans are joins or subquery segments of set operations, I
doubt if the same logic works. It may be better if we throw as many
workers (or some function "summing" those up) as specified by those
subplans. I guess, we have to use different logic for append relations
which are base relations and append relations which are not base
relations.

--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#21

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Robert Haas (#19)

#22

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#9)

#23

amitdkhan.pg@gmail.com

over 9 years ago

In reply to: Robert Haas (#19)

#24

robertmhaas@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#20)

#25

robertmhaas@gmail.com

over 9 years ago

In reply to: Amit Khandekar (#21)

#26

ashutosh.bapat@enterprisedb.com

over 9 years ago

In reply to: Robert Haas (#24)

#27

robertmhaas@gmail.com

over 9 years ago

In reply to: Ashutosh Bapat (#26)

#28

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Robert Haas (#25)

#29

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#28)

#30

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Robert Haas (#29)

#31

robertmhaas@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#30)

#32

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Robert Haas (#31)

#33

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#32)

#34

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Amit Khandekar (#33)

#35

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Ashutosh Bapat (#34)

#36

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#35)

#37

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Amit Khandekar (#36)

#38

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#37)

#39

Tels

nospam-pg-abuse@bloodgate.com

about 9 years ago

In reply to: Amit Khandekar (#36)

#40

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Robert Haas (#29)

#41

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#33)

#42

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#40)

#43

robertmhaas@gmail.com

about 9 years ago

In reply to: Tels (#39)

#44

Tels

nospam-pg-abuse@bloodgate.com

about 9 years ago

In reply to: Robert Haas (#43)

#45

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Robert Haas (#41)

#46

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Tels (#44)

#47

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#45)

#48

robertmhaas@gmail.com

about 9 years ago

In reply to: Robert Haas (#47)

#49

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#45)

#50

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Amit Khandekar (#49)

#51

robertmhaas@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#50)

#52

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#49)

#53

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Robert Haas (#52)

#54

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Ashutosh Bapat (#50)

#55

Peter Geoghegan

pg@bowt.ie

about 9 years ago

In reply to: Amit Khandekar (#53)

#56

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#53)

#57

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#53)

#58

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Robert Haas (#57)

#59

robertmhaas@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#58)

#60

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Robert Haas (#59)

#61

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#60)

#62

Rajkumar Raghuwanshi

rajkumar.raghuwanshi@enterprisedb.com

about 9 years ago

In reply to: Amit Khandekar (#61)

#63

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Rajkumar Raghuwanshi (#62)

#64

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#61)

#65

andres@anarazel.de

about 9 years ago

In reply to: Amit Khandekar (#64)

#66

robertmhaas@gmail.com

about 9 years ago

In reply to: Andres Freund (#65)

#67

andres@anarazel.de

about 9 years ago

In reply to: Robert Haas (#66)

#68

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Andres Freund (#67)

#69

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Amit Khandekar (#68)

#70

robertmhaas@gmail.com

about 9 years ago

In reply to: Andres Freund (#67)

#71

robertmhaas@gmail.com

about 9 years ago

In reply to: Andres Freund (#65)

#72

andres@anarazel.de

about 9 years ago

In reply to: Robert Haas (#70)

#73

ashutosh.bapat@enterprisedb.com

about 9 years ago

In reply to: Andres Freund (#72)

#74

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Andres Freund (#72)

#75

robertmhaas@gmail.com

about 9 years ago

In reply to: Andres Freund (#72)

#76

andres@anarazel.de

about 9 years ago

In reply to: Amit Khandekar (#74)

#77

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Andres Freund (#76)

#78

andres@anarazel.de

about 9 years ago

In reply to: Amit Khandekar (#77)

#79

amitdkhan.pg@gmail.com

about 9 years ago

In reply to: Andres Freund (#78)

#80

rafia.sabih@enterprisedb.com

almost 9 years ago

In reply to: Amit Khandekar (#69)

#81

amitdkhan.pg@gmail.com

almost 9 years ago

In reply to: Rafia Sabih (#80)

#82

robertmhaas@gmail.com

almost 9 years ago

In reply to: Amit Khandekar (#81)

#83

amitdkhan.pg@gmail.com

almost 9 years ago

In reply to: Robert Haas (#82)

#84

rafia.sabih@enterprisedb.com

almost 9 years ago

In reply to: Amit Khandekar (#83)

#85

robertmhaas@gmail.com

almost 9 years ago

In reply to: Rafia Sabih (#84)

#86

amitdkhan.pg@gmail.com

almost 9 years ago

In reply to: Robert Haas (#85)

#87

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#86)

#88

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#87)

#89

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#87)

#90

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#88)

#91

rafia.sabih@enterprisedb.com

over 8 years ago

In reply to: Amit Khandekar (#87)

#92

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Rafia Sabih (#91)

#93

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#90)

#94

rafia.sabih@enterprisedb.com

over 8 years ago

In reply to: Amit Khandekar (#87)

#95

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#93)

#96

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#95)

#97

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#96)

#98

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Kapila (#97)

#99

amit.kapila16@gmail.com

over 8 years ago

In reply to: Robert Haas (#98)

#100

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#99)

#101

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#97)

#102

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#101)

#103

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#100)

#104

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Kapila (#103)

#105

amit.kapila16@gmail.com

over 8 years ago

In reply to: Robert Haas (#104)

#106

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#100)

#107

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Kapila (#105)

#108

amit.kapila16@gmail.com

over 8 years ago

In reply to: Robert Haas (#107)

#109

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Kapila (#108)

#110

amit.kapila16@gmail.com

over 8 years ago

In reply to: Robert Haas (#109)

#111

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#106)

#112

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Kapila (#110)

#113

amit.kapila16@gmail.com

over 8 years ago

In reply to: Robert Haas (#112)

#114

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#111)

#115

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#114)

#116

amit.kapila16@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#115)

#117

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Amit Kapila (#116)

#118

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#117)

#119

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Robert Haas (#118)

#120

robertmhaas@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#119)

#121

robertmhaas@gmail.com

over 8 years ago

In reply to: Robert Haas (#120)

#122

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Robert Haas (#121)

#123

rafia.sabih@enterprisedb.com

over 8 years ago

In reply to: Amit Khandekar (#122)

#124

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Rafia Sabih (#123)

#125

sulamul@gmail.com

over 8 years ago

In reply to: Amit Khandekar (#124)

#126

robertmhaas@gmail.com

over 8 years ago

In reply to: Amul Sul (#125)

#127

sulamul@gmail.com

over 8 years ago

In reply to: Robert Haas (#126)

#128

Rajkumar Raghuwanshi

rajkumar.raghuwanshi@enterprisedb.com

over 8 years ago

In reply to: Amul Sul (#127)

#129

sulamul@gmail.com

over 8 years ago

In reply to: Rajkumar Raghuwanshi (#128)

#130

rafia.sabih@enterprisedb.com

over 8 years ago

In reply to: Amul Sul (#125)

#131

Rajkumar Raghuwanshi

rajkumar.raghuwanshi@enterprisedb.com

over 8 years ago

In reply to: Amul Sul (#129)

#132

sulamul@gmail.com

over 8 years ago

In reply to: Rajkumar Raghuwanshi (#131)

#133

sulamul@gmail.com

over 8 years ago

In reply to: Amul Sul (#132)

#134

Michael Paquier

michael@paquier.xyz

over 8 years ago

In reply to: Amul Sul (#133)

#135

robertmhaas@gmail.com

over 8 years ago

In reply to: Amul Sul (#133)

#136

amitdkhan.pg@gmail.com

over 8 years ago

In reply to: Robert Haas (#135)

#137

Adrien Nayrat

adrien.nayrat@anayrat.info

about 8 years ago

In reply to: Robert Haas (#135)

#138

robertmhaas@gmail.com

about 8 years ago

In reply to: Adrien Nayrat (#137)

#139

thomas.munro@gmail.com

about 8 years ago

In reply to: Robert Haas (#138)

#140

robertmhaas@gmail.com

about 8 years ago

In reply to: Thomas Munro (#139)

#141

thomas.munro@gmail.com

about 8 years ago

In reply to: Robert Haas (#140)

#142

robertmhaas@gmail.com

about 8 years ago

In reply to: Thomas Munro (#141)

#143

thomas.munro@gmail.com

almost 8 years ago

In reply to: Robert Haas (#142)

#144

robertmhaas@gmail.com

almost 8 years ago

In reply to: Thomas Munro (#143)

#145

thomas.munro@gmail.com

almost 8 years ago

In reply to: Robert Haas (#144)

#146