Parallel Append can break run-time partition pruning
I had a report from the wilds that run-time partition pruning was not
working in certain cases.
After some investigation and obtaining a mockup of the actual case,
I discovered that the problem was down to accumulate_append_subpath()
hitting the case where it does not pull up a Parallel Append whose
first_partial_path index is > 0.
What's actually happening is that the plan is left with a nested
Append, and in this particular case, the top-level Append has only a
single subpath, so the code from 8edd0e794 (Suppress Append and
MergeAppend plan nodes that have a single child) causes the nested
Append to be pulled up to become the main Append. This breaks
run-time pruning since we only attach the pruning information
to the top-level Append.
The most simplified test case I can find to demonstrate this issue is:
create table list (a int, b int) partition by list(a);
create table list_12 partition of list for values in(1,2) partition by list(a);
create table list_12_1 partition of list_12 for values in(1);
create table list_12_2 partition of list_12 for values in(2);
insert into list select 2,0 from generate_series(1,1000000) x;
vacuum analyze list;
explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
-- force the 2nd subnode of the Append to be non-parallel.
alter table list_12_1 set (parallel_workers=0);
explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
The results of this in master are:
postgres=# explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
QUERY PLAN
---------------------------------------------------------------------------
Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Parallel Seq Scan on list_12_2 list_2 (never executed)
Filter: ((b > 0) AND (a = $0))
-> Parallel Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
Filter: ((b > 0) AND (a = $0))
(11 rows)
postgres=# alter table list_12_1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain (analyze on, costs off, timing off, summary off)
select * from list where a = (select 1) and b > 0;
QUERY PLAN
---------------------------------------------------------------------------
Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
Filter: ((b > 0) AND (a = $0))
-> Parallel Seq Scan on list_12_2 list_2 (actual rows=0 loops=3)
Filter: ((b > 0) AND (a = $0))
Rows Removed by Filter: 333333
(12 rows)
Notice that we don't get "(never executed)" for list_12_2 in the 2nd case.
I'm a bit divided on what the correct fix is. If I blame Parallel
Append for not trying hard enough to pull up the lower Append in
accumulate_append_subpath(), then clearly the parallel append code is
to blame. However, perhaps run-time pruning should be tagging on
PartitionPruneInfo to more than top-level Appends. Fixing the latter
case, code-wise is about as simple as removing the "rel->reloptkind ==
RELOPT_BASEREL &&" line from create_append_plan(). Certainly, if the
outer Append hadn't been a single subpath Append, then we wouldn't
have pulled up the lower-level Append, so perhaps we should be
run-time pruning lower-level ones too.
What do other people think?
(copying in Robert and Amit K due to their work on Parallel Append,
Tom as I seem to remember him complaining about
accumulate_append_subpath() at some point and Amit L because...
partitioning...)
David
On Wed, Apr 15, 2020 at 4:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
I'm a bit divided on what the correct fix is. If I blame Parallel
Append for not trying hard enough to pull up the lower Append in
accumulate_append_subpath(), then clearly the parallel append code is
to blame.
I spent some time trying to understand how Append parallelism works
and I am tempted to agree with you that there might be problems with
how accumulate_append_subpath() interacts with parallelism. Maybe it
would be better to disregard a non-parallel-aware partial Append if it
requires us to fail to flatten a child Append. I have attached
a PoC fix to show that. While a nested Append is not really a problem
in general, it appears to me that our run-time pruning code is not in a
position to work correctly with them, or at least not with how things
stand today...
However, perhaps run-time pruning should be attaching
PartitionPruneInfo to more than just top-level Appends. Fixing the
latter case, code-wise, is about as simple as removing the
"rel->reloptkind == RELOPT_BASEREL &&" line from
create_append_plan(). Certainly, if the
outer Append hadn't been a single subpath Append, then we wouldn't
have pulled up the lower-level Append, so perhaps we should be
run-time pruning lower-level ones too.
While looking at this, I observed that the PartitionPruneInfo of the
top-level Append (the one that later gets thrown out) contains bogus
information:
{PARTITIONPRUNEINFO
:prune_infos ((
{PARTITIONEDRELPRUNEINFO
:rtindex 1
:present_parts (b 0)
:nparts 1
:subplan_map 0
:subpart_map 1
One of these should be -1.
{PARTITIONEDRELPRUNEINFO
:rtindex 2
:present_parts (b)
:nparts 2
:subplan_map -1 -1
:subpart_map -1 -1
The subplan_map values are not correct, because the subpaths list that
would have been passed does not include the paths of lower-level
partitions, as the flattening didn't occur.
))
:other_subplans (b)
}
I guess the problem is that we let an Append be nested, but don't
account for that in how its parent Append's partitioned_rels list is
constructed. The top-level Append's partitioned_rels should not have
contained the sub-partitioned table's RTI if it got its own Append. Maybe
if we want to make run-time pruning work with nested Appends, we need
to fix how partitioned_rels are gathered.
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
Attachments:
avoid-partial-nested-append.patch
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 255f56b..b7144d9 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -104,7 +104,7 @@ static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
RelOptInfo *rel,
Relids required_outer);
-static void accumulate_append_subpath(Path *path,
+static bool accumulate_append_subpath(Path *path,
List **subpaths, List **special_subpaths);
static Path *get_singleton_append_subpath(Path *path);
static void set_dummy_rel_pathlist(RelOptInfo *rel);
@@ -1395,8 +1395,14 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
if (childrel->partial_pathlist != NIL)
{
cheapest_partial_path = linitial(childrel->partial_pathlist);
- accumulate_append_subpath(cheapest_partial_path,
- &partial_subpaths, NULL);
+
+ /*
+ * Don't allow partial Append if it prevents a child Append from
+ * being flattened.
+ */
+ if (!accumulate_append_subpath(cheapest_partial_path,
+ &partial_subpaths, NULL))
+ partial_subpaths_valid = false;
}
else
partial_subpaths_valid = false;
@@ -2037,8 +2043,10 @@ get_cheapest_parameterized_child_path(PlannerInfo *root, RelOptInfo *rel,
* children to subpaths and the rest to special_subpaths. If the latter is
* NULL, we don't flatten the path at all (unless it contains only partial
* paths).
+ *
+ * Returns whether able to flatten or not.
*/
-static void
+static bool
accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
{
if (IsA(path, AppendPath))
@@ -2048,7 +2056,7 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
if (!apath->path.parallel_aware || apath->first_partial_path == 0)
{
*subpaths = list_concat(*subpaths, apath->subpaths);
- return;
+ return true;
}
else if (special_subpaths != NULL)
{
@@ -2063,18 +2071,21 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
apath->first_partial_path);
*special_subpaths = list_concat(*special_subpaths,
new_special_subpaths);
- return;
+ return true;
}
+ else
+ return false;
}
else if (IsA(path, MergeAppendPath))
{
MergeAppendPath *mpath = (MergeAppendPath *) path;
*subpaths = list_concat(*subpaths, mpath->subpaths);
- return;
+ return true;
}
*subpaths = lappend(*subpaths, path);
+ return true;
}
/*
On Fri, 17 Apr 2020 at 19:08, Amit Langote <amitlangote09@gmail.com> wrote:
While looking at this, I observed that the PartitionPruneInfo of the
top-level Append (the one that later gets thrown out) contains bogus
information:
{PARTITIONPRUNEINFO
:prune_infos ((
{PARTITIONEDRELPRUNEINFO
:rtindex 1
:present_parts (b 0)
:nparts 1
:subplan_map 0
:subpart_map 1
One of these should be -1.
{PARTITIONEDRELPRUNEINFO
:rtindex 2
:present_parts (b)
:nparts 2
:subplan_map -1 -1
:subpart_map -1 -1
The subplan_map values are not correct, because the subpaths list that
would have been passed does not include the paths of lower-level
partitions, as the flattening didn't occur.
It's not great that we're generating that, but as far as I can see,
it's not going to cause any misbehaviour. It'll cause a small
slowdown in run-time pruning due to perhaps having to perform an
additional call to find_matching_subplans_recurse() during execution.
In this case, it'll never find any subnodes that match due to both
maps having all -1 element values.
Since f2343653f5, we're not using partitioned_rels for anything else,
so we should likely fix this so that we don't add the item to
partitioned_rels when we don't pullup the sub-Append. I think we
should hold off on fixing that until we decide if any adjustments need
to be made to the sub-Append pullup code.
David
On Fri, 17 Apr 2020 at 19:08, Amit Langote <amitlangote09@gmail.com> wrote:
On Wed, Apr 15, 2020 at 4:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
I'm a bit divided on what the correct fix is. If I blame Parallel
Append for not trying hard enough to pull up the lower Append in
accumulate_append_subpath(), then clearly the parallel append code is
to blame.
I spent some time trying to understand how Append parallelism works
and I am tempted to agree with you that there might be problems with
how accumulate_append_subpath() interacts with parallelism. Maybe it
would be better to disregard a non-parallel-aware partial Append if it
requires us to fail to flatten a child Append. I have attached
a PoC fix to show that. While a nested Append is not really a problem
in general, it appears to me that our run-time pruning code is not in a
position to work correctly with them, or at least not with how things
stand today...
Thanks for taking a look at this. I've now looked at this in more
detail and below is my understanding of what's going on:
It seems that, in this case, on the following line:
accumulate_append_subpath(cheapest_partial_path,
&partial_subpaths, NULL);
we don't manage to pull up the sub-Append because we pass a NULL pointer
for the final special_subpaths argument. This results in just taking
the child's Append path verbatim, i.e. a nested Append.
Later, when we do:
else if (nppath == NULL ||
(cheapest_partial_path != NULL &&
cheapest_partial_path->total_cost < nppath->total_cost))
{
/* Partial path is cheaper or the only option. */
Assert(cheapest_partial_path != NULL);
accumulate_append_subpath(cheapest_partial_path,
&pa_partial_subpaths,
&pa_nonpartial_subpaths);
we do pass a non-NULL special_subpaths argument to allow the
sub-Append to be pulled up.
So, now we have 2 paths, one with a nested Append and one with a
flattened Append. Both paths have the same cost, but because we call
add_partial_path() for the nested Append version first, the logic in
add_partial_path() accepts that path. The subsequent call of
add_partial_path(), the one for the non-nested Append, then rejects
that path because its total cost is too similar to one of the existing
partial paths. We just end up keeping the nested Append as the
cheapest partial path... That path, since in the example case it has
only a single subpath, is pulled up into the main Append by the logic
added in 8edd0e794.
I think you've realised this and that's why your PoC patch just
rejected the first path when it's unable to do the pullup. We'll get a
better path later when we allow mixed partial and non-partial paths.
(We'll never fail to do a pullup when calling
accumulate_append_subpath() for "nppath", since that's a non-parallel
path and accumulate_append_subpath() will always pull Append paths up
when they're not parallel aware.)
I wonder if the fix should be more something along the lines of trying
to merge things so that we only generate a single partial path. That way
we wouldn't be at the mercy of the logic in add_partial_path() to
accept or reject the path based on the order the paths are added.
David
Hi David,
On Tue, Apr 21, 2020 at 12:03 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 17 Apr 2020 at 19:08, Amit Langote <amitlangote09@gmail.com> wrote:
On Wed, Apr 15, 2020 at 4:18 PM David Rowley <dgrowleyml@gmail.com> wrote:
I'm a bit divided on what the correct fix is. If I blame Parallel
Append for not trying hard enough to pull up the lower Append in
accumulate_append_subpath(), then clearly the parallel append code is
to blame.
I spent some time trying to understand how Append parallelism works
and I am tempted to agree with you that there might be problems with
how accumulate_append_subpath() interacts with parallelism. Maybe it
would be better to disregard a non-parallel-aware partial Append if it
requires us to fail to flatten a child Append. I have attached
a PoC fix to show that. While a nested Append is not really a problem
in general, it appears to me that our run-time pruning code is not in a
position to work correctly with them, or at least not with how things
stand today...
Thanks for taking a look at this. I've now looked at this in more
detail and below is my understanding of what's going on:
It seems that, in this case, on the following line:
accumulate_append_subpath(cheapest_partial_path,
&partial_subpaths, NULL);
we don't manage to pull up the sub-Append because we pass a NULL pointer
for the final special_subpaths argument. This results in just taking
the child's Append path verbatim, i.e. a nested Append.
Later, when we do:
else if (nppath == NULL ||
(cheapest_partial_path != NULL &&
cheapest_partial_path->total_cost < nppath->total_cost))
{
/* Partial path is cheaper or the only option. */
Assert(cheapest_partial_path != NULL);
accumulate_append_subpath(cheapest_partial_path,
&pa_partial_subpaths,
&pa_nonpartial_subpaths);
we do pass a non-NULL special_subpaths argument to allow the
sub-Append to be pulled up.
So, now we have 2 paths, one with a nested Append and one with a
flattened Append. Both paths have the same cost, but because we call
add_partial_path() for the nested Append version first, the logic in
add_partial_path() accepts that path. The subsequent call of
add_partial_path(), the one for the non-nested Append, then rejects
that path because its total cost is too similar to one of the existing
partial paths. We just end up keeping the nested Append as the
cheapest partial path... That path, since in the example case it has
only a single subpath, is pulled up into the main Append by the logic
added in 8edd0e794.
I think you've realised this and that's why your PoC patch just
rejected the first path when it's unable to do the pullup.
Right.
We'll get a
better path later when we allow mixed partial and non-partial paths.
Yes, but only if parallel-aware Append is allowed (pa_subpaths_valid).
So it's possible that the Append may not participate in any
parallelism whatsoever if we reject partial Append on failing to fold
a child Append, which does somewhat suck.
(We'll never fail to do a pullup when calling
accumulate_append_subpath() for "nppath", since that's a non-parallel
path and accumulate_append_subpath() will always pull Append paths up
when they're not parallel aware.)
I wonder if the fix should be more something along the lines of trying
to merge things so that we only generate a single partial path. That way
we wouldn't be at the mercy of the logic in add_partial_path() to
accept or reject the path based on the order the paths are added.
So as things stand, a parallel-aware partial Append (Parallel Append)
path competes with a non-parallel partial Append path on cost grounds.
As far as I can see, it's only the latter that can contain among its
subpaths a (nested) Append, which can be problematic. Given that, our
choice between the two types of partial Append paths becomes based on
something that is not cost, but is that okay?
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Tue, 21 Apr 2020 at 15:03, David Rowley <dgrowleyml@gmail.com> wrote:
I wonder if the fix should be more something along the lines of trying
to merge things so that we only generate a single partial path. That way
we wouldn't be at the mercy of the logic in add_partial_path() to
accept or reject the path based on the order the paths are added.
I took a shot at doing things this way.
First, I'll recap the problem this is trying to solve:
add_paths_to_append_rel() attempts to create two separate partial
Append paths. I'll describe both of these below:
Path1: This path is generated regardless of whether Parallel Append is
enabled and contains the cheapest partial path from each child
relation. If parallel append is enabled, this will become a Parallel
Append. If it's not, then a non-parallel Append will be created
containing the list of partial subpaths. Here's an example from
select_parallel.out:
QUERY PLAN
--------------------------------------------------------------
Finalize Aggregate
-> Gather
Workers Planned: 1
-> Partial Aggregate
-> Append
-> Parallel Seq Scan on a_star a_star_1
-> Parallel Seq Scan on b_star a_star_2
-> Parallel Seq Scan on c_star a_star_3
-> Parallel Seq Scan on d_star a_star_4
-> Parallel Seq Scan on e_star a_star_5
-> Parallel Seq Scan on f_star a_star_6
Path2: We only ever consider this one when enable_parallel_append ==
true and the append rel's consider_parallel == true. When this path
is generated, it'll always be for a Parallel Append. This path may
contain a mix of partial paths for child rels and parallel_safe child
paths, the latter of which will each be visited by only a single worker.
The problem is that path1 does not pull up child Appends when the child
Append path contains a mix of partial and parallel safe paths (i.e. a
sub-path2, per above). Since we create path2 in addition to path1, the
costs come out the same even though path1 couldn't pull up the
sub-Append paths. Since the costs are the same, path1 is
preferred because it's added first; add_partial_path() just rejects path2
for being too similar in cost to the existing path1.
In the attached, I'm trying to solve this by creating only one partial
Append path in the first place. This path will always try to use the
cheapest partial path, or the cheapest parallel safe path, if parallel
append is allowed and that path is cheaper than the cheapest partial
path.
I believe the attached gives us what we want. Additionally, since
it should now always pull up the sub-Appends, there's no need to
consider adjusting partitioned_rels based on whether the pull-up
occurred or not. Those should now be right in all cases. This should
also fix the run-time pruning issue, since in my original test case
it'll pull up the sub-Append, which means that the code added in
8edd0e794 will no longer do anything with it, as the top-level Append
will never contain just 1 subpath.
I'm reasonably certain that this is correct, but I did find it a bit
mind-bending considering all the possible cases, so it could do with
some more eyes on it. I've not really done a final polish of the
comments yet. I'll do that if the patch is looking promising.
The output of the original test with the attached is as follows:
postgres=# explain (analyze on, costs off, timing off, summary off)
postgres-# select * from list where a = (select 1) and b > 0;
QUERY PLAN
---------------------------------------------------------------------------
Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Parallel Seq Scan on list_12_2 list_2 (never executed)
Filter: ((b > 0) AND (a = $0))
-> Parallel Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
Filter: ((b > 0) AND (a = $0))
(11 rows)
postgres=# -- force the 2nd subnode of the Append to be non-parallel.
postgres=# alter table list_12_1 set (parallel_workers=0);
ALTER TABLE
postgres=# explain (analyze on, costs off, timing off, summary off)
postgres-# select * from list where a = (select 1) and b > 0;
QUERY PLAN
--------------------------------------------------------------------
Gather (actual rows=0 loops=1)
Workers Planned: 2
Params Evaluated: $0
Workers Launched: 2
InitPlan 1 (returns $0)
-> Result (actual rows=1 loops=1)
-> Parallel Append (actual rows=0 loops=3)
-> Seq Scan on list_12_1 list_1 (actual rows=0 loops=1)
Filter: ((b > 0) AND (a = $0))
-> Parallel Seq Scan on list_12_2 list_2 (never executed)
Filter: ((b > 0) AND (a = $0))
(11 rows)
Notice we get "(never executed)" in both cases.
David
Attachments:
always_pullup_child_appends_to_top_level_append.patch
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 255f56b827..f332ac531f 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1299,17 +1299,13 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *partial_subpaths = NIL;
List *pa_partial_subpaths = NIL;
List *pa_nonpartial_subpaths = NIL;
- bool partial_subpaths_valid = true;
- bool pa_subpaths_valid;
+ bool pa_subpaths_valid = true;
List *all_child_pathkeys = NIL;
List *all_child_outers = NIL;
ListCell *l;
List *partitioned_rels = NIL;
double partial_rows = -1;
- /* If appropriate, consider parallel append */
- pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
-
/*
* AppendPath generated for partitioned tables must record the RT indexes
* of partitioned tables that are direct or indirect children of this
@@ -1366,7 +1362,6 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
{
RelOptInfo *childrel = lfirst(l);
ListCell *lcp;
- Path *cheapest_partial_path = NULL;
/*
* For UNION ALLs with non-empty partitioned_child_rels, accumulate
@@ -1391,26 +1386,26 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
else
subpaths_valid = false;
- /* Same idea, but for a partial plan. */
- if (childrel->partial_pathlist != NIL)
- {
- cheapest_partial_path = linitial(childrel->partial_pathlist);
- accumulate_append_subpath(cheapest_partial_path,
- &partial_subpaths, NULL);
- }
- else
- partial_subpaths_valid = false;
-
/*
- * Same idea, but for a parallel append mixing partial and non-partial
- * paths.
+ * We also build a set of paths for each child by trying to use the
+ * cheapest partial path, or the cheapest parallel safe normal path
+ * either when that is cheaper, or if parallel Append is disabled.
*/
if (pa_subpaths_valid)
{
Path *nppath = NULL;
+ Path *cheapest_partial_path = NULL;
- nppath =
- get_cheapest_parallel_safe_total_inner(childrel->pathlist);
+ /*
+ * Only attempt to use a parallel safe subplan when parallel
+ * append is allowed on this relation.
+ */
+ if (enable_parallel_append && rel->consider_parallel)
+ nppath =
+ get_cheapest_parallel_safe_total_inner(childrel->pathlist);
+
+ if (childrel->partial_pathlist != NIL)
+ cheapest_partial_path = linitial(childrel->partial_pathlist);
if (cheapest_partial_path == NULL && nppath == NULL)
{
@@ -1525,23 +1520,25 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
partitioned_rels, -1));
/*
- * Consider an append of unordered, unparameterized partial paths. Make
- * it parallel-aware if possible.
+ * Consider Parallel Append, or an Append with a series of parallel_safe
+ * subpaths.
*/
- if (partial_subpaths_valid && partial_subpaths != NIL)
+ if (pa_subpaths_valid)
{
AppendPath *appendpath;
ListCell *lc;
int parallel_workers = 0;
- /* Find the highest number of workers requested for any subpath. */
- foreach(lc, partial_subpaths)
+ /*
+ * Find the highest number of workers requested for any partial
+ * subpath.
+ */
+ foreach(lc, pa_partial_subpaths)
{
Path *path = lfirst(lc);
parallel_workers = Max(parallel_workers, path->parallel_workers);
}
- Assert(parallel_workers > 0);
/*
* If the use of parallel append is permitted, always request at least
@@ -1558,11 +1555,11 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
fls(list_length(live_childrels)));
parallel_workers = Min(parallel_workers,
max_parallel_workers_per_gather);
+ Assert(parallel_workers > 0);
}
- Assert(parallel_workers > 0);
- /* Generate a partial append path. */
- appendpath = create_append_path(root, rel, NIL, partial_subpaths,
+ appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
+ pa_partial_subpaths,
NIL, NULL, parallel_workers,
enable_parallel_append,
partitioned_rels, -1);
@@ -1573,48 +1570,6 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
*/
partial_rows = appendpath->path.rows;
- /* Add the path. */
- add_partial_path(rel, (Path *) appendpath);
- }
-
- /*
- * Consider a parallel-aware append using a mix of partial and non-partial
- * paths. (This only makes sense if there's at least one child which has
- * a non-partial path that is substantially cheaper than any partial path;
- * otherwise, we should use the append path added in the previous step.)
- */
- if (pa_subpaths_valid && pa_nonpartial_subpaths != NIL)
- {
- AppendPath *appendpath;
- ListCell *lc;
- int parallel_workers = 0;
-
- /*
- * Find the highest number of workers requested for any partial
- * subpath.
- */
- foreach(lc, pa_partial_subpaths)
- {
- Path *path = lfirst(lc);
-
- parallel_workers = Max(parallel_workers, path->parallel_workers);
- }
-
- /*
- * Same formula here as above. It's even more important in this
- * instance because the non-partial paths won't contribute anything to
- * the planned number of parallel workers.
- */
- parallel_workers = Max(parallel_workers,
- fls(list_length(live_childrels)));
- parallel_workers = Min(parallel_workers,
- max_parallel_workers_per_gather);
- Assert(parallel_workers > 0);
-
- appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
- pa_partial_subpaths,
- NIL, NULL, parallel_workers, true,
- partitioned_rels, partial_rows);
add_partial_path(rel, (Path *) appendpath);
}
On Wed, Apr 22, 2020 at 12:22 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 21 Apr 2020 at 15:03, David Rowley <dgrowleyml@gmail.com> wrote:
I wonder if the fix should be more something along the lines of trying
to merge things so that we only generate a single partial path. That way
we wouldn't be at the mercy of the logic in add_partial_path() to
accept or reject the path based on the order the paths are added.
In the attached, I'm trying to solve this by creating only one partial
Append path in the first place. This path will always try to use the
cheapest partial path, or the cheapest parallel safe path, if parallel
append is allowed and that path is cheaper than the cheapest partial
path.
I believe the attached gives us what we want. Additionally, since
it should now always pull up the sub-Appends, there's no need to
consider adjusting partitioned_rels based on whether the pull-up
occurred or not. Those should now be right in all cases. This should
also fix the run-time pruning issue, since in my original test case
it'll pull up the sub-Append, which means that the code added in
8edd0e794 will no longer do anything with it, as the top-level Append
will never contain just 1 subpath.
I'm reasonably certain that this is correct, but I did find it a bit
mind-bending considering all the possible cases, so it could do with
some more eyes on it. I've not really done a final polish of the
comments yet. I'll do that if the patch is looking promising.
Thanks for the patch.
It's good to see that unfolded sub-Appends will not occur with the new
code structure, or so one hopes. Also, I am finding it somewhat easier
to understand how partial Appends get built due to the smaller code
footprint after patching.
One thing I remain concerned about is that it appears that we are no
longer leaving the choice between parallel and non-parallel Append to
the cost machinery, which is currently the case. AFAICS with the patch,
as long as parallel Append is enabled and allowed, it will be chosen
over a non-parallel Append as the partial path.
Regarding the patch, I had been assuming that the "pa" in
pa_subpaths_valid stands for "parallel append", so using the
variable as-is in the new code structure would be misleading. Maybe
parallel_subpaths_valid?
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Thu, 23 Apr 2020 at 02:37, Amit Langote <amitlangote09@gmail.com> wrote:
One thing I remain concerned about is that it appears that we are no
longer leaving the choice between parallel and non-parallel Append to
the cost machinery, which is currently the case. AFAICS with the patch,
as long as parallel Append is enabled and allowed, it will be chosen
over a non-parallel Append as the partial path.
Given the same set of paths, when would a non-parallel append be
cheaper than a parallel one? I don't see anything in cost_append()
that could cause the costs to come out higher for the parallel
version. However, I might have misunderstood something. Can you give
an example of a case that you think might change?
The cost comparison is still there for the cheapest parallel safe
normal path vs the cheapest partial path, so when each of those paths
is allowed, we will still end up with the cheapest path for
each child.
Regarding the patch, I had been assuming that the "pa" in
pa_subpaths_valid stands for "parallel append", so it using the
variable as is in the new code structure would be misleading. Maybe,
parallel_subpaths_valid?
Yeah, I had wondered if it would be better to rename it, I'd just not
thought too hard on what to call it yet.
David
David Rowley <dgrowleyml@gmail.com> writes:
Given the same set of paths, when would a non-parallel append be
cheaper than a parallel one?
Well, anytime the parallel startup cost is significant, for starters.
But maybe we account for that at some other point, like when building
the Gather?
regards, tom lane
On Thu, 23 Apr 2020 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
Given the same set of paths, when would a non-parallel append be
cheaper than a parallel one?
Well, anytime the parallel startup cost is significant, for starters.
But maybe we account for that at some other point, like when building
the Gather?
Yeah. There's no mention of parallel_setup_cost or parallel_tuple_cost
in any of the Append costing code. Those are only applied when we cost
Gather / GatherMerge. At the point Amit and I are talking about, we're
only comparing two Append paths. No Gather/GatherMerge is in sight yet,
so any additional costs from those are not applicable.
If there was some reason that a Parallel Append could come out more
expensive, then maybe we could just create a non-parallel Append using
the same subpath list and add_partial_path() it. I just don't quite
see how that would ever win, though I'm willing to be proven wrong.
David
David Rowley <dgrowleyml@gmail.com> writes:
On Thu, 23 Apr 2020 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, anytime the parallel startup cost is significant, for starters.
But maybe we account for that at some other point, like when building
the Gather?
Yeah. There's no mention of parallel_setup_cost or parallel_tuple_cost
in any of the Append costing code. Those are only applied when we cost
Gather / GatherMerge. At the point Amit and I are talking about, we're
only comparing two Append paths. No Gather/GatherMerge in sight yet,
so any additional costs from those are not applicable.
Right, so really the costs of partial and non-partial paths are not
commensurable, and comparing them directly is just misleading.
I trust we're not throwing away non-partial paths on that basis?
regards, tom lane
On Thu, 23 Apr 2020 at 11:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
David Rowley <dgrowleyml@gmail.com> writes:
On Thu, 23 Apr 2020 at 11:11, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Well, anytime the parallel startup cost is significant, for starters.
But maybe we account for that at some other point, like when building
the Gather?
Yeah. There's no mention of parallel_setup_cost or parallel_tuple_cost
in any of the Append costing code. Those are only applied when we cost
Gather / GatherMerge. At the point Amit and I are talking about, we're
only comparing two Append paths. No Gather/GatherMerge in sight yet,
so any additional costs from those are not applicable.
Right, so really the costs of partial and non-partial paths are not
commensurable, and comparing them directly is just misleading.
I trust we're not throwing away non-partial paths on that basis?
There is a case in both master and in the patch where we compare the
cost of the cheapest path in partial_pathlist. However, in this case,
the pathlist path will be used in an Append or Parallel Append with a
Gather below it, so those parallel_(setup|tuple)_costs will be applied
regardless. The non-parallel Append in this case still requires a
Gather since it is still using multiple workers to execute the
subpaths, e.g. the plan I posted in [1].
The code comparing the path costs is:
else if (nppath == NULL ||
(cheapest_partial_path != NULL &&
cheapest_partial_path->total_cost < nppath->total_cost))
[1]: /messages/by-id/CAApHDvqcOD3ObPgPAeU+3qyFL_wzE5kmczw70qMAh7qJ-3wuzw@mail.gmail.com
On Wed, Apr 22, 2020 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
If there was some reason that a Parallel Append could come out more
expensive, then maybe we could just create a non-parallel Append using
the same subpath list and add_partial_path() it. I just don't quite
see how that would ever win though. I'm willing to be proven wrong
though.
I think you're talking about the thing that this comment is trying to explain:
/*
* Consider a parallel-aware append using a mix of partial and non-partial
* paths. (This only makes sense if there's at least one child which has
* a non-partial path that is substantially cheaper than any partial path;
* otherwise, we should use the append path added in the previous step.)
*/
Like, suppose there are two relations A and B, and we're appending
them. A has no indexes, so we can only choose between a Seq Scan and
a Parallel Seq Scan. B has a GIST index that is well-suited to the query,
but GIST indexes don't support parallel scans. So we've got three
choices:
1. Don't use parallelism at all. Then, we can do a normal Append with
a Seq Scan on a and an Index Scan on b. ("If we found unparameterized
paths for all children, build an unordered, unparameterized Append
path for the rel.")
2. Use parallelism for both relations. Then, we can do a Gather over a
Parallel Append (or a regular Append, if Parallel Append is disabled)
with a Parallel Seq Scan on a and a Parallel Seq Scan on b. As
compared with #1, this should win for a, but it might lose heavily for
b, because switching from an index scan to a Seq Scan could be a big
loser. ("Consider an append of unordered, unparameterized partial
paths. Make it parallel-aware if possible.")
3. Use parallelism for a but not for b. The only way to do this is a
Parallel Append, because there's no other way to mix partial and
non-partial paths at present. This lets us get the benefit of a
Parallel Seq Scan on a while still being able to do a non-parallel
GIST Index Scan on b. This has a good chance of being better than #2,
but it's fundamentally a costing decision, because if the table is
small enough or the query isn't very selective, #2 will actually be
faster, just on the raw power of more workers and less random I/O.
("Consider a parallel-aware append using a mix of partial and
non-partial paths.")
It seems to me that all three strategies are viable. The third one is
much less likely to be used now that we have parallel index scans for
btree and parallel bitmap heap scans, but I believe it can be a winner
if you have the right case. You want to think about cases where there
are parallel plans available for everything in the tree, but at least
some of the children have much better non-parallel plans.
Note that for strategy #2 we always prefer Parallel Append to
non-Parallel Append on the optimistic assumption that Parallel Append
will always be better; we only use regular Append if Parallel Append
is disabled. But for strategy #3 there is no such choice to be made: a
regular Append would not be valid. If placed under a Gather, it would
execute the non-partial paths more than once; if not placed under a
Gather, we'd have partial paths without any Gather above them, which
is an invalid plan shape. So you can't "just use a regular Append" in
case #3.
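Robert's point about invalid plan shapes can be sketched with a toy calculation (the function names here are invented for illustration; this is not planner code):

```python
# Toy model of the plan shapes above (invented names, not PostgreSQL
# code). Under a Gather, every worker executes the plan tree. A plain
# Append makes each worker scan every subpath, so a non-partial
# subpath's rows would be emitted once per worker. Parallel Append
# instead lets exactly one worker claim each non-partial subpath.

def rows_from_plain_append(workers: int, nonpartial_rows: int) -> int:
    # Every worker scans the whole non-partial subpath: rows duplicate.
    return workers * nonpartial_rows

def rows_from_parallel_append(workers: int, nonpartial_rows: int) -> int:
    # One worker claims the non-partial subpath; the others skip it.
    return nonpartial_rows

print(rows_from_plain_append(3, 100))     # 300 -- duplicated, wrong
print(rows_from_parallel_append(3, 100))  # 100 -- correct
```

The same toy model shows why partial paths without a Gather above them are also invalid: there would be no workers at all to divide the partial subpaths among.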
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, 23 Apr 2020 at 14:37, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Apr 22, 2020 at 7:36 PM David Rowley <dgrowleyml@gmail.com> wrote:
If there was some reason that a Parallel Append could come out more
expensive, then maybe we could just create a non-parallel Append using
the same subpath list and add_partial_path() it. I just don't quite
see how that would ever win though. I'm willing to be proven wrong
though.
I think you're talking about the thing that this comment is trying to explain:
/*
* Consider a parallel-aware append using a mix of partial and non-partial
* paths. (This only makes sense if there's at least one child which has
* a non-partial path that is substantially cheaper than any partial path;
* otherwise, we should use the append path added in the previous step.)
*/
Like, suppose there are two relations A and B, and we're appending
them. A has no indexes, so we can only choose between a Seq Scan and
a Parallel Seq Scan. B has a GIST index that is well-suited to the query,
but GIST indexes don't support parallel scans. So we've got three
choices:
1. Don't use parallelism at all. Then, we can do a normal Append with
a Seq Scan on a and an Index Scan on b. ("If we found unparameterized
paths for all children, build an unordered, unparameterized Append
path for the rel.")
2. Use parallelism for both relations. Then, we can do a Gather over a
Parallel Append (or a regular Append, if Parallel Append is disabled)
with a Parallel Seq Scan on a and a Parallel Seq Scan on b. As
compared with #1, this should win for a, but it might lose heavily for
b, because switching from an index scan to a Seq Scan could be a big
loser. ("Consider an append of unordered, unparameterized partial
paths. Make it parallel-aware if possible.")
3. Use parallelism for a but not for b. The only way to do this is a
Parallel Append, because there's no other way to mix partial and
non-partial paths at present. This lets us get the benefit of a
Parallel Seq Scan on a while still being able to do a non-parallel
GIST Index Scan on b. This has a good chance of being better than #2,
but it's fundamentally a costing decision, because if the table is
small enough or the query isn't very selective, #2 will actually be
faster, just on the raw power of more workers and less random I/O.
("Consider a parallel-aware append using a mix of partial and
non-partial paths.")
Thanks for those examples.
I ran this situation through the code but used a hash index instead of
GIST. The 3 settings which give us control over this plan are
enable_parallel_append, enable_indexscan, enable_bitmapscan.
enable_bitmapscan must be included since we can still get a parallel
bitmap scan with a hash index.
For completeness, I just tried each of the 8 combinations of the
GUCs, and I've noted below, as comments, which of your cases each
one is testing. There are 4 cases since #2 works with both parallel
and non-parallel Append. Naturally, the aim is that the patched
version does not change the behaviour.
-- Test case
create table listp (a int, b int) partition by list(a);
create table listp1 partition of listp for values in(1);
create table listp2 partition of listp for values in(2);
insert into listp select x,y from generate_Series(1,2) x,
generate_Series(1,1000000) y;
create index on listp2 using hash(b);
vacuum analyze listp;
explain (costs off) select * from listp where b = 1;
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1;
SET enable_indexscan = on;
SET enable_bitmapscan = off;
explain (costs off) select * from listp where b = 1; -- case #3, Mixed
scan of parallel and non-parallel paths with a Parallel Append
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1; -- case #2 with
Parallel Append
SET enable_indexscan = on;
SET enable_bitmapscan = on;
SET enable_parallel_append = off;
explain (costs off) select * from listp where b = 1;
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1; -- case #2 with
non-Parallel Append
SET enable_indexscan = on;
SET enable_bitmapscan = off;
explain (costs off) select * from listp where b = 1; -- case #1, best
serial plan
SET enable_indexscan = off;
explain (costs off) select * from listp where b = 1;
The results, patched/unpatched, are the same.
David
On Thu, 23 Apr 2020 at 02:37, Amit Langote <amitlangote09@gmail.com> wrote:
Regarding the patch, I had been assuming that the "pa" in
pa_subpaths_valid stands for "parallel append", so using the
variable as-is in the new code structure would be misleading. Maybe
parallel_subpaths_valid?
I started making another pass over this patch again. I did the
renaming you've mentioned plus the other two List variables for the
subpaths.
I did notice that I had forgotten to properly disable parallel append
when it was disallowed by the rel's consider_parallel flag. For us to
have done something wrong, we'd have needed a parallel safe path in
the pathlist and also have needed the rel to be consider_parallel ==
false. It was easy enough to make up a case for that by sticking a
parallel restrict function in the target list. I've fixed that issue
in the attached and also polished up the comments a little and removed
the unused variable that I had forgotten about.
For now, I'd still like to get a bit more confidence that the only
noticeable change in the outcome here is that we're now pulling up
sub-Appends in all cases. I've read the code a number of times and
just can't quite see any room for anything changing. My tests per
Robert's case all matched what the previous version did, but I'm still
only about 93% on this. Given that I'm aiming to fix a bug in master,
v11 and v12 here, I need to get that confidence level up to about the
100% mark.
I've attached the v2 patch.
David
Attachments:
always_pullup_child_appends_to_top_level_append_v2.patch
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 255f56b827..544afa2781 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1296,19 +1296,17 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
{
List *subpaths = NIL;
bool subpaths_valid = true;
- List *partial_subpaths = NIL;
- List *pa_partial_subpaths = NIL;
- List *pa_nonpartial_subpaths = NIL;
- bool partial_subpaths_valid = true;
- bool pa_subpaths_valid;
+ List *parallel_partial_subpaths = NIL;
+ List *parallel_nonpartial_subpaths = NIL;
+ bool parallel_subpaths_valid = true;
List *all_child_pathkeys = NIL;
List *all_child_outers = NIL;
ListCell *l;
List *partitioned_rels = NIL;
double partial_rows = -1;
+ bool allow_parallel_append;
- /* If appropriate, consider parallel append */
- pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
+ allow_parallel_append = enable_parallel_append && rel->consider_parallel;
/*
* AppendPath generated for partitioned tables must record the RT indexes
@@ -1366,7 +1364,6 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
{
RelOptInfo *childrel = lfirst(l);
ListCell *lcp;
- Path *cheapest_partial_path = NULL;
/*
* For UNION ALLs with non-empty partitioned_child_rels, accumulate
@@ -1391,31 +1388,31 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
else
subpaths_valid = false;
- /* Same idea, but for a partial plan. */
- if (childrel->partial_pathlist != NIL)
- {
- cheapest_partial_path = linitial(childrel->partial_pathlist);
- accumulate_append_subpath(cheapest_partial_path,
- &partial_subpaths, NULL);
- }
- else
- partial_subpaths_valid = false;
-
/*
- * Same idea, but for a parallel append mixing partial and non-partial
- * paths.
+ * We also build a set of paths for each child by trying to use the
+ * cheapest partial path, or the cheapest parallel safe normal path
+ * either when that is cheaper, or if parallel append is not allowed.
*/
- if (pa_subpaths_valid)
+ if (parallel_subpaths_valid)
{
Path *nppath = NULL;
+ Path *cheapest_partial_path = NULL;
- nppath =
- get_cheapest_parallel_safe_total_inner(childrel->pathlist);
+ /*
+ * Only attempt to use a parallel safe subplan when parallel
+ * append is allowed.
+ */
+ if (allow_parallel_append)
+ nppath =
+ get_cheapest_parallel_safe_total_inner(childrel->pathlist);
+
+ if (childrel->partial_pathlist != NIL)
+ cheapest_partial_path = linitial(childrel->partial_pathlist);
if (cheapest_partial_path == NULL && nppath == NULL)
{
/* Neither a partial nor a parallel-safe path? Forget it. */
- pa_subpaths_valid = false;
+ parallel_subpaths_valid = false;
}
else if (nppath == NULL ||
(cheapest_partial_path != NULL &&
@@ -1424,8 +1421,8 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
/* Partial path is cheaper or the only option. */
Assert(cheapest_partial_path != NULL);
accumulate_append_subpath(cheapest_partial_path,
- &pa_partial_subpaths,
- &pa_nonpartial_subpaths);
+ &parallel_partial_subpaths,
+ &parallel_nonpartial_subpaths);
}
else
@@ -1444,7 +1441,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
* figure that out.
*/
accumulate_append_subpath(nppath,
- &pa_nonpartial_subpaths,
+ &parallel_nonpartial_subpaths,
NULL);
}
}
@@ -1525,23 +1522,28 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
partitioned_rels, -1));
/*
- * Consider an append of unordered, unparameterized partial paths. Make
- * it parallel-aware if possible.
+ * Add a partial append path for the list of partial and parallel-safe
+ * subpaths that we collected above. We perform a parallel append on
+ * these, when allowed. We don't attempt to consider both a parallel and
+ * non-parallel append as we assume a parallel append will always come out
+ * the cheaper of the two partial path options.
*/
- if (partial_subpaths_valid && partial_subpaths != NIL)
+ if (parallel_subpaths_valid)
{
AppendPath *appendpath;
ListCell *lc;
int parallel_workers = 0;
- /* Find the highest number of workers requested for any subpath. */
- foreach(lc, partial_subpaths)
+ /*
+ * Find the highest number of workers requested for any partial
+ * subpath.
+ */
+ foreach(lc, parallel_partial_subpaths)
{
Path *path = lfirst(lc);
parallel_workers = Max(parallel_workers, path->parallel_workers);
}
- Assert(parallel_workers > 0);
/*
* If the use of parallel append is permitted, always request at least
@@ -1552,19 +1554,20 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
* partitions vs. an unpartitioned table with the same data, so the
* use of some kind of log-scaling here seems to make some sense.
*/
- if (enable_parallel_append)
+ if (allow_parallel_append)
{
parallel_workers = Max(parallel_workers,
fls(list_length(live_childrels)));
parallel_workers = Min(parallel_workers,
max_parallel_workers_per_gather);
+ Assert(parallel_workers > 0);
}
- Assert(parallel_workers > 0);
- /* Generate a partial append path. */
- appendpath = create_append_path(root, rel, NIL, partial_subpaths,
+ appendpath = create_append_path(root, rel,
+ parallel_nonpartial_subpaths,
+ parallel_partial_subpaths,
NIL, NULL, parallel_workers,
- enable_parallel_append,
+ allow_parallel_append,
partitioned_rels, -1);
/*
@@ -1573,48 +1576,6 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
*/
partial_rows = appendpath->path.rows;
- /* Add the path. */
- add_partial_path(rel, (Path *) appendpath);
- }
-
- /*
- * Consider a parallel-aware append using a mix of partial and non-partial
- * paths. (This only makes sense if there's at least one child which has
- * a non-partial path that is substantially cheaper than any partial path;
- * otherwise, we should use the append path added in the previous step.)
- */
- if (pa_subpaths_valid && pa_nonpartial_subpaths != NIL)
- {
- AppendPath *appendpath;
- ListCell *lc;
- int parallel_workers = 0;
-
- /*
- * Find the highest number of workers requested for any partial
- * subpath.
- */
- foreach(lc, pa_partial_subpaths)
- {
- Path *path = lfirst(lc);
-
- parallel_workers = Max(parallel_workers, path->parallel_workers);
- }
-
- /*
- * Same formula here as above. It's even more important in this
- * instance because the non-partial paths won't contribute anything to
- * the planned number of parallel workers.
- */
- parallel_workers = Max(parallel_workers,
- fls(list_length(live_childrels)));
- parallel_workers = Min(parallel_workers,
- max_parallel_workers_per_gather);
- Assert(parallel_workers > 0);
-
- appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
- pa_partial_subpaths,
- NIL, NULL, parallel_workers, true,
- partitioned_rels, partial_rows);
add_partial_path(rel, (Path *) appendpath);
}
On Thu, Apr 23, 2020 at 7:37 PM David Rowley <dgrowleyml@gmail.com> wrote:
On Thu, 23 Apr 2020 at 02:37, Amit Langote <amitlangote09@gmail.com> wrote:
Regarding the patch, I had been assuming that the "pa" in
pa_subpaths_valid stands for "parallel append", so using the
variable as-is in the new code structure would be misleading. Maybe
parallel_subpaths_valid?
I started making another pass over this patch again. I did the
renaming you've mentioned plus the other two List variables for the
subpaths.
I did notice that I had forgotten to properly disable parallel append
when it was disallowed by the rel's consider_parallel flag. For us to
have done something wrong, we'd have needed a parallel safe path in
the pathlist and also have needed the rel to be consider_parallel ==
false. It was easy enough to make up a case for that by sticking a
parallel restrict function in the target list. I've fixed that issue
in the attached and also polished up the comments a little and removed
the unused variable that I had forgotten about.
Thanks for updating the patch.
For now, I'd still like to get a bit more confidence that the only
noticeable change in the outcome here is that we're now pulling up
sub-Appends in all cases.
Yeah, I would have liked us to have enough confidence to remove the
following text in the header comment of accumulate_append_subpath:
... If the latter is
* NULL, we don't flatten the path at all (unless it contains only partial
* paths).
I've read the code a number of times and
just can't quite see any room for anything changing. My tests per
Robert's case all matched what the previous version did, but I'm still
only about 93% on this. Given that I'm aiming to fix a bug in master,
v11 and v12 here, I need to get that confidence level up to about the
100% mark.
I think I have managed to convince myself (still < 100% though) that
it's not all that bad that we won't be leaving it up to
add_partial_path() to decide between a Parallel Append whose subpaths
set consist only of partial paths and another whose subpaths set
consists of a mix of partial and non-partial paths. That's because we
will be building only the latter if that one looks cheaper to begin
with. If Parallel Append is disabled, there could only ever be one
partial Append path for add_partial_path() to consider, one whose
subpaths set consists only of partial paths.
On the patch itself:
+ * We also build a set of paths for each child by trying to use the
+ * cheapest partial path, or the cheapest parallel safe normal path
+ * either when that is cheaper, or if parallel append is not allowed.
*/
- if (pa_subpaths_valid)
+ if (parallel_subpaths_valid)
{
In the comment above ", or parallel append is not allowed" should be "
provided parallel append is allowed". Or how about writing it as
follows:
/*
* Add the child's cheapest partial path, or if parallel append is
* allowed, its cheapest parallel safe normal path if that one is
* cheaper, to the partial Append path we are constructing for the
* parent.
*/
--
Amit Langote
EnterpriseDB: http://www.enterprisedb.com
On Thu, 23 Apr 2020 at 22:37, David Rowley <dgrowleyml@gmail.com> wrote:
For now, I'd still like to get a bit more confidence that the only
noticeable change in the outcome here is that we're now pulling up
sub-Appends in all cases. I've read the code a number of times and
just can't quite see any room for anything changing. My tests per
Robert's case all matched what the previous version did, but I'm still
only about 93% on this. Given that I'm aiming to fix a bug in master,
v11 and v12 here, I need to get that confidence level up to about the
100% mark.
I looked at this again and I don't think what I've got is right.
The situation that I'm concerned about is that we generate a Parallel
Append path for a sub-partitioned table and that path has a mix of
partial and parallel safe paths. When we perform
add_paths_to_append_rel() for the top-level partition, if for some
reason we didn't do a Parallel Append, then we could pullup the mixed
set of partial and parallel safe paths from the child-Append. We can't
execute a parallel_safe subpath under a normal Append that's below a
Gather as multiple workers could execute the same parallel safe scan,
which won't lead to anything good, most likely wrong results.
Now, we'll only ever have a Parallel Append for the sub-partitioned
table if enable_parallel_append is on and the rel's consider_parallel
is switched on too, but if for some reason the top-level partitioned
table had consider_parallel set to off, then we could end up pulling
up the subpaths from a Parallel Append path, which could contain a mix
of partial and parallel safe paths and we'd then throw those into a
non-Parallel Append partial path! The parallel safe path just can't
go in one of those as multiple workers could then execute that path.
Only Parallel Append knows how to execute those safely as it ensures
only 1 worker touches it.
Now, looking at the code today, it does seem that a child rel will do
parallel only if the parent rel does, but I don't think it's a great
idea to assume that will never change. Additionally, it seems fragile
to assume that the value of enable_parallel_append, a global
variable, would stay the same between calls too.
I wonder if we could fix this by when we call
add_paths_to_append_rel() for the top-level partitioned table, just
recursively get all the child rels and their children too.
accumulate_append_subpath() would then never see Appends or
MergeAppends as we'd just be dealing with scan type Paths.
add_paths_to_append_rel() for sub-partitioned tables wouldn't have to
do very much then. I'll need to see how that idea would fit in with
partition-wise joins. My guess is, not very well. There's also
apply_scanjoin_target_to_paths() which calls add_paths_to_append_rel()
too.
The only other idea I have so far is just to either not generate the
partial path using the partial_subpaths list in
add_paths_to_append_rel() when pa_subpaths_valid and
pa_nonpartial_subpaths != NIL, i.e. only create the first of those
partial paths if we're not going to create the 2nd one. Or even just
swap the order that we add_partial_path() them so that
add_partial_path() does not reject the path with the flattened
Appends.
David
On Sat, Apr 25, 2020 at 7:25 AM David Rowley <dgrowleyml@gmail.com> wrote:
I looked at this again and I don't think what I've got is right.
Apart from the considerations which you raised, I think there's also a
costing issue here. For instance, suppose we have an Append with three
children. Two of them have partial paths which will complete after
consuming 1 second of CPU time, and using a partial path is
unquestionably the best strategy for those children. The third one has
a partial path which will complete after using 10 seconds of CPU time
and a non-partial path which will complete after using 8 seconds of
CPU time. What strategy is best? If we pick the non-partial path, the
time that the whole thing takes to complete will be limited by the
non-partial path, which, since it cannot use parallelism, will take 8
seconds of wall clock time. If we pick the partial path, we have a
total of 12 seconds of CPU time which we can finish in 6 wall clock
seconds with 2 participants, 4 seconds with 3 participants, or 3
seconds with 4 participants. This is clearly a lot better.
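The arithmetic in this example can be sketched as follows (a hypothetical helper for illustration, not planner code):

```python
# Sketch of the wall-clock arithmetic above (invented helper, not
# planner code). Partial paths pool their CPU seconds across all
# parallel participants; a non-partial path is bound to one process.

def wall_clock_all_partial(cpu_seconds: float, participants: int) -> float:
    # Total CPU work divided evenly among the participants.
    return cpu_seconds / participants

# Two children with 1-second partial paths, plus a third child that is
# either a 10-second partial path or an 8-second non-partial path.
all_partial_cpu = 1 + 1 + 10  # 12 CPU-seconds, fully parallelizable

print(wall_clock_all_partial(all_partial_cpu, 2))  # 6.0 seconds
print(wall_clock_all_partial(all_partial_cpu, 3))  # 4.0 seconds
print(wall_clock_all_partial(all_partial_cpu, 4))  # 3.0 seconds

# With the non-partial path chosen, the Append can't finish before
# that path's single process does: at least 8 seconds of wall clock.
print(max(8, wall_clock_all_partial(1 + 1, 2)))    # 8 seconds
```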
Incidentally, the planner knows about the fact that you can't finish
an Append faster than you can finish its non-partial participants. See
append_nonpartial_cost().
Now, I don't think anything necessarily gets broken here by your patch
as written. Planner decisions are made on estimated cost, which is a
proxy for wall clock time, not CPU time. Right now, we only prefer the
non-partial path if we think it can be executed in less time by ONE
worker than the partial path could be executed by ALL workers:
/*
 * Either we've got only a non-partial path, or we think that
 * a single backend can execute the best non-partial path
 * faster than all the parallel backends working together can
 * execute the best partial path.
So, in the above example, we wouldn't even consider the non-partial
path. Even say we just have 2 workers. The partial path will have a
cost of 6, which is less than 8, so it gets rejected. But as the
comment goes on to note:
 * It might make sense to be more aggressive here. Even if
 * the best non-partial path is more expensive than the best
 * partial path, it could still be better to choose the
 * non-partial path if there are several such paths that can
 * be given to different workers. For now, we don't try to
 * figure that out.
This kind of strategy is particularly appealing when the number of
Append children is large compared to the number of workers. For
instance, if you have an Append with 100 children and you are planning
with 4 workers, it's probably a pretty good idea to be very aggressive
about picking the path that uses the least resources, which the
current algorithm would not do. You're unlikely to end up with idle
workers, because you have so many plans to which they can be
allocated. However, it's possible: it could be that there's one child
which is way bigger than all the others and the really important thing
is to get a partial path for that child, so that it can be attacked in
parallel, even if that means that overall resource consumption is
higher. As the number of children decreases relative to the number of
workers, having partial paths becomes increasingly appealing. To take
a degenerate case, suppose you have 4 workers but only 2 children.
Partial paths should look really appealing, because the alternative is
leaving workers unused.
I *think* your patch is based around the idea of merging the
turn-the-append-of-partial-paths-into-a-parallel-append case with the
consider-a-mix-of-partial-and-nonpartial-paths-for-parallel-append
case. That seems possible to do given that the current heuristic is to
compare the raw path costs, but I think that it wouldn't work if we
wanted to be more aggressive about considering the mixed strategy and
letting the costing machinery sort out which way is better.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert forwarded me a link to an email sent to -general list, where
the reported problem seems to be the same problem that was being
discussed here.
/messages/by-id/d0f6d811-8946-eb9f-68e2-1a8a7f80ff21@a-kretschmer.de
Going over the last few emails, it seems that David held off from
committing the patch, because of the lack of confidence in its
robustness. With the patch, a sub-partitioned child's partial
Append's child paths will *always* be pulled up into the parent's set
of partial child paths thus preventing the nesting of Appends, which
the run-time pruning code can't currently cope with. The lack of
confidence seems to be due to the fact that always pulling up a
sub-Append's child paths into the parent partial Append's child paths
*may* cause the latter to behave wrongly under parallelism and the new
code structure will prevent add_partial_path() from being the
arbitrator of whether such a path is really the best in terms of cost.
If we can't be confident in that approach, maybe we should consider
making the run-time pruning code cope with nested Appends?
--
Amit Langote
EDB: http://www.enterprisedb.com
On Fri, 16 Oct 2020 at 03:01, Amit Langote <amitlangote09@gmail.com> wrote:
Going over the last few emails, it seems that David held off from
committing the patch, because of the lack of confidence in its
robustness. With the patch, a sub-partitioned child's partial
Append's child paths will *always* be pulled up into the parent's set
of partial child paths thus preventing the nesting of Appends, which
the run-time pruning code can't currently cope with. The lack of
confidence seems to be due to the fact that always pulling up a
sub-Append's child paths into the parent partial Append's child paths
*may* cause the latter to behave wrongly under parallelism and the new
code structure will prevent add_partial_path() from being the
arbitrator of whether such a path is really the best in terms of cost.
If we can't be confident in that approach, maybe we should consider
making the run-time pruning code cope with nested Appends?
I've been thinking about this and my thoughts are:
There are other cases where we don't pullup sub-Merge Append nodes
anyway, so I think we should just make run-time pruning work for more
than just the top-level Append/Merge Append.
The case I'm thinking about is the code added in 959d00e9dbe for
ordered Append scans. It's possible a sub-partitioned table has
partitions which don't participate in the same ordering. We need to
keep the sub-Merge Append in those cases... well, at least when
there's more than 1 partition remaining after plan-time pruning.
I've attached the patch which, pending a final look, I'm proposing to
commit for master only. I just don't quite think this is a bug fix,
and certainly, due to the invasiveness of the proposed patch, that
means no backpatch.
I fixed all the stuff about the incorrectly set partitioned_rels list.
What I ended up with there is making it accumulate_append_subpath's
job to also pullup the sub-paths partitioned_rels fields when pulling
up a nested Append/MergeAppend. If there's no pullup, then we
don't care about the sub-path's partitioned_rels. That's for it to
deal with. With that, I think that run-time pruning will only get the
RT indexes for partitions that we actually have sub-paths for. That's
pretty good, so I added an Assert() to verify that in
make_partitionedrel_pruneinfo(). (I hope I don't regret that later)
This does mean we need to maintain a different partitioned_rels list
for each Append path we consider. So there's a number (six) of these
now between add_paths_to_append_rel() and
generate_orderedappend_paths(). To try to minimise the impact of
that, I've changed the field so that instead of being a List of
IntLists, it's just a List of Relids. The top-level list just
contains a single element until you get a UNION ALL that selects from
a partitioned table on each side of the union. Merging sub-path
partitioned rel RT indexes into the existing element is now pretty
cheap as it just uses bms_add_members() rather than the list_concat we'd
have had to use if it was still a List of IntLists.
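To picture why the merge is cheap: each list element is now a Bitmapset
of RT indexes, so folding a sub-path's set into the parent's is a single
word-wise union. A toy stand-in for the idea (this is not the real
Bitmapset API; it assumes RT indexes below 64, which the real code does
not):

```c
#include <assert.h>

/* Toy stand-in for a Relids/Bitmapset: a 64-bit word where bit N
 * represents range-table index N (assumes RT indexes < 64). */
typedef unsigned long long toy_relids;

static toy_relids
toy_make_singleton(int rti)
{
    return 1ULL << rti;
}

/* The analogue of bms_add_members(): an O(1) bitwise union, versus the
 * O(n) list_concat needed when this was still a List of IntLists. */
static toy_relids
toy_add_members(toy_relids a, toy_relids b)
{
    return a | b;
}

static int
toy_is_member(int rti, toy_relids s)
{
    return (s >> rti) & 1;
}
```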
After fixing up how partitioned_rels is built, I saw there were no
usages of RelOptInfo.partitioned_child_rels, so I got rid of it.
I did another couple of little cleanups and wrote some regression
tests to test all this.
Overall I'm fairly happy with this, especially getting rid of a
partitioned table related field from RelOptInfo.
David
Attachments:
v1-0001-Allow-run-time-pruning-on-nested-Append-MergeAppe.patch
From b8f33c0074f3546139a396a683926c29db2a193f Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Sun, 25 Oct 2020 12:38:58 +1300
Subject: [PATCH v1] Allow run-time pruning on nested Append/MergeAppend nodes
Previously we only tagged on the required information to allow the
executor to perform run-time partition pruning for Append/MergeAppend
nodes belonging to base relations. It was thought that nested
Append/MergeAppend nodes were just about always pulled up into the
top-level Append/MergeAppend and that making the run-time pruning info for
any sub Append/MergeAppend nodes was a waste of time. However, that was
likely badly thought through.
Some examples of cases where we're unable to pull up nested
Append/MergeAppends are: 1) Parallel Append nodes with a mix of parallel
and non-parallel subpaths; 2) when planning an ordered Append scan, an
unordered sub-partition may require a nested MergeAppend path to ensure
its sub-partitions don't mix up the order of tuples being fed into the
top-level Append.
Unfortunately, it was not just as simple as removing the lines in
createplan.c which were purposefully not building the run-time pruning
info for anything but RELOPT_BASEREL relations. The code in
add_paths_to_append_rel() was far too sloppy about which partitioned_rels
it included for the Append/MergeAppend paths. The original code there
would always assume accumulate_append_subpath() would pull each sub-Append
and sub-MergeAppend path into the top level path. While it does not
appear that those surplus RT indexes caused any actual bugs, it did mean
that later in planning, when we built the run-time pruning info, we
wasted effort building PartitionedRelPruneInfos for partitioned tables
that we had no subpaths for the executor to run-time prune.
Here we tighten that up so that partitioned_rels only ever contains the RT
index for partitioned tables which actually have subpaths in the given
Append/MergeAppend. We can now Assert that every PartitionedRelPruneInfo
has a non-empty present_parts. That should allow us to catch any weird
corner cases that have been missed.
In passing, it seems there is no longer a good reason to have the
AppendPath and MergeAppendPath's partitioned_rels fields be a List of
IntLists. We can simply use a List of Relids. We still know which is the root
level partition as these always have a lower relid than their children.
Once upon a time this field did get passed to the executor to instruct it
which partitioned tables needed to be locked. We now have something much
more robust for that.
Here we also get rid of the RelOptInfo partitioned_child_rels field. This
is what was previously used to (sometimes incorrectly) set the
Append/MergeAppend path's partitioned_rels field. That was the only user
of that field, so we can happily just nuke it out of existence.
I also couldn't resist changing some nearby code to make use of the newly
added for_each_from macro so we can skip the first element in the list
without checking if the current item was the first one on each
iteration.
A bug report from Andreas Kretschmer prompted all this work, however,
after some consideration, I'm not personally classing this as a bug fix.
So no backpatch. In Andreas' test case, it just wasn't that clear that
there was a nested Append since the top-level Append just had a single
sub-path and was pulled up a level, per 8edd0e794.
Author: David Rowley
Discussion: https://postgr.es/m/flat/CAApHDvqSchs%2BubdybcfFaSPB%2B%2BEA7kqMaoqajtP0GtZvzOOR3g%40mail.gmail.com
---
src/backend/nodes/outfuncs.c | 1 -
src/backend/optimizer/path/allpaths.c | 233 +++++++++++-------
src/backend/optimizer/plan/createplan.c | 2 -
src/backend/optimizer/util/relnode.c | 3 -
src/backend/partitioning/partprune.c | 25 +-
src/include/nodes/pathnodes.h | 16 +-
src/test/regress/expected/partition_prune.out | 102 ++++++++
src/test/regress/sql/partition_prune.sql | 49 ++++
8 files changed, 318 insertions(+), 113 deletions(-)
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 08a049232e..7049d9eef3 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2310,7 +2310,6 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
WRITE_BITMAPSET_FIELD(top_parent_relids);
WRITE_BOOL_FIELD(partbounds_merged);
WRITE_BITMAPSET_FIELD(all_partrels);
- WRITE_NODE_FIELD(partitioned_child_rels);
}
static void
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index b399592ff8..ea7410dd42 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -104,8 +104,13 @@ static void generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
static Path *get_cheapest_parameterized_child_path(PlannerInfo *root,
RelOptInfo *rel,
Relids required_outer);
+static List *accumulate_partitioned_rels(List *partitioned_rels,
+ List *sub_partitioned_rels,
+ bool flatten_partitioned_rels);
static void accumulate_append_subpath(Path *path,
- List **subpaths, List **special_subpaths);
+ List **subpaths, List **special_subpaths,
+ List **partitioned_rels,
+ bool flatten_partitioned_rels);
static Path *get_singleton_append_subpath(Path *path);
static void set_dummy_rel_pathlist(RelOptInfo *rel);
static void set_subquery_pathlist(PlannerInfo *root, RelOptInfo *rel,
@@ -959,17 +964,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
Assert(IS_SIMPLE_REL(rel));
- /*
- * Initialize partitioned_child_rels to contain this RT index.
- *
- * Note that during the set_append_rel_pathlist() phase, we will bubble up
- * the indexes of partitioned relations that appear down in the tree, so
- * that when we've created Paths for all the children, the root
- * partitioned table's list will contain all such indexes.
- */
- if (rte->relkind == RELKIND_PARTITIONED_TABLE)
- rel->partitioned_child_rels = list_make1_int(rti);
-
/*
* If this is a partitioned baserel, set the consider_partitionwise_join
* flag; currently, we only consider partitionwise joins with the baserel
@@ -1269,12 +1263,6 @@ set_append_rel_pathlist(PlannerInfo *root, RelOptInfo *rel,
if (IS_DUMMY_REL(childrel))
continue;
- /* Bubble up childrel's partitioned children. */
- if (rel->part_scheme)
- rel->partitioned_child_rels =
- list_concat(rel->partitioned_child_rels,
- childrel->partitioned_child_rels);
-
/*
* Child is live, so add it to the live_childrels list for use below.
*/
@@ -1312,56 +1300,34 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
List *all_child_outers = NIL;
ListCell *l;
List *partitioned_rels = NIL;
+ List *partial_partitioned_rels = NIL;
+ List *pa_partitioned_rels = NIL;
double partial_rows = -1;
+ bool flatten_partitioned_rels;
/* If appropriate, consider parallel append */
pa_subpaths_valid = enable_parallel_append && rel->consider_parallel;
+ /* What we do with the partitioned_rels list is different for UNION ALL */
+ flatten_partitioned_rels = (rel->rtekind != RTE_SUBQUERY);
+
/*
- * AppendPath generated for partitioned tables must record the RT indexes
- * of partitioned tables that are direct or indirect children of this
- * Append rel.
- *
- * AppendPath may be for a sub-query RTE (UNION ALL), in which case, 'rel'
- * itself does not represent a partitioned relation, but the child sub-
- * queries may contain references to partitioned relations. The loop
- * below will look for such children and collect them in a list to be
- * passed to the path creation function. (This assumes that we don't need
- * to look through multiple levels of subquery RTEs; if we ever do, we
- * could consider stuffing the list we generate here into sub-query RTE's
- * RelOptInfo, just like we do for partitioned rels, which would be used
- * when populating our parent rel with paths. For the present, that
- * appears to be unnecessary.)
+ * For partitioned tables, we accumulate a list of the partitioned RT
+ * indexes for the subpaths that are directly under this Append. This is
+ * used later for run-time partition pruning. We must maintain separate
+ * lists for each Append Path that we create as accumulate_append_subpath
+ * sometimes can't flatten sub-Appends into the top-level Append.
+ * We needn't bother doing this for join rels as no run-time pruning is
+ * done on those.
*/
- if (rel->part_scheme != NULL)
+ if (rel->reloptkind != RELOPT_JOINREL && rel->part_scheme != NULL)
{
- if (IS_SIMPLE_REL(rel))
- partitioned_rels = list_make1(rel->partitioned_child_rels);
- else if (IS_JOIN_REL(rel))
- {
- int relid = -1;
- List *partrels = NIL;
-
- /*
- * For a partitioned joinrel, concatenate the component rels'
- * partitioned_child_rels lists.
- */
- while ((relid = bms_next_member(rel->relids, relid)) >= 0)
- {
- RelOptInfo *component;
-
- Assert(relid >= 1 && relid < root->simple_rel_array_size);
- component = root->simple_rel_array[relid];
- Assert(component->part_scheme != NULL);
- Assert(list_length(component->partitioned_child_rels) >= 1);
- partrels = list_concat(partrels,
- component->partitioned_child_rels);
- }
+ partitioned_rels = list_make1(bms_make_singleton(rel->relid));
+ partial_partitioned_rels = list_make1(bms_make_singleton(rel->relid));
- partitioned_rels = list_make1(partrels);
- }
-
- Assert(list_length(partitioned_rels) >= 1);
+ /* skip this one if we're not going to make a Parallel Append path */
+ if (pa_subpaths_valid)
+ pa_partitioned_rels = list_make1(bms_make_singleton(rel->relid));
}
/*
@@ -1375,14 +1341,6 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
ListCell *lcp;
Path *cheapest_partial_path = NULL;
- /*
- * For UNION ALLs with non-empty partitioned_child_rels, accumulate
- * the Lists of child relations.
- */
- if (rel->rtekind == RTE_SUBQUERY && childrel->partitioned_child_rels != NIL)
- partitioned_rels = lappend(partitioned_rels,
- childrel->partitioned_child_rels);
-
/*
* If child has an unparameterized cheapest-total path, add that to
* the unparameterized Append path we are constructing for the parent.
@@ -1394,7 +1352,8 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
if (childrel->pathlist != NIL &&
childrel->cheapest_total_path->param_info == NULL)
accumulate_append_subpath(childrel->cheapest_total_path,
- &subpaths, NULL);
+ &subpaths, NULL, &partitioned_rels,
+ flatten_partitioned_rels);
else
subpaths_valid = false;
@@ -1403,7 +1362,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
{
cheapest_partial_path = linitial(childrel->partial_pathlist);
accumulate_append_subpath(cheapest_partial_path,
- &partial_subpaths, NULL);
+ &partial_subpaths, NULL,
+ &partial_partitioned_rels,
+ flatten_partitioned_rels);
}
else
partial_subpaths_valid = false;
@@ -1432,7 +1393,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
Assert(cheapest_partial_path != NULL);
accumulate_append_subpath(cheapest_partial_path,
&pa_partial_subpaths,
- &pa_nonpartial_subpaths);
+ &pa_nonpartial_subpaths,
+ &pa_partitioned_rels,
+ flatten_partitioned_rels);
}
else
@@ -1452,7 +1415,9 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
*/
accumulate_append_subpath(nppath,
&pa_nonpartial_subpaths,
- NULL);
+ NULL,
+ &pa_partitioned_rels,
+ flatten_partitioned_rels);
}
}
@@ -1572,7 +1537,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
appendpath = create_append_path(root, rel, NIL, partial_subpaths,
NIL, NULL, parallel_workers,
enable_parallel_append,
- partitioned_rels, -1);
+ partial_partitioned_rels, -1);
/*
* Make sure any subsequent partial paths use the same row count
@@ -1621,7 +1586,7 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
appendpath = create_append_path(root, rel, pa_nonpartial_subpaths,
pa_partial_subpaths,
NIL, NULL, parallel_workers, true,
- partitioned_rels, partial_rows);
+ pa_partitioned_rels, partial_rows);
add_partial_path(rel, (Path *) appendpath);
}
@@ -1651,6 +1616,10 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
{
Relids required_outer = (Relids) lfirst(l);
ListCell *lcr;
+ List *part_rels = NIL;
+
+ if (rel->reloptkind != RELOPT_JOINREL && rel->part_scheme != NULL)
+ part_rels = list_make1(bms_make_singleton(rel->relid));
/* Select the child paths for an Append with this parameterization */
subpaths = NIL;
@@ -1676,14 +1645,17 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
subpaths_valid = false;
break;
}
- accumulate_append_subpath(subpath, &subpaths, NULL);
+ accumulate_append_subpath(subpath, &subpaths, NULL, &part_rels,
+ flatten_partitioned_rels);
}
if (subpaths_valid)
add_path(rel, (Path *)
create_append_path(root, rel, subpaths, NIL,
NIL, required_outer, 0, false,
- partitioned_rels, -1));
+ part_rels, -1));
+ else
+ list_free(part_rels); /* XXX need we bother? */
}
/*
@@ -1697,17 +1669,14 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
{
RelOptInfo *childrel = (RelOptInfo *) linitial(live_childrels);
- foreach(l, childrel->partial_pathlist)
+ /* skip the cheapest partial path, since we already used that above */
+ for_each_from(l, childrel->partial_pathlist, 1)
{
Path *path = (Path *) lfirst(l);
AppendPath *appendpath;
- /*
- * Skip paths with no pathkeys. Also skip the cheapest partial
- * path, since we already used that above.
- */
- if (path->pathkeys == NIL ||
- path == linitial(childrel->partial_pathlist))
+ /* skip paths with no pathkeys. */
+ if (path->pathkeys == NIL)
continue;
appendpath = create_append_path(root, rel, NIL, list_make1(path),
@@ -1757,6 +1726,18 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
List *partition_pathkeys_desc = NIL;
bool partition_pathkeys_partial = true;
bool partition_pathkeys_desc_partial = true;
+ List *startup_partitioned_rels = NIL;
+ List *total_partitioned_rels = NIL;
+ bool flatten_partitioned_rels;
+
+ /* Set up the method for building the partitioned rels lists */
+ flatten_partitioned_rels = (rel->rtekind != RTE_SUBQUERY);
+
+ if (rel->reloptkind != RELOPT_JOINREL && rel->part_scheme != NULL)
+ {
+ startup_partitioned_rels = list_make1(bms_make_singleton(rel->relid));
+ total_partitioned_rels = list_make1(bms_make_singleton(rel->relid));
+ }
/*
* Some partitioned table setups may allow us to use an Append node
@@ -1898,9 +1879,13 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
* child paths for the MergeAppend.
*/
accumulate_append_subpath(cheapest_startup,
- &startup_subpaths, NULL);
+ &startup_subpaths, NULL,
+ &startup_partitioned_rels,
+ flatten_partitioned_rels);
accumulate_append_subpath(cheapest_total,
- &total_subpaths, NULL);
+ &total_subpaths, NULL,
+ &total_partitioned_rels,
+ flatten_partitioned_rels);
}
}
@@ -1916,7 +1901,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
NULL,
0,
false,
- partitioned_rels,
+ startup_partitioned_rels,
-1));
if (startup_neq_total)
add_path(rel, (Path *) create_append_path(root,
@@ -1927,7 +1912,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
NULL,
0,
false,
- partitioned_rels,
+ total_partitioned_rels,
-1));
}
else
@@ -1938,14 +1923,14 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
startup_subpaths,
pathkeys,
NULL,
- partitioned_rels));
+ startup_partitioned_rels));
if (startup_neq_total)
add_path(rel, (Path *) create_merge_append_path(root,
rel,
total_subpaths,
pathkeys,
NULL,
- partitioned_rels));
+ total_partitioned_rels));
}
}
}
@@ -2024,6 +2009,54 @@ get_cheapest_parameterized_child_path(PlannerInfo *root, RelOptInfo *rel,
return cheapest;
}
+/*
+ * accumulate_partitioned_rels
+ * Record 'sub_partitioned_rels' in the 'partitioned_rels' list,
+ * flattening as appropriate.
+ */
+static List *
+accumulate_partitioned_rels(List *partitioned_rels,
+ List *sub_partitioned_rels,
+ bool flatten)
+{
+ if (flatten)
+ {
+ /*
+ * We're only called with flatten == true when the partitioned_rels
+ * list has at most 1 element. So we can just add the members from
+ * sub list's first element onto the first element of
+ * partitioned_rels. Only later in planning when doing UNION ALL
+ * Append processing will we see flatten == false. partitioned_rels
+ * may end up with more than 1 element then, but we never expect to be
+ * called with flatten == true again after that, so we needn't bother
+ * doing anything here for anything but the initial element.
+ */
+ if (partitioned_rels != NIL && sub_partitioned_rels != NIL)
+ {
+ Relids partrels = (Relids) linitial(partitioned_rels);
+ Relids subpartrels = (Relids) linitial(sub_partitioned_rels);
+
+ /* Ensure the above comment holds true */
+ Assert(list_length(partitioned_rels) == 1);
+ Assert(list_length(sub_partitioned_rels) == 1);
+
+ linitial(partitioned_rels) = bms_add_members(partrels, subpartrels);
+ }
+ }
+ else
+ {
+ /*
+ * Handle UNION ALL to partitioned tables. This always occurs after
+ * we've done the accumulation for sub-partitioned tables, so there's
+ * no need to consider how adding multiple elements to the top level
+ * list affects the flatten == true case above.
+ */
+ partitioned_rels = list_concat(partitioned_rels, sub_partitioned_rels);
+ }
+
+ return partitioned_rels;
+}
+
/*
* accumulate_append_subpath
* Add a subpath to the list being built for an Append or MergeAppend.
@@ -2044,9 +2077,22 @@ get_cheapest_parameterized_child_path(PlannerInfo *root, RelOptInfo *rel,
* children to subpaths and the rest to special_subpaths. If the latter is
* NULL, we don't flatten the path at all (unless it contains only partial
* paths).
+ *
+ * When pulling up sub-Appends and sub-Merge Appends, we also gather the
+ * path's list of partitioned tables and store it in 'partitioned_rels'.
+ * When 'flatten_partitioned_rels' is true, 'partitioned_rels' will contain at
+ * most one element, which is a Relids of the partitioned relations for which
+ * there are subpaths. In this case we just add the RT indexes for the
+ * partitioned tables for the subpath we're pulling up to the single entry in
+ * 'partitioned_rels'. When 'flatten_partitioned_rels' is false we
+ * concatenate the path's partitioned rel list onto the top-level list. This
+ * is done for UNION ALLs which could have a partitioned table in each union
+ * branch.
*/
static void
-accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
+accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths,
+ List **partitioned_rels,
+ bool flatten_partitioned_rels)
{
if (IsA(path, AppendPath))
{
@@ -2055,6 +2101,9 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
if (!apath->path.parallel_aware || apath->first_partial_path == 0)
{
*subpaths = list_concat(*subpaths, apath->subpaths);
+ *partitioned_rels = accumulate_partitioned_rels(*partitioned_rels,
+ apath->partitioned_rels,
+ flatten_partitioned_rels);
return;
}
else if (special_subpaths != NULL)
@@ -2070,6 +2119,9 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
apath->first_partial_path);
*special_subpaths = list_concat(*special_subpaths,
new_special_subpaths);
+ *partitioned_rels = accumulate_partitioned_rels(*partitioned_rels,
+ apath->partitioned_rels,
+ flatten_partitioned_rels);
return;
}
}
@@ -2078,6 +2130,9 @@ accumulate_append_subpath(Path *path, List **subpaths, List **special_subpaths)
MergeAppendPath *mpath = (MergeAppendPath *) path;
*subpaths = list_concat(*subpaths, mpath->subpaths);
+ *partitioned_rels = accumulate_partitioned_rels(*partitioned_rels,
+ mpath->partitioned_rels,
+ flatten_partitioned_rels);
return;
}
diff --git a/src/backend/optimizer/plan/createplan.c b/src/backend/optimizer/plan/createplan.c
index 94280a730c..40abe6f9f6 100644
--- a/src/backend/optimizer/plan/createplan.c
+++ b/src/backend/optimizer/plan/createplan.c
@@ -1228,7 +1228,6 @@ create_append_plan(PlannerInfo *root, AppendPath *best_path, int flags)
* do partition pruning.
*/
if (enable_partition_pruning &&
- rel->reloptkind == RELOPT_BASEREL &&
best_path->partitioned_rels != NIL)
{
List *prunequal;
@@ -1395,7 +1394,6 @@ create_merge_append_plan(PlannerInfo *root, MergeAppendPath *best_path,
* do partition pruning.
*/
if (enable_partition_pruning &&
- rel->reloptkind == RELOPT_BASEREL &&
best_path->partitioned_rels != NIL)
{
List *prunequal;
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index a203e6f1ff..76245c1ff3 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -257,7 +257,6 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
rel->all_partrels = NULL;
rel->partexprs = NULL;
rel->nullable_partexprs = NULL;
- rel->partitioned_child_rels = NIL;
/*
* Pass assorted information down the inheritance hierarchy.
@@ -672,7 +671,6 @@ build_join_rel(PlannerInfo *root,
joinrel->all_partrels = NULL;
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->partitioned_child_rels = NIL;
/* Compute information relevant to the foreign relations. */
set_foreign_rel_properties(joinrel, outer_rel, inner_rel);
@@ -850,7 +848,6 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
joinrel->all_partrels = NULL;
joinrel->partexprs = NULL;
joinrel->nullable_partexprs = NULL;
- joinrel->partitioned_child_rels = NIL;
joinrel->top_parent_relids = bms_union(outer_rel->top_parent_relids,
inner_rel->top_parent_relids);
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index 6268623d56..fdbfddb30b 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -141,7 +141,7 @@ typedef struct PruneStepResult
static List *make_partitionedrel_pruneinfo(PlannerInfo *root,
RelOptInfo *parentrel,
int *relid_subplan_map,
- List *partitioned_rels, List *prunequal,
+ Relids partrelids, List *prunequal,
Bitmapset **matchedsubplans);
static void gen_partprune_steps(RelOptInfo *rel, List *clauses,
PartClauseTarget target,
@@ -267,13 +267,13 @@ make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
prunerelinfos = NIL;
foreach(lc, partitioned_rels)
{
- List *rels = (List *) lfirst(lc);
+ Relids partrelids = (Relids) lfirst(lc);
List *pinfolist;
Bitmapset *matchedsubplans = NULL;
pinfolist = make_partitionedrel_pruneinfo(root, parentrel,
relid_subplan_map,
- rels, prunequal,
+ partrelids, prunequal,
&matchedsubplans);
/* When pruning is possible, record the matched subplans */
@@ -342,7 +342,7 @@ make_partition_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
static List *
make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
int *relid_subplan_map,
- List *partitioned_rels, List *prunequal,
+ Relids partrelids, List *prunequal,
Bitmapset **matchedsubplans)
{
RelOptInfo *targetpart = NULL;
@@ -351,6 +351,7 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
int *relid_subpart_map;
Bitmapset *subplansfound = NULL;
ListCell *lc;
+ int rti;
int i;
/*
@@ -364,9 +365,9 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
relid_subpart_map = palloc0(sizeof(int) * root->simple_rel_array_size);
i = 1;
- foreach(lc, partitioned_rels)
+ rti = -1;
+ while ((rti = bms_next_member(partrelids, rti)) > 0)
{
- Index rti = lfirst_int(lc);
RelOptInfo *subpart = find_base_rel(root, rti);
PartitionedRelPruneInfo *pinfo;
List *partprunequal;
@@ -379,14 +380,12 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
* Fill the mapping array.
*
* relid_subpart_map maps relid of a non-leaf partition to the index
- * in 'partitioned_rels' of that rel (which will also be the index in
- * the returned PartitionedRelPruneInfo list of the info for that
+ * in 'partrelids' of that rel (which will also be the index in the
+ * returned PartitionedRelPruneInfo list of the info for that
* partition). We use 1-based indexes here, so that zero can
* represent an un-filled array entry.
*/
Assert(rti < root->simple_rel_array_size);
- /* No duplicates please */
- Assert(relid_subpart_map[rti] == 0);
relid_subpart_map[rti] = i++;
/*
@@ -582,6 +581,12 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
present_parts = bms_add_member(present_parts, i);
}
+ /*
+ * Ensure no stray PartitionedRelPruneInfos were generated for
+ * partitioned tables that we had no sub-paths for.
+ */
+ Assert(!bms_is_empty(present_parts));
+
/* Record the maps and other information. */
pinfo->present_parts = present_parts;
pinfo->nparts = nparts;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index 3dd16b9ad5..6c23b7bc8c 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -601,9 +601,6 @@ typedef struct PartitionSchemeData *PartitionScheme;
* part_rels - RelOptInfos for each partition
* all_partrels - Relids set of all partition relids
* partexprs, nullable_partexprs - Partition key expressions
- * partitioned_child_rels - RT indexes of unpruned partitions of
- * this relation that are partitioned tables
- * themselves, in hierarchical order
*
* The partexprs and nullable_partexprs arrays each contain
* part_scheme->partnatts elements. Each of the elements is a list of
@@ -751,7 +748,6 @@ typedef struct RelOptInfo
Relids all_partrels; /* Relids set of all partition relids */
List **partexprs; /* Non-nullable partition key expressions */
List **nullable_partexprs; /* Nullable partition key expressions */
- List *partitioned_child_rels; /* List of RT indexes */
} RelOptInfo;
/*
@@ -1398,8 +1394,10 @@ typedef struct CustomPath
typedef struct AppendPath
{
Path path;
- /* RT indexes of non-leaf tables in a partition tree */
- List *partitioned_rels;
+ List *partitioned_rels; /* List of Relids for each non-leaf
+ * partitioned table in the partition
+ * tree. One for each partition hierarchy.
+ */
List *subpaths; /* list of component Paths */
/* Index of first partial path in subpaths; list_length(subpaths) if none */
int first_partial_path;
@@ -1424,8 +1422,10 @@ extern bool is_dummy_rel(RelOptInfo *rel);
typedef struct MergeAppendPath
{
Path path;
- /* RT indexes of non-leaf tables in a partition tree */
- List *partitioned_rels;
+ List *partitioned_rels; /* List of Relids for each non-leaf
+ * partitioned table in the partition
+ * tree. One for each partition hierarchy.
+ */
List *subpaths; /* list of component Paths */
double limit_tuples; /* hard limit on output tuples, or -1 */
} MergeAppendPath;
diff --git a/src/test/regress/expected/partition_prune.out b/src/test/regress/expected/partition_prune.out
index 50d2a7e4b9..80e71b8e2b 100644
--- a/src/test/regress/expected/partition_prune.out
+++ b/src/test/regress/expected/partition_prune.out
@@ -3671,6 +3671,108 @@ explain (costs off) update listp1 set a = 1 where a = 2;
reset constraint_exclusion;
reset enable_partition_pruning;
drop table listp;
+-- Ensure run-time pruning works correctly for nested Append nodes
+set parallel_setup_cost to 0;
+set parallel_tuple_cost to 0;
+create table listp (a int) partition by list(a);
+create table listp_12 partition of listp for values in(1,2) partition by list(a);
+create table listp_12_1 partition of listp_12 for values in(1);
+create table listp_12_2 partition of listp_12 for values in(2);
+-- Force the 2nd subnode of the Append to be non-parallel. This results in
+-- a nested Append node because the mixed parallel / non-parallel paths cannot
+-- be pulled into the top-level Append.
+alter table listp_12_1 set (parallel_workers = 0);
+-- Ensure that listp_12_2 is not scanned. (The nested Append is not seen in
+-- the plan as it's pulled up in setrefs.c due to having just a single subnode.)
+explain (analyze on, costs off, timing off, summary off)
+select * from listp where a = (select 1);
+ QUERY PLAN
+----------------------------------------------------------------------
+ Gather (actual rows=0 loops=1)
+ Workers Planned: 2
+ Params Evaluated: $0
+ Workers Launched: 2
+ InitPlan 1 (returns $0)
+ -> Result (actual rows=1 loops=1)
+ -> Parallel Append (actual rows=0 loops=3)
+ -> Seq Scan on listp_12_1 listp_1 (actual rows=0 loops=1)
+ Filter: (a = $0)
+ -> Parallel Seq Scan on listp_12_2 listp_2 (never executed)
+ Filter: (a = $0)
+(11 rows)
+
+-- Like the above but throw some more complexity at the planner by adding
+-- a UNION ALL. We expect both sides of the union not to scan the
+-- non-required partitions.
+explain (analyze on, costs off, timing off, summary off)
+select * from listp where a = (select 1)
+ union all
+select * from listp where a = (select 2);
+ QUERY PLAN
+-----------------------------------------------------------------------------------
+ Append (actual rows=0 loops=1)
+ -> Gather (actual rows=0 loops=1)
+ Workers Planned: 2
+ Params Evaluated: $0
+ Workers Launched: 2
+ InitPlan 1 (returns $0)
+ -> Result (actual rows=1 loops=1)
+ -> Parallel Append (actual rows=0 loops=3)
+ -> Seq Scan on listp_12_1 listp_1 (actual rows=0 loops=1)
+ Filter: (a = $0)
+ -> Parallel Seq Scan on listp_12_2 listp_2 (never executed)
+ Filter: (a = $0)
+ -> Gather (actual rows=0 loops=1)
+ Workers Planned: 2
+ Params Evaluated: $1
+ Workers Launched: 2
+ InitPlan 2 (returns $1)
+ -> Result (actual rows=1 loops=1)
+ -> Parallel Append (actual rows=0 loops=3)
+ -> Seq Scan on listp_12_1 listp_4 (never executed)
+ Filter: (a = $1)
+ -> Parallel Seq Scan on listp_12_2 listp_5 (actual rows=0 loops=1)
+ Filter: (a = $1)
+(23 rows)
+
+drop table listp;
+reset parallel_tuple_cost;
+reset parallel_setup_cost;
+-- Test case for run-time pruning with a nested Merge Append
+set enable_sort to 0;
+create table rangep (a int, b int) partition by range (a);
+create table rangep_0_to_100 partition of rangep for values from (0) to (100) partition by list (b);
+-- We need 3 sub-partitions. 1 to validate pruning worked and another two
+-- because a single remaining partition would be pulled up to the main Append.
+create table rangep_0_to_100_1 partition of rangep_0_to_100 for values in(1);
+create table rangep_0_to_100_2 partition of rangep_0_to_100 for values in(2);
+create table rangep_0_to_100_3 partition of rangep_0_to_100 for values in(3);
+create table rangep_100_to_200 partition of rangep for values from (100) to (200);
+create index on rangep (a);
+-- Ensure run-time pruning works on the nested Merge Append
+explain (analyze on, costs off, timing off, summary off)
+select * from rangep where b IN((select 1),(select 2)) order by a;
+ QUERY PLAN
+------------------------------------------------------------------------------------------------------------
+ Append (actual rows=0 loops=1)
+ InitPlan 1 (returns $0)
+ -> Result (actual rows=1 loops=1)
+ InitPlan 2 (returns $1)
+ -> Result (actual rows=1 loops=1)
+ -> Merge Append (actual rows=0 loops=1)
+ Sort Key: rangep_2.a
+ -> Index Scan using rangep_0_to_100_1_a_idx on rangep_0_to_100_1 rangep_2 (actual rows=0 loops=1)
+ Filter: (b = ANY (ARRAY[$0, $1]))
+ -> Index Scan using rangep_0_to_100_2_a_idx on rangep_0_to_100_2 rangep_3 (actual rows=0 loops=1)
+ Filter: (b = ANY (ARRAY[$0, $1]))
+ -> Index Scan using rangep_0_to_100_3_a_idx on rangep_0_to_100_3 rangep_4 (never executed)
+ Filter: (b = ANY (ARRAY[$0, $1]))
+ -> Index Scan using rangep_100_to_200_a_idx on rangep_100_to_200 rangep_5 (actual rows=0 loops=1)
+ Filter: (b = ANY (ARRAY[$0, $1]))
+(15 rows)
+
+reset enable_sort;
+drop table rangep;
--
-- Check that gen_prune_steps_from_opexps() works well for various cases of
-- clauses for different partition keys
diff --git a/src/test/regress/sql/partition_prune.sql b/src/test/regress/sql/partition_prune.sql
index 1e904a8c5b..939a9b1193 100644
--- a/src/test/regress/sql/partition_prune.sql
+++ b/src/test/regress/sql/partition_prune.sql
@@ -1051,6 +1051,55 @@ reset enable_partition_pruning;
drop table listp;
+-- Ensure run-time pruning works correctly for nested Append nodes
+set parallel_setup_cost to 0;
+set parallel_tuple_cost to 0;
+
+create table listp (a int) partition by list(a);
+create table listp_12 partition of listp for values in(1,2) partition by list(a);
+create table listp_12_1 partition of listp_12 for values in(1);
+create table listp_12_2 partition of listp_12 for values in(2);
+
+-- Force the 2nd subnode of the Append to be non-parallel. This results in
+-- a nested Append node because the mixed parallel / non-parallel paths cannot
+-- be pulled into the top-level Append.
+alter table listp_12_1 set (parallel_workers = 0);
+
+-- Ensure that listp_12_2 is not scanned. (The nested Append is not seen in
+-- the plan as it's pulled in setref.c due to having just a single subnode).
+explain (analyze on, costs off, timing off, summary off)
+select * from listp where a = (select 1);
+
+-- Like the above but throw some more complexity at the planner by adding
+-- a UNION ALL. We expect both sides of the union not to scan the
+-- non-required partitions.
+explain (analyze on, costs off, timing off, summary off)
+select * from listp where a = (select 1)
+ union all
+select * from listp where a = (select 2);
+
+drop table listp;
+reset parallel_tuple_cost;
+reset parallel_setup_cost;
+
+-- Test case for run-time pruning with a nested Merge Append
+set enable_sort to 0;
+create table rangep (a int, b int) partition by range (a);
+create table rangep_0_to_100 partition of rangep for values from (0) to (100) partition by list (b);
+-- We need 3 sub-partitions. 1 to validate pruning worked and another two
+-- because a single remaining partition would be pulled up to the main Append.
+create table rangep_0_to_100_1 partition of rangep_0_to_100 for values in(1);
+create table rangep_0_to_100_2 partition of rangep_0_to_100 for values in(2);
+create table rangep_0_to_100_3 partition of rangep_0_to_100 for values in(3);
+create table rangep_100_to_200 partition of rangep for values from (100) to (200);
+create index on rangep (a);
+
+-- Ensure run-time pruning works on the nested Merge Append
+explain (analyze on, costs off, timing off, summary off)
+select * from rangep where b IN((select 1),(select 2)) order by a;
+reset enable_sort;
+drop table rangep;
+
--
-- Check that gen_prune_steps_from_opexps() works well for various cases of
-- clauses for different partition keys
--
2.27.0
On Sun, Oct 25, 2020 at 10:06 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Fri, 16 Oct 2020 at 03:01, Amit Langote <amitlangote09@gmail.com> wrote:
Going over the last few emails, it seems that David held off from
committing the patch, because of the lack of confidence in its
robustness. With the patch, a sub-partitioned child's partial
Append's child paths will *always* be pulled up into the parent's set
of partial child paths thus preventing the nesting of Appends, which
the run-time pruning code can't currently cope with. The lack of
confidence seems to be due to the fact that always pulling up a
sub-Append's child paths into the parent partial Append's child paths
*may* cause the latter to behave wrongly under parallelism and the new
code structure will prevent add_partial_path() from being the
arbitrator of whether such a path is really the best in terms of cost.

If we can't be confident in that approach, maybe we should consider
making the run-time pruning code cope with nested Appends?

I've been thinking about this and my thoughts are:
Thanks for working on this.
There are other cases where we don't pullup sub-Merge Append nodes
anyway, so I think we should just make run-time pruning work for more
than just the top-level Append/Merge Append.

The case I'm thinking about is the code added in 959d00e9dbe for
ordered Append scans. It's possible a sub-partitioned table has
partitions which don't participate in the same ordering. We need to
keep the sub-Merge Append in those cases... well, at least when
there's more than 1 partition remaining after plan-time pruning.
Ah, I guess that case would likewise fail to use run-time pruning properly.
I've attached the patch which, pending a final look, I'm proposing to
commit for master only. I just don't quite think this is a bug fix,
and certainly, due to the invasiveness of the proposed patch, that
means no backpatch.

I fixed all the stuff about the incorrectly set partitioned_rels list.
What I ended up with there is making it accumulate_append_subpath's
job to also pull up the sub-path's partitioned_rels fields when pulling
up a nested Append/MergeAppend. If there's no pullup, then we
don't care about the sub-path's partitioned_rels. That's for it to
deal with. With that, I think that run-time pruning will only get the
RT indexes for partitions that we actually have sub-paths for. That's
pretty good, so I added an Assert() to verify that in
make_partitionedrel_pruneinfo(). (I hope I don't regret that later)
So AIUI, the fix here is to make any given Append/MergeAppend's
part_prune_info only contain PartitionedRelPruneInfos of the (sub-)
partitioned tables whose partitions' subplans are directly in
appendplan/mergeplans, such that the partition indexes can be mapped
to the subplan indexes. That does imply present_parts must be
non-empty, so the Assert seems warranted.
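[Editorial aside: the mapping described above can be illustrated with a toy
sketch. This is plain Python, not PostgreSQL's C code; the name subplan_map
echoes the planner's PartitionedRelPruneInfo field, but the model is purely
illustrative.]

```python
# Toy model: an Append's prune info covers only partitioned tables whose
# partitions' subplans hang directly off that Append, so each surviving
# partition index maps onto a subplan index under the same node.
subplans = ["scan part_a", "scan part_b", "scan part_c"]

# subplan_map: for each partition, the index of its subplan under this
# Append, or -1 if the partition was already pruned at plan time.
subplan_map = [0, 1, 2]

def prune(surviving_partitions):
    """Run-time pruning: keep only the subplans of surviving partitions."""
    keep = {subplan_map[p] for p in surviving_partitions if subplan_map[p] >= 0}
    return [s for i, s in enumerate(subplans) if i in keep]

assert prune({1}) == ["scan part_b"]
assert prune({0, 2}) == ["scan part_a", "scan part_c"]
```

If prune info were attached only to a top-level Append while the subplans
actually lived under a nested Append, this index mapping would no longer
line up, which is the failure mode discussed in this thread.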
This does mean we need to maintain a different partitioned_rels list
for each Append path we consider. So there's a number (six) of these
now between add_paths_to_append_rel() and
generate_orderedappend_paths(). To try to minimise the impact of
that, I've changed the field so that instead of being a List of
IntLists, it's just a List of Relids. The top-level list just
contains a single element until you get a UNION ALL that selects from
a partitioned table on each side of the union. Merging sub-path
partitioned rel RT indexes into the existing element is now pretty
cheap as it just uses bms_add_members() rather than the list_concat we'd
have had to have used if it was still a List of IntLists.
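[Editorial aside: the data-structure change described above can be sketched
conceptually. This is Python standing in for the C structures: Python sets
model Relids/Bitmapsets and nested lists model a List of IntLists; the real
code uses bms_add_members() and list_concat.]

```python
# Old representation: a List of IntLists, one IntList of partitioned-table
# RT indexes per partition hierarchy. Pulling up a sub-Append meant
# concatenating lists, which can leave duplicates.
old_parent = [[1, 2]]            # top-level hierarchy: RT indexes 1 and 2
old_child = [[2, 3]]             # sub-Append's hierarchy overlaps at 2
old_merged = [old_parent[0] + old_child[0]]   # list_concat analogue
assert old_merged == [[1, 2, 2, 3]]

# New representation: a List of Relids, a single set of RT indexes per
# hierarchy. Pulling up the sub-Append's partitioned_rels is a cheap,
# duplicate-free set union (bms_add_members analogue).
new_parent = [{1, 2}]
new_child = [{2, 3}]
new_merged = [new_parent[0] | new_child[0]]
assert new_merged == [{1, 2, 3}]
```

The top-level list grows past one element only in cases like a UNION ALL
over partitioned tables, as described above.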
The refactoring seemed complicated on a first look, but overall looks good.
After fixing up how partitioned_rels is built, I saw there were no
usages of RelOptInfo.partitioned_child_rels, so I got rid of it.
+1
I did another couple of little cleanups and wrote some regression
tests to test all this.Overall I'm fairly happy with this, especially getting rid of a
partitioned table related field from RelOptInfo.
Some comments:
+ * For partitioned tables, we accumulate a list of the partitioned RT
+ * indexes for the subpaths that are directly under this Append.
Maybe:
For partitioned tables, accumulate a list of the RT indexes of
partitioned tables in the tree whose partitions' subpaths are directly
under this Append.
+ * lists for each Append Path that we create as accumulate_append_subpath
+ * sometimes can't flatten sub-Appends into the top-level Append.
How about expanding that reason a little bit as:
...can't flatten sub-Appends into the top-level Append which prevents
the former's partitioned_rels from being pulled into the latter's.
+ * most one element which is a RelIds of the partitioned relations which there
s/RelIds/Relids
+ * are subpaths for. In this case we just add the RT indexes for the
+ * partitioned tables for the subpath we're pulling up to the single entry in
+ * 'partitioned_rels'.
How about:
In this case, we just pull the RT indexes contained in
sub-Append/MergeAppend's partitioned_rels into the single entry of
*partitioned_rels, which belongs to the parent Append.
* relid_subpart_map maps relid of a non-leaf partition to the index
- * in 'partitioned_rels' of that rel (which will also be the index in
- * the returned PartitionedRelPruneInfo list of the info for that
+ * in 'partrelids' of that rel (which will also be the index in the
+ * returned PartitionedRelPruneInfo list of the info for that
...the index in 'partrelids', which in the new code is a bitmap set,
sounds a bit odd. Maybe just mention the index in the list of
PartitionedRelPruneInfos as:
relid_subpart_map maps relid of a given non-leaf partition in
'partrelids' to the index of its PartitionedRelPruneInfo in the
returned list.
+ /*
+ * Ensure there were no stray PartitionedRelPruneInfo generated for
+ * partitioned tables that had no sub-paths for.
+ */
+ Assert(!bms_is_empty(present_parts));
Maybe you meant:
...for partitioned tables for which we had neither leaf subpaths nor
sub-PartitionedRelPruneInfos.
+ List *partitioned_rels; /* List of Relids for each non-leaf
+ * partitioned table in the partition
+ * tree. One for each partition hierarchy.
+ */
List *subpaths; /* list of component Paths */
How about describing partitioned_rels as follows:
List of Relids set containing RT indexes of non-leaf tables for each
partition hierarchy whose paths are in 'subpaths'
--
Amit Langote
EDB: http://www.enterprisedb.com
On Tue, 27 Oct 2020 at 19:40, Amit Langote <amitlangote09@gmail.com> wrote:
Some comments:
Thanks for having a look at this.
I've made some adjustments to those comments and pushed.
David
On Mon, Nov 2, 2020 at 9:51 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 27 Oct 2020 at 19:40, Amit Langote <amitlangote09@gmail.com> wrote:
Some comments:
Thanks for having a look at this.
I've made some adjustments to those comments and pushed.
Thank you.
--
Amit Langote
EDB: http://www.enterprisedb.com
On Mon, Nov 02, 2020 at 01:50:57PM +1300, David Rowley wrote:
On Tue, 27 Oct 2020 at 19:40, Amit Langote <amitlangote09@gmail.com> wrote:
Some comments:
Thanks for having a look at this.
I've made some adjustments to those comments and pushed.
commit a929e17e5 doesn't appear in the v14 release notes, but I wanted to
mention that this appears to allow fixing a rowcount mis-estimate for us,
which I had reported before:
/messages/by-id/20170326193344.GS31628@telsasoft.com
/messages/by-id/20170415002322.GA24216@telsasoft.com
/messages/by-id/20170524211730.GM31097@telsasoft.com
And others have reported before:
/messages/by-id/7DF51702-0F6A-4571-80BB-188AAEF260DA@gmail.com
/messages/by-id/SG2PR01MB29673BE6F7AA24424FDBFF60BC670@SG2PR01MB2967.apcprd01.prod.exchangelabs.com
For years, our reports have included a generated WHERE clause for each table
being queried, to allow each table's partitions to be properly pruned/excluded
(not just one table, as happened if we used a single WHERE clause).
That worked, but then the planner underestimates the rowcount, since it doesn't
realize that the conditions are redundant (since "equality classes" do not
handle the inequality conditions).
In v14, one WHERE clause per table still gives an underestimate; but now
multiple WHERE clauses aren't required, because a single WHERE clause excludes
partitions from each table, and the rowcount from the elided partitions is
excluded from the Append rowcount at plan time.
Thanks for this feature !
--
Justin