Record a Bitmapset of non-pruned partitions

Started by David Rowleyover 4 years ago15 messages
#1David Rowley
dgrowleyml@gmail.com
1 attachment(s)

Back when working on 959d00e9d, to allow ordered partition scans for
LIST and RANGE partitioned tables, I mentioned [1]/messages/by-id/CAKJS1f9W7sg1sb3SXiTQUovs=wDMrHATXv68F5dSbe5fuHH+iQ@mail.gmail.com that if we had a
field that recorded a Bitmapset of the non-pruned partitions, we could
use that to do a more thorough check to see if ordered scans are
possible.

At the moment these ordered scans are possible if the order by matches
the partition key and, for LIST partitioned tables we must also ensure
that we don't have something like partition1 FOR VALUES IN(1,3) and
partition2 FOR VALUES(2,4). Here we can't scan partition1 first before
we'd get 3s before the 2s that are in partition2. Since we don't
record the list of partitions that survived pruning, I just made that
check to ensure all partitions only allow a single value to be stored.
That means the optimisation is not done in cases where it's possible
to do it.

To make this work, as I mentioned in [1]/messages/by-id/CAKJS1f9W7sg1sb3SXiTQUovs=wDMrHATXv68F5dSbe5fuHH+iQ@mail.gmail.com, we really need a live_parts
field to track which partitions survived pruning. If we have that
then we can ensure that just those partitions have no instances where
a lower value could appear in a later partition.

In the attached patch, I've only added the live_parts field and made
use of it in a few locations where the knowledge is useful as an
optimisation. Right now apply_scanjoin_target_to_paths() appears in
profiles when planning a query to a partitioned table with many
partitions and just 1 or a few survive pruning. The reason for this is
simple; looping over a large almost entirely empty array is just slow
and using the Bitmapset as a sort of index into the interesting
elements of that array speeds it up, quite a bit.

Since we've already put quite a bit of effort into making that fast,
then I think it might be worth adding this field even if it was just
for that purpose.

With 10k empty hash partitions and 1 random one surviving partition
pruning, I see:

Master:
18.13% postgres postgres [.] apply_scanjoin_target_to_paths
3.66% postgres postgres [.] AllocSetAlloc
2.05% postgres postgres [.] hash_search_with_hash_value
1.95% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms
1.88% postgres postgres [.] SearchCatCacheInternal
1.55% postgres postgres [.] base_yyparse
0.74% postgres postgres [.] get_relation_info
0.68% postgres [kernel.kallsyms] [k] __d_lookup_rcu
0.61% postgres postgres [.] palloc

Patched:
3.72% postgres postgres [.] AllocSetAlloc
2.30% postgres postgres [.] hash_search_with_hash_value
2.22% postgres postgres [.] SearchCatCacheInternal
2.02% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms
1.88% postgres postgres [.] base_yyparse
1.08% postgres postgres [.] palloc
0.92% postgres [kernel.kallsyms] [k] __d_lookup_rcu

$ cat setup.sql
create table hp (a int primary key, b int not null) partition by hash(a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 10000, remainder '||x||');' from generate_series(0,9999) x;
\gexec

$ cat select.sql
\set p random(1,2000000)
select * from hp where a = :p

master

$ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
tps = 2608.704218 (without initial connection time)
tps = 2607.641247 (without initial connection time)
tps = 2583.017011 (without initial connection time)

patched

$ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
tps = 2715.993495 (without initial connection time)
tps = 2701.527640 (without initial connection time)
tps = 2707.343009 (without initial connection time)

Does anyone have any thoughts about if this is worth a new field in RelOptInfo?

David

[1]: /messages/by-id/CAKJS1f9W7sg1sb3SXiTQUovs=wDMrHATXv68F5dSbe5fuHH+iQ@mail.gmail.com

Attachments:

add_RelOptInfo_live_parts_field.patchapplication/octet-stream; name=add_RelOptInfo_live_parts_field.patchDownload
diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e32b92e299..a35160a81c 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2385,6 +2385,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_BOOL_FIELD(consider_partitionwise_join);
 	WRITE_BITMAPSET_FIELD(top_parent_relids);
 	WRITE_BOOL_FIELD(partbounds_merged);
+	WRITE_BITMAPSET_FIELD(live_parts);
 	WRITE_BITMAPSET_FIELD(all_partrels);
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..8b69870cf4 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1539,6 +1539,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 												 child_sjinfo,
 												 child_sjinfo->jointype);
 			joinrel->part_rels[cnt_parts] = child_joinrel;
+			joinrel->live_parts = bms_add_member(joinrel->live_parts, cnt_parts);
 			joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
 													child_joinrel->relids);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1868c4eff4..2a0dee9b5b 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6989,19 +6989,22 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	if (rel_is_partitioned)
 	{
 		List	   *live_children = NIL;
-		int			partition_idx;
+		int			i;
 
 		/* Adjust each partition. */
-		for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+		i = -1;
+		while ((i = bms_next_member(rel->live_parts, i)) >= 0)
 		{
-			RelOptInfo *child_rel = rel->part_rels[partition_idx];
+			RelOptInfo *child_rel = rel->part_rels[i];
 			AppendRelInfo **appinfos;
 			int			nappinfos;
 			List	   *child_scanjoin_targets = NIL;
 			ListCell   *lc;
 
-			/* Pruned or dummy children can be ignored. */
-			if (child_rel == NULL || IS_DUMMY_REL(child_rel))
+			Assert(child_rel != NULL);
+
+			/* Dummy children can be ignored. */
+			if (IS_DUMMY_REL(child_rel))
 				continue;
 
 			/* Translate scan/join targets for this child. */
@@ -7082,32 +7085,35 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 									PartitionwiseAggregateType patype,
 									GroupPathExtraData *extra)
 {
-	int			nparts = input_rel->nparts;
-	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
 	PathTarget *target = grouped_rel->reltarget;
 	bool		partial_grouping_valid = true;
+	int			i;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
 	/* Add paths for partitionwise aggregation/grouping. */
-	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+	i = -1;
+	while ((i = bms_next_member(input_rel->live_parts, i)) >= 0)
 	{
-		RelOptInfo *child_input_rel = input_rel->part_rels[cnt_parts];
-		PathTarget *child_target = copy_pathtarget(target);
+		RelOptInfo *child_input_rel = input_rel->part_rels[i];
+		PathTarget *child_target;
 		AppendRelInfo **appinfos;
 		int			nappinfos;
 		GroupPathExtraData child_extra;
 		RelOptInfo *child_grouped_rel;
 		RelOptInfo *child_partially_grouped_rel;
 
-		/* Pruned or dummy children can be ignored. */
-		if (child_input_rel == NULL || IS_DUMMY_REL(child_input_rel))
+		Assert(child_input_rel != NULL);
+
+		/* Dummy children can be ignored. */
+		if (IS_DUMMY_REL(child_input_rel))
 			continue;
 
+		child_target = copy_pathtarget(target);
 		/*
 		 * Copy the given "extra" structure as is and then override the
 		 * members specific to this child.
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 992ef87b9d..c758172efa 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -348,7 +348,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
 	 * that survive pruning.  Below, we will initialize child objects for the
 	 * surviving partitions.
 	 */
-	live_parts = prune_append_rel_partitions(relinfo);
+	relinfo->live_parts = live_parts = prune_append_rel_partitions(relinfo);
 
 	/* Expand simple_rel_array and friends to hold child objects. */
 	num_live_parts = bms_num_members(live_parts);
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..47769cea45 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -255,6 +255,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
 	rel->partbounds_merged = false;
 	rel->partition_qual = NIL;
 	rel->part_rels = NULL;
+	rel->live_parts = NULL;
 	rel->all_partrels = NULL;
 	rel->partexprs = NULL;
 	rel->nullable_partexprs = NULL;
@@ -669,6 +670,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
@@ -847,6 +849,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index c79374265c..e00edbe5c8 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -654,15 +654,14 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		relid_map = (Oid *) palloc0(nparts * sizeof(Oid));
 		present_parts = NULL;
 
-		for (i = 0; i < nparts; i++)
+		i = -1;
+		while ((i = bms_next_member(subpart->live_parts, i)) >= 0)
 		{
 			RelOptInfo *partrel = subpart->part_rels[i];
 			int			subplanidx;
 			int			subpartidx;
 
-			/* Skip processing pruned partitions. */
-			if (partrel == NULL)
-				continue;
+			Assert(partrel != NULL);
 
 			subplan_map[i] = subplanidx = relid_subplan_map[partrel->relid] - 1;
 			subpart_map[i] = subpartidx = relid_subpart_map[partrel->relid] - 1;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index b7b2817a5d..ff6a641b6e 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -762,6 +762,9 @@ typedef struct RelOptInfo
 	List	   *partition_qual; /* Partition constraint, if not the root */
 	struct RelOptInfo **part_rels;	/* Array of RelOptInfos of partitions,
 									 * stored in the same order as bounds */
+	Bitmapset  *live_parts;		/* Bitmap with members to indicate which
+								 * elements of the 'part_rels' array are
+								 * filled */
 	Relids		all_partrels;	/* Relids set of all partition relids */
 	List	  **partexprs;		/* Non-nullable partition key expressions */
 	List	  **nullable_partexprs; /* Nullable partition key expressions */
#2Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#1)
Re: Record a Bitmapset of non-pruned partitions

On Wed, Jun 30, 2021 at 10:59 PM David Rowley <dgrowleyml@gmail.com> wrote:

Back when working on 959d00e9d, to allow ordered partition scans for
LIST and RANGE partitioned tables, I mentioned [1] that if we had a
field that recorded a Bitmapset of the non-pruned partitions, we could
use that to do a more thorough check to see if ordered scans are
possible.

At the moment these ordered scans are possible if the order by matches
the partition key and, for LIST partitioned tables we must also ensure
that we don't have something like partition1 FOR VALUES IN(1,3) and
partition2 FOR VALUES(2,4). Here we can't scan partition1 first before
we'd get 3s before the 2s that are in partition2. Since we don't
record the list of partitions that survived pruning, I just made that
check to ensure all partitions only allow a single value to be stored.
That means the optimisation is not done in cases where it's possible
to do it.

To make this work, as I mentioned in [1], we really need a live_parts
field to track which partitions survived pruning. If we have that
then we can ensure that just those partitions have no instances where
a lower value could appear in a later partition.

In the attached patch, I've only added the live_parts field and made
use of it in a few locations where the knowledge is useful as an
optimisation. Right now apply_scanjoin_target_to_paths() appears in
profiles when planning a query to a partitioned table with many
partitions and just 1 or a few survive pruning. The reason for this is
simple; looping over a large almost entirely empty array is just slow
and using the Bitmapset as a sort of index into the interesting
elements of that array speeds it up, quite a bit.

Since we've already put quite a bit of effort into making that fast,
then I think it might be worth adding this field even if it was just
for that purpose.

With 10k empty hash partitions and 1 random one surviving partition
pruning, I see:

Master:
18.13% postgres postgres [.] apply_scanjoin_target_to_paths
3.66% postgres postgres [.] AllocSetAlloc
2.05% postgres postgres [.] hash_search_with_hash_value
1.95% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms
1.88% postgres postgres [.] SearchCatCacheInternal
1.55% postgres postgres [.] base_yyparse
0.74% postgres postgres [.] get_relation_info
0.68% postgres [kernel.kallsyms] [k] __d_lookup_rcu
0.61% postgres postgres [.] palloc

Patched:
3.72% postgres postgres [.] AllocSetAlloc
2.30% postgres postgres [.] hash_search_with_hash_value
2.22% postgres postgres [.] SearchCatCacheInternal
2.02% postgres libc-2.33.so [.] __memset_avx2_unaligned_erms
1.88% postgres postgres [.] base_yyparse
1.08% postgres postgres [.] palloc
0.92% postgres [kernel.kallsyms] [k] __d_lookup_rcu

$ cat setup.sql
create table hp (a int primary key, b int not null) partition by hash(a);
select 'create table hp'||x|| ' partition of hp for values with
(modulus 10000, remainder '||x||');' from generate_series(0,9999) x;
\gexec

$ cat select.sql
\set p random(1,2000000)
select * from hp where a = :p

master

$ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
tps = 2608.704218 (without initial connection time)
tps = 2607.641247 (without initial connection time)
tps = 2583.017011 (without initial connection time)

patched

$ pgbench -n -f select.sql -c 1 -j 1 -T 60 postgres
tps = 2715.993495 (without initial connection time)
tps = 2701.527640 (without initial connection time)
tps = 2707.343009 (without initial connection time)

Does anyone have any thoughts about if this is worth a new field in RelOptInfo?

+1 from me.

I had proposed adding a live_parts bitmapset back in [1]/messages/by-id/3f280722-46f2-c2a4-4c19-2cfa28c6c1cd@lab.ntt.co.jp (v12 work on
speeding up planning with partition) to address the
apply_scanjoin_target_to_paths() inefficiency among other things,
though Tom seemed to think that it wouldn't be worthwhile if only for
that purpose. He'd suggested that we fix things elsewhere such that
that function is not needed in the first place [2]/messages/by-id/3529.1554051867@sss.pgh.pa.us, something I keep
thinking about in between doing other things, but never sit down to
actually write a patch.

Given that you're proposing more uses for live_parts, maybe he'd be
open to the idea.

--
Amit Langote
EDB: http://www.enterprisedb.com

[1]: /messages/by-id/3f280722-46f2-c2a4-4c19-2cfa28c6c1cd@lab.ntt.co.jp
[2]: /messages/by-id/3529.1554051867@sss.pgh.pa.us

#3David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#2)
2 attachment(s)
Re: Record a Bitmapset of non-pruned partitions

On Thu, 1 Jul 2021 at 17:49, Amit Langote <amitlangote09@gmail.com> wrote:

Given that you're proposing more uses for live_parts, maybe he'd be
open to the idea.

Just to make sure the new field in the 0001 patch gets good enough
use, I've attached the patch which includes more usages of the field.

0002 adds a new field named interleaved_parts to PartitionBoundInfo
which is populated for LIST partitioned tables with any partitions
which have interleaved values, e.g FOR VALUES IN(3,5) and another
partition with FOR VALUES IN(4), the 3,5 partition is "interleaved"
around the partition for 4.

This combined with recording "live_parts" in the 0001 patch allows us
to do ordered partition scans in many more cases for LIST partitioning
and 1 more case with RANGE partitioning.

create table mclparted (a int) partition by list(a);
create table mclparted1 partition of mclparted for values in(1);
create table mclparted2 partition of mclparted for values in(2);
create table mclparted3_5 partition of mclparted for values in(3,5);
create table mclparted4 partition of mclparted for values in(4);
create index on mclparted (a);

set enable_bitmapscan=0;
set enable_sort=0;

-- ordered scan using Append
explain (costs off) select * from mclparted where a in(1,2) order by a;
QUERY PLAN
------------------------------------------------------------------------
Append
-> Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
Index Cond: (a = ANY ('{1,2}'::integer[]))
-> Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
Index Cond: (a = ANY ('{1,2}'::integer[]))

-- no ordered scan due to interleaved partition. Must use Merge Append
explain (costs off) select * from mclparted where a in(3,4) order by a;
QUERY PLAN
----------------------------------------------------------------------------
Merge Append
Sort Key: mclparted.a
-> Index Only Scan using mclparted3_5_a_idx on mclparted3_5 mclparted_1
Index Cond: (a = ANY ('{3,4}'::integer[]))
-> Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_2
Index Cond: (a = ANY ('{3,4}'::integer[]))

Currently, this is a bit more strict than maybe it needs to be. I'm
disabling the optimisation if any interleaved partitions remain after
pruning, however, it would be ok to allow them providing their
interleaved partner(s) were pruned. I think making that work might be
a bit more costly as we'd need to track all partitions that were
interleaved with each interleaved partition and ensure those were all
pruned. As far as I can see that requires storing a Bitmapset per
interleaved partition and makes the whole thing not so cheap. I'd
really like to keep all this stuff cheap as possible. That's why I
ended up calculating the interleaved partitions in
create_list_bounds() rather than partitions_are_ordered().

The good news is that the code in partitions_are_ordered() became even
more simple as a result of this change. We can do ordered scan simply
when !bms_overlap(live_parts, boundinfo->interleaved_parts).

The additional case we can now allow for RANGE partition is that we
can now do ordered scan when there is a DEFAULT partition but it was
pruned. Previously we had to disable the optimisation when there was a
DEFAULT partition as we had no idea if it was pruned or not.

David

Attachments:

v2-0002-Allow-ordered-partition-scans-in-more-cases.patchapplication/octet-stream; name=v2-0002-Allow-ordered-partition-scans-in-more-cases.patchDownload
From 472f1ae24f9722a93e3cd695e0b874bb01104603 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 10 Jul 2021 00:55:00 +1200
Subject: [PATCH v2 2/2] Allow ordered partition scans in more cases

Previously we couldn't allow ordered partition scans for LIST partitioned
table when there were any partitions that allowed multiple Datums.  This
was a very cheap check to make and we could likely have done a little
better by checking if there were interleaved partitions, but at the time
we didn't have visibility about which partitions were pruned, so we still
may have disallowed cases when the interleaved partitions were pruned.

Since we now have knowledge of pruned partitions, we can do a much better
job inside partitions_are_ordered().

Here we pass which partitioned survived partition pruning into
partitions_are_ordered() and, for LIST partitioning, have it check to see
if any live partitions exist in the new "interleaved_parts" field defined
in PartitionBoundInfo.

For RANGE partitioning we can relax the code which caused the partitions
to be unordered if a DEFAULT partition existed.  Since we now know which
partitions were pruned, partitions_are_ordered() now returns true when the
DEFAULT partition was pruned.
---
 src/backend/optimizer/path/allpaths.c |   2 +-
 src/backend/optimizer/path/pathkeys.c |   2 +-
 src/backend/partitioning/partbounds.c |  87 +++++++++++++++------
 src/include/partitioning/partbounds.h |  13 +++-
 src/test/regress/expected/inherit.out | 108 ++++++++++++++++++++++++++
 src/test/regress/sql/inherit.sql      |  27 +++++++
 6 files changed, 212 insertions(+), 27 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 17febfff8a..9d62d1cd65 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1689,7 +1689,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	 * for both forward and reverse scans.
 	 */
 	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
-		partitions_are_ordered(rel->boundinfo, rel->nparts))
+		partitions_are_ordered(rel->boundinfo, rel->live_parts))
 	{
 		partition_pathkeys = build_partition_pathkeys(root, rel,
 													  ForwardScanDirection,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index bd9a176d7d..216dd26385 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -704,7 +704,7 @@ build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
 	int			i;
 
 	Assert(partscheme != NULL);
-	Assert(partitions_are_ordered(partrel->boundinfo, partrel->nparts));
+	Assert(partitions_are_ordered(partrel->boundinfo, partrel->live_parts));
 	/* For now, we can only cope with baserels */
 	Assert(IS_SIMPLE_REL(partrel));
 
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 38baaefcda..7d23867345 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -396,6 +396,7 @@ create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->ndatums = nparts;
 	boundinfo->datums = (Datum **) palloc0(nparts * sizeof(Datum *));
 	boundinfo->kind = NULL;
+	boundinfo->interleaved_parts = NULL;
 	boundinfo->nindexes = greatest_modulus;
 	boundinfo->indexes = (int *) palloc(greatest_modulus * sizeof(int));
 	for (i = 0; i < greatest_modulus; i++)
@@ -544,6 +545,7 @@ create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->ndatums = ndatums;
 	boundinfo->datums = (Datum **) palloc0(ndatums * sizeof(Datum *));
 	boundinfo->kind = NULL;
+	boundinfo->interleaved_parts = NULL;
 	boundinfo->nindexes = ndatums;
 	boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
 
@@ -608,6 +610,51 @@ create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
 		boundinfo->default_index = (*mapping)[default_index];
 	}
 
+	/*
+	 * Calculate interleaved partitions.  For there to be any of these there
+	 * has to be a partition that allows more than one Datum, so we can do a
+	 * quick check and skip building the interleaved partitions when there's
+	 * exactly 1 Datum per partition.
+	 */
+	if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo) != nparts)
+	{
+		int			last_index = -1;
+
+		/*
+		 * Since the indexes array is sorted in Datum order, if any partitions
+		 * are interleaved then it will show up by the partition indexes not
+		 * being in ascending order.  Here we check for that and record all
+		 * partitions that are out of order.  These are the interleaved ones.
+		 */
+		for (i = 0; i < boundinfo->nindexes; i++)
+		{
+			int			index = boundinfo->indexes[i];
+
+			if (index < last_index)
+				boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+															  index);
+
+			last_index = index;
+		}
+
+		/*
+		 * If the NULL partition is not last then it must allow some other
+		 * Datum.  We must also handle a corner case where there only is a
+		 * NULL partition but it does also allow other Datums.  In this case
+		 * the NULL partition will be index 0.
+		 */
+		if (partition_bound_accepts_nulls(boundinfo) &&
+			(boundinfo->null_index != nparts - 1 ||
+			 (boundinfo->null_index == 0 && boundinfo->ndatums > 0)))
+			boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+														  boundinfo->null_index);
+	}
+
+	/* Ensure we add the DEFAULT partition, if it exists */
+	if (partition_bound_has_default(boundinfo))
+		boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+													  boundinfo->default_index);
+
 	/* All partitions must now have been assigned canonical indexes. */
 	Assert(next_index == nparts);
 	return boundinfo;
@@ -751,6 +798,7 @@ create_range_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->kind = (PartitionRangeDatumKind **)
 		palloc(ndatums *
 			   sizeof(PartitionRangeDatumKind *));
+	boundinfo->interleaved_parts = NULL;
 
 	/*
 	 * For range partitioning, an additional value of -1 is stored as the last
@@ -994,6 +1042,9 @@ partition_bounds_copy(PartitionBoundInfo src,
 	else
 		dest->kind = NULL;
 
+	/* copy interleaved partitions for LIST partitioned tables */
+	dest->interleaved_parts = bms_copy(src->interleaved_parts);
+
 	/*
 	 * For hash partitioning, datums array will have two elements - modulus
 	 * and remainder.
@@ -2781,13 +2832,15 @@ add_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
  *		that is partitions appearing earlier in the PartitionDesc sequence
  *		contain partition keys strictly less than those appearing later.
  *		Also, if NULL values are possible, they must come in the last
- *		partition defined in the PartitionDesc.
+ *		partition defined in the PartitionDesc.  'live_parts' marks which
+ *		partitions we should include when checking the ordering.  Partitions
+ *		that do not appear in 'live_parts' are ignored.
  *
  * If out of order, or there is insufficient info to know the order,
  * then we return false.
  */
 bool
-partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts)
+partitions_are_ordered(PartitionBoundInfo boundinfo, Bitmapset *live_parts)
 {
 	Assert(boundinfo != NULL);
 
@@ -2799,38 +2852,24 @@ partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts)
 			 * RANGE-type partitioning guarantees that the partitions can be
 			 * scanned in the order that they're defined in the PartitionDesc
 			 * to provide sequential, non-overlapping ranges of tuples.
-			 * However, if a DEFAULT partition exists then it doesn't work, as
-			 * that could contain tuples from either below or above the
-			 * defined range, or tuples belonging to gaps between partitions.
+			 * However, if a DEFAULT partition exists and it's contained
+			 * within live_parts, then the partitions are not ordered.
 			 */
-			if (!partition_bound_has_default(boundinfo))
+			if (!partition_bound_has_default(boundinfo) ||
+				!bms_is_member(boundinfo->default_index, live_parts))
 				return true;
 			break;
 
 		case PARTITION_STRATEGY_LIST:
 
 			/*
-			 * LIST partitioning can also guarantee ordering, but only if the
-			 * partitions don't accept interleaved values.  We could likely
-			 * check for this by looping over the PartitionBound's indexes
-			 * array to check that the indexes are in order.  For now, let's
-			 * just keep it simple and just accept LIST partitioning when
-			 * there's no DEFAULT partition, exactly one value per partition,
-			 * and optionally a NULL partition that does not accept any other
-			 * values.  Such a NULL partition will come last in the
-			 * PartitionDesc, and the other partitions will be properly
-			 * ordered.  This is a cheap test to make as it does not require
-			 * any per-partition processing.  Maybe we'd like to handle more
-			 * complex cases in the future.
+			 * LIST partitioned are ordered providing any interleaved
+			 * partitions are not in live_parts.
 			 */
-			if (partition_bound_has_default(boundinfo))
-				return false;
-
-			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
-				== nparts)
+			if (!bms_overlap(live_parts, boundinfo->interleaved_parts))
 				return true;
-			break;
 
+			break;
 		default:
 			/* HASH, or some other strategy */
 			break;
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index ebf3ff1f49..3d88f709d7 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -61,6 +61,14 @@ struct RelOptInfo;				/* avoid including pathnodes.h here */
  * The indexes array is indexed according to the hash key's remainder modulo
  * the greatest modulus, and it contains either the partition index accepting
  * that remainder, or -1 if there is no partition for that remainder.
+ *
+ * For LIST partitioned tables, we track the partition indexes of partitions
+ * which have "interleaved" values.  This is a partition that has more than
+ * one Datum where another partition exists with a Datum that sorts between
+ * two of the Datums on the other partition.  For example, FOR VALUES IN(3,5)
+ * and another partition exists FOR VALUES IN (4).  The partition for values 3
+ * and 5 will be marked as interleaved and stored in the 'interleaved_parts'
+ * bitmap.
  */
 typedef struct PartitionBoundInfoData
 {
@@ -70,6 +78,8 @@ typedef struct PartitionBoundInfoData
 	PartitionRangeDatumKind **kind; /* The kind of each range bound datum;
 									 * NULL for hash and list partitioned
 									 * tables */
+	Bitmapset  *interleaved_parts;	/* Interleaved partition indexes for LIST
+									 * partitions, NULL for RANGE and HASH */
 	int			nindexes;		/* Length of the indexes[] array */
 	int		   *indexes;		/* Partition indexes */
 	int			null_index;		/* Index of the null-accepting partition; -1
@@ -102,7 +112,8 @@ extern PartitionBoundInfo partition_bounds_merge(int partnatts,
 												 JoinType jointype,
 												 List **outer_parts,
 												 List **inner_parts);
-extern bool partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts);
+extern bool partitions_are_ordered(PartitionBoundInfo boundinfo,
+								   Bitmapset *live_parts);
 extern void check_new_partition_bound(char *relname, Relation parent,
 									  PartitionBoundSpec *spec,
 									  ParseState *pstate);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 06f44287bc..2d49e765de 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2180,6 +2180,8 @@ explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
          Index Cond: (a < 20)
 (9 rows)
 
+set enable_bitmapscan to off;
+set enable_sort to off;
 create table mclparted (a int) partition by list(a);
 create table mclparted1 partition of mclparted for values in(1);
 create table mclparted2 partition of mclparted for values in(2);
@@ -2208,7 +2210,113 @@ explain (costs off) select * from mclparted order by a;
    ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
 (6 rows)
 
+explain (costs off) select * from mclparted where a in(3,4,5) order by a;
+                                 QUERY PLAN                                 
+----------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5 mclparted_1
+         Index Cond: (a = ANY ('{3,4,5}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_2
+         Index Cond: (a = ANY ('{3,4,5}'::integer[]))
+(6 rows)
+
+-- Introduce a NULL and DEFAULT partition so we can test more complex cases
+create table mclparted_null partition of mclparted for values in(null);
+create table mclparted_def partition of mclparted default;
+-- Append can be used providing we don't scan the interleaved partition
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+(7 rows)
+
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted_null_a_idx on mclparted_null mclparted_4
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+(9 rows)
+
+-- Test a more complex case where the NULL partition allows some other value
+drop table mclparted_null;
+create table mclparted_0_null partition of mclparted for values in(0,null);
+-- Ensure MergeAppend is used since 0 and NULLs are in the same partition.
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted_0_null_a_idx on mclparted_0_null mclparted_1
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_2
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_3
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+(10 rows)
+
+explain (costs off) select * from mclparted where a in(0,1,2,4) order by a;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted_0_null_a_idx on mclparted_0_null mclparted_1
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_2
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_3
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+(10 rows)
+
+-- Ensure Append is used when the null partition is pruned
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+(7 rows)
+
+-- Ensure MergeAppend is used when the default partition is not pruned
+explain (costs off) select * from mclparted where a in(1,2,4,100) order by a;
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted_def_a_idx on mclparted_def mclparted_4
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+(10 rows)
+
 drop table mclparted;
+reset enable_sort;
+reset enable_bitmapscan;
 -- Ensure subplans which don't have a path with the correct pathkeys get
 -- sorted correctly.
 drop index mcrparted_a_abs_c_idx;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 64173a8738..195aedb5ff 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -775,6 +775,8 @@ explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
 -- during planning.
 explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
 
+set enable_bitmapscan to off;
+set enable_sort to off;
 create table mclparted (a int) partition by list(a);
 create table mclparted1 partition of mclparted for values in(1);
 create table mclparted2 partition of mclparted for values in(2);
@@ -789,8 +791,33 @@ create table mclparted3_5 partition of mclparted for values in(3,5);
 create table mclparted4 partition of mclparted for values in(4);
 
 explain (costs off) select * from mclparted order by a;
+explain (costs off) select * from mclparted where a in(3,4,5) order by a;
+
+-- Introduce a NULL and DEFAULT partition so we can test more complex cases
+create table mclparted_null partition of mclparted for values in(null);
+create table mclparted_def partition of mclparted default;
+
+-- Append can be used providing we don't scan the interleaved partition
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+
+-- Test a more complex case where the NULL partition allows some other value
+drop table mclparted_null;
+create table mclparted_0_null partition of mclparted for values in(0,null);
+
+-- Ensure MergeAppend is used since 0 and NULLs are in the same partition.
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+explain (costs off) select * from mclparted where a in(0,1,2,4) order by a;
+
+-- Ensure Append is used when the null partition is pruned
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+
+-- Ensure MergeAppend is used when the default partition is not pruned
+explain (costs off) select * from mclparted where a in(1,2,4,100) order by a;
 
 drop table mclparted;
+reset enable_sort;
+reset enable_bitmapscan;
 
 -- Ensure subplans which don't have a path with the correct pathkeys get
 -- sorted correctly.
-- 
2.30.2

v2-0001-Track-non-pruned-partitions-in-RelOptInfo.patchapplication/octet-stream; name=v2-0001-Track-non-pruned-partitions-in-RelOptInfo.patchDownload
From b3f66e03fb5fc66f93b9ed42e0053c834757e612 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sun, 23 May 2021 23:06:38 +1200
Subject: [PATCH v2 1/2] Track non-pruned partitions in RelOptInfo

For partitioned tables with large numbers of partitions where queries are
able to prune all but a very small number of partitions, as is common in
partitioned tables used in OLTP workloads, the time spent in the planner
looping over RelOptInfo.part_rels checking for non-NULL RelOptInfos could
become a large portion of the overall planning time.

Here we add a Bitmapset which records the non-pruned partitions.  This
allows us to more efficiently skip the pruned partitions by looping over
the Bitmapset.

This will very slightly slow down cases where no or not many partitions
could be pruned, however, those cases are already slow to plan anyway and
the overhead of looping over the Bitmapset would be unmeasurable when
compared with the other duties the planner must perform.
---
 src/backend/nodes/outfuncs.c          |  1 +
 src/backend/optimizer/path/joinrels.c |  1 +
 src/backend/optimizer/plan/planner.c  | 31 ++++++++++++++++-----------
 src/backend/optimizer/util/inherit.c  |  2 +-
 src/backend/optimizer/util/relnode.c  |  3 +++
 src/backend/partitioning/partprune.c  |  7 +++---
 src/include/nodes/pathnodes.h         |  2 ++
 7 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e09e4f77fe..8e20bcd6ec 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2386,6 +2386,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_BOOL_FIELD(consider_partitionwise_join);
 	WRITE_BITMAPSET_FIELD(top_parent_relids);
 	WRITE_BOOL_FIELD(partbounds_merged);
+	WRITE_BITMAPSET_FIELD(live_parts);
 	WRITE_BITMAPSET_FIELD(all_partrels);
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..8b69870cf4 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1539,6 +1539,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 												 child_sjinfo,
 												 child_sjinfo->jointype);
 			joinrel->part_rels[cnt_parts] = child_joinrel;
+			joinrel->live_parts = bms_add_member(joinrel->live_parts, cnt_parts);
 			joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
 													child_joinrel->relids);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1868c4eff4..344f5d58cd 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6989,19 +6989,22 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	if (rel_is_partitioned)
 	{
 		List	   *live_children = NIL;
-		int			partition_idx;
+		int			i;
 
 		/* Adjust each partition. */
-		for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+		i = -1;
+		while ((i = bms_next_member(rel->live_parts, i)) >= 0)
 		{
-			RelOptInfo *child_rel = rel->part_rels[partition_idx];
+			RelOptInfo *child_rel = rel->part_rels[i];
 			AppendRelInfo **appinfos;
 			int			nappinfos;
 			List	   *child_scanjoin_targets = NIL;
 			ListCell   *lc;
 
-			/* Pruned or dummy children can be ignored. */
-			if (child_rel == NULL || IS_DUMMY_REL(child_rel))
+			Assert(child_rel != NULL);
+
+			/* Dummy children can be ignored. */
+			if (IS_DUMMY_REL(child_rel))
 				continue;
 
 			/* Translate scan/join targets for this child. */
@@ -7082,32 +7085,36 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 									PartitionwiseAggregateType patype,
 									GroupPathExtraData *extra)
 {
-	int			nparts = input_rel->nparts;
-	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
 	PathTarget *target = grouped_rel->reltarget;
 	bool		partial_grouping_valid = true;
+	int			i;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
 	/* Add paths for partitionwise aggregation/grouping. */
-	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+	i = -1;
+	while ((i = bms_next_member(input_rel->live_parts, i)) >= 0)
 	{
-		RelOptInfo *child_input_rel = input_rel->part_rels[cnt_parts];
-		PathTarget *child_target = copy_pathtarget(target);
+		RelOptInfo *child_input_rel = input_rel->part_rels[i];
+		PathTarget *child_target;
 		AppendRelInfo **appinfos;
 		int			nappinfos;
 		GroupPathExtraData child_extra;
 		RelOptInfo *child_grouped_rel;
 		RelOptInfo *child_partially_grouped_rel;
 
-		/* Pruned or dummy children can be ignored. */
-		if (child_input_rel == NULL || IS_DUMMY_REL(child_input_rel))
+		Assert(child_input_rel != NULL);
+
+		/* Dummy children can be ignored. */
+		if (IS_DUMMY_REL(child_input_rel))
 			continue;
 
+		child_target = copy_pathtarget(target);
+
 		/*
 		 * Copy the given "extra" structure as is and then override the
 		 * members specific to this child.
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 992ef87b9d..c758172efa 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -348,7 +348,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
 	 * that survive pruning.  Below, we will initialize child objects for the
 	 * surviving partitions.
 	 */
-	live_parts = prune_append_rel_partitions(relinfo);
+	relinfo->live_parts = live_parts = prune_append_rel_partitions(relinfo);
 
 	/* Expand simple_rel_array and friends to hold child objects. */
 	num_live_parts = bms_num_members(live_parts);
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..47769cea45 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -255,6 +255,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
 	rel->partbounds_merged = false;
 	rel->partition_qual = NIL;
 	rel->part_rels = NULL;
+	rel->live_parts = NULL;
 	rel->all_partrels = NULL;
 	rel->partexprs = NULL;
 	rel->nullable_partexprs = NULL;
@@ -669,6 +670,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
@@ -847,6 +849,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index c79374265c..e00edbe5c8 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -654,15 +654,14 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		relid_map = (Oid *) palloc0(nparts * sizeof(Oid));
 		present_parts = NULL;
 
-		for (i = 0; i < nparts; i++)
+		i = -1;
+		while ((i = bms_next_member(subpart->live_parts, i)) >= 0)
 		{
 			RelOptInfo *partrel = subpart->part_rels[i];
 			int			subplanidx;
 			int			subpartidx;
 
-			/* Skip processing pruned partitions. */
-			if (partrel == NULL)
-				continue;
+			Assert(partrel != NULL);
 
 			subplan_map[i] = subplanidx = relid_subplan_map[partrel->relid] - 1;
 			subpart_map[i] = subpartidx = relid_subpart_map[partrel->relid] - 1;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index bebf774818..553108ddbc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -762,6 +762,8 @@ typedef struct RelOptInfo
 	List	   *partition_qual; /* Partition constraint, if not the root */
 	struct RelOptInfo **part_rels;	/* Array of RelOptInfos of partitions,
 									 * stored in the same order as bounds */
+	Bitmapset  *live_parts;		/* Bitmap with members to indicate which
+								 * partitions survived partition pruning. */
 	Relids		all_partrels;	/* Relids set of all partition relids */
 	List	  **partexprs;		/* Non-nullable partition key expressions */
 	List	  **nullable_partexprs; /* Nullable partition key expressions */
-- 
2.30.2

#4David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#3)
2 attachment(s)
Re: Record a Bitmapset of non-pruned partitions

On Sat, 10 Jul 2021 at 03:24, David Rowley <dgrowleyml@gmail.com> wrote:

The good news is that the code in partitions_are_ordered() became even
more simple as a result of this change. We can do ordered scan simply
when !bms_overlap(live_parts, boundinfo->interleaved_parts).

I've spent a bit more time revising the 0002 patch so that we're a bit
more strict about when we mark a partition as interleaved. For
example, if the DEFAULT partition happens to be the only partition,
then I'm no longer classing that as interleaved as there's nothing for
it to be interleaved with.

This also fixes up the not-so-robust check that I had to check if the
NULL partition allowed other Datums.

David

Attachments:

v3-0002-Allow-ordered-partition-scans-in-more-cases.patchapplication/octet-stream; name=v3-0002-Allow-ordered-partition-scans-in-more-cases.patchDownload
From ad9a4fe2d9bdef83a7669101a8b2e14b8206d5fd Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 10 Jul 2021 00:55:00 +1200
Subject: [PATCH v3 2/3] Allow ordered partition scans in more cases

Previously we couldn't allow ordered partition scans for LIST partitioned
table when there were any partitions that allowed multiple Datums.  This
was a very cheap check to make and we could likely have done a little
better by checking if there were interleaved partitions, but at the time
we didn't have visibility about which partitions were pruned, so we still
may have disallowed cases when the interleaved partitions were pruned.

Since we now have knowledge of pruned partitions, we can do a much better
job inside partitions_are_ordered().

Here we pass which partitioned survived partition pruning into
partitions_are_ordered() and, for LIST partitioning, have it check to see
if any live partitions exist in the new "interleaved_parts" field defined
in PartitionBoundInfo.

For RANGE partitioning we can relax the code which caused the partitions
to be unordered if a DEFAULT partition existed.  Since we now know which
partitions were pruned, partitions_are_ordered() now returns true when the
DEFAULT partition was pruned.
---
 src/backend/optimizer/path/allpaths.c |   2 +-
 src/backend/optimizer/path/pathkeys.c |   2 +-
 src/backend/partitioning/partbounds.c | 105 +++++++++++++++++++------
 src/include/partitioning/partbounds.h |  18 ++++-
 src/test/regress/expected/inherit.out | 108 ++++++++++++++++++++++++++
 src/test/regress/sql/inherit.sql      |  27 +++++++
 6 files changed, 235 insertions(+), 27 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 17febfff8a..9d62d1cd65 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1689,7 +1689,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	 * for both forward and reverse scans.
 	 */
 	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
-		partitions_are_ordered(rel->boundinfo, rel->nparts))
+		partitions_are_ordered(rel->boundinfo, rel->live_parts))
 	{
 		partition_pathkeys = build_partition_pathkeys(root, rel,
 													  ForwardScanDirection,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index bd9a176d7d..216dd26385 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -704,7 +704,7 @@ build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
 	int			i;
 
 	Assert(partscheme != NULL);
-	Assert(partitions_are_ordered(partrel->boundinfo, partrel->nparts));
+	Assert(partitions_are_ordered(partrel->boundinfo, partrel->live_parts));
 	/* For now, we can only cope with baserels */
 	Assert(IS_SIMPLE_REL(partrel));
 
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 38baaefcda..42ed3bc6d0 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -396,6 +396,7 @@ create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->ndatums = nparts;
 	boundinfo->datums = (Datum **) palloc0(nparts * sizeof(Datum *));
 	boundinfo->kind = NULL;
+	boundinfo->interleaved_parts = NULL;
 	boundinfo->nindexes = greatest_modulus;
 	boundinfo->indexes = (int *) palloc(greatest_modulus * sizeof(int));
 	for (i = 0; i < greatest_modulus; i++)
@@ -544,6 +545,7 @@ create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->ndatums = ndatums;
 	boundinfo->datums = (Datum **) palloc0(ndatums * sizeof(Datum *));
 	boundinfo->kind = NULL;
+	boundinfo->interleaved_parts = NULL;
 	boundinfo->nindexes = ndatums;
 	boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
 
@@ -608,6 +610,69 @@ create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
 		boundinfo->default_index = (*mapping)[default_index];
 	}
 
+	/*
+	 * Calculate interleaved partitions.  Here we look for partitions which
+	 * might be interleaved with other partitions and set the
+	 * interleaved_parts field with the partition indexes of any partitions
+	 * which may be interleaved with another partition.
+	 */
+
+	/*
+	 * There must be multiple partitions to have any interleaved partitions,
+	 * otherwise there's nothing to interleave with.
+	 */
+	if (nparts > 1)
+	{
+		/*
+		 * Short-circuit check to see if only 1 Datum is allowed per
+		 * partition.  When this is true there's no need to do the more
+		 * expensive checks to look for interleaved values.
+		 */
+		if (boundinfo->ndatums +
+			partition_bound_accepts_nulls(boundinfo) +
+			partition_bound_has_default(boundinfo) != nparts)
+		{
+			int			last_index = -1;
+
+			/*
+			 * Since the indexes array is sorted in Datum order, if any
+			 * partitions are interleaved then it will show up by the
+			 * partition indexes not being in ascending order.  Here we check
+			 * for that and record all partitions that are out of order.
+			 */
+			for (i = 0; i < boundinfo->nindexes; i++)
+			{
+				int			index = boundinfo->indexes[i];
+
+				if (index < last_index)
+					boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+																  index);
+
+				/*
+				 * Mark the NULL partition as interleaved if we find that it
+				 * allows some other non-NULL Datum.
+				 */
+				if (partition_bound_accepts_nulls(boundinfo) &&
+					index == boundinfo->null_index)
+					boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+																  boundinfo->null_index);
+
+				last_index = index;
+			}
+		}
+
+		/*
+		 * The DEFAULT partition is the "catch all" partition that can contain
+		 * anything that does not belong to any other partition.  If there are
+		 * any other partitions then the DEFAULT partition must be marked as
+		 * interleaved.
+		 */
+		if (partition_bound_has_default(boundinfo))
+			boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+														  boundinfo->default_index);
+	}
+
+
 	/* All partitions must now have been assigned canonical indexes. */
 	Assert(next_index == nparts);
 	return boundinfo;
@@ -751,6 +816,7 @@ create_range_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->kind = (PartitionRangeDatumKind **)
 		palloc(ndatums *
 			   sizeof(PartitionRangeDatumKind *));
+	boundinfo->interleaved_parts = NULL;
 
 	/*
 	 * For range partitioning, an additional value of -1 is stored as the last
@@ -994,6 +1060,9 @@ partition_bounds_copy(PartitionBoundInfo src,
 	else
 		dest->kind = NULL;
 
+	/* copy interleaved partitions for LIST partitioned tables */
+	dest->interleaved_parts = bms_copy(src->interleaved_parts);
+
 	/*
 	 * For hash partitioning, datums array will have two elements - modulus
 	 * and remainder.
@@ -2781,13 +2850,15 @@ add_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
  *		that is partitions appearing earlier in the PartitionDesc sequence
  *		contain partition keys strictly less than those appearing later.
  *		Also, if NULL values are possible, they must come in the last
- *		partition defined in the PartitionDesc.
+ *		partition defined in the PartitionDesc.  'live_parts' marks which
+ *		partitions we should include when checking the ordering.  Partitions
+ *		that do not appear in 'live_parts' are ignored.
  *
  * If out of order, or there is insufficient info to know the order,
  * then we return false.
  */
 bool
-partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts)
+partitions_are_ordered(PartitionBoundInfo boundinfo, Bitmapset *live_parts)
 {
 	Assert(boundinfo != NULL);
 
@@ -2799,38 +2870,24 @@ partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts)
 			 * RANGE-type partitioning guarantees that the partitions can be
 			 * scanned in the order that they're defined in the PartitionDesc
 			 * to provide sequential, non-overlapping ranges of tuples.
-			 * However, if a DEFAULT partition exists then it doesn't work, as
-			 * that could contain tuples from either below or above the
-			 * defined range, or tuples belonging to gaps between partitions.
+			 * However, if a DEFAULT partition exists and it's contained
+			 * within live_parts, then the partitions are not ordered.
 			 */
-			if (!partition_bound_has_default(boundinfo))
+			if (!partition_bound_has_default(boundinfo) ||
+				!bms_is_member(boundinfo->default_index, live_parts))
 				return true;
 			break;
 
 		case PARTITION_STRATEGY_LIST:
 
 			/*
-			 * LIST partitioning can also guarantee ordering, but only if the
-			 * partitions don't accept interleaved values.  We could likely
-			 * check for this by looping over the PartitionBound's indexes
-			 * array to check that the indexes are in order.  For now, let's
-			 * just keep it simple and just accept LIST partitioning when
-			 * there's no DEFAULT partition, exactly one value per partition,
-			 * and optionally a NULL partition that does not accept any other
-			 * values.  Such a NULL partition will come last in the
-			 * PartitionDesc, and the other partitions will be properly
-			 * ordered.  This is a cheap test to make as it does not require
-			 * any per-partition processing.  Maybe we'd like to handle more
-			 * complex cases in the future.
+			 * LIST partitioned are ordered providing none of live_parts
+			 * overlap with the partitioned table's interleaved partitions.
 			 */
-			if (partition_bound_has_default(boundinfo))
-				return false;
-
-			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
-				== nparts)
+			if (!bms_overlap(live_parts, boundinfo->interleaved_parts))
 				return true;
-			break;
 
+			break;
 		default:
 			/* HASH, or some other strategy */
 			break;
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index ebf3ff1f49..19e8cfd0a9 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -61,6 +61,18 @@ struct RelOptInfo;				/* avoid including pathnodes.h here */
  * The indexes array is indexed according to the hash key's remainder modulo
  * the greatest modulus, and it contains either the partition index accepting
  * that remainder, or -1 if there is no partition for that remainder.
+ *
+ * For LIST partitioned tables, we track the partition indexes of partitions
+ * which are possibly "interleaved" partitions.  The definition of interleaved
+ * is any partition that can contain multiple different values where exists at
+ * least one other partition which could contain a value which is between the
+ * multiple possible values in the other partition.  For example, if a
+ * partition exists FOR VALUES IN(3,5) and another partition exists FOR VALUES
+ * IN (4), then the IN(3,5) partition is an interleaved partition.  The same
+ * is possible with DEFAULT partitions since they can contain any value that
+ * does not belong in another partition.  This field only serves as proof that
+ * a partition is not interleaved, not proof that it is interleaved.  When
+ * we're uncertain, we marked the partition as interleaved.
  */
 typedef struct PartitionBoundInfoData
 {
@@ -70,6 +82,9 @@ typedef struct PartitionBoundInfoData
 	PartitionRangeDatumKind **kind; /* The kind of each range bound datum;
 									 * NULL for hash and list partitioned
 									 * tables */
+	Bitmapset  *interleaved_parts;	/* Partition indexes of partitions which
+									 * may be interleaved. See above. This is
+									 * only set for LIST partitioned tables */
 	int			nindexes;		/* Length of the indexes[] array */
 	int		   *indexes;		/* Partition indexes */
 	int			null_index;		/* Index of the null-accepting partition; -1
@@ -102,7 +117,8 @@ extern PartitionBoundInfo partition_bounds_merge(int partnatts,
 												 JoinType jointype,
 												 List **outer_parts,
 												 List **inner_parts);
-extern bool partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts);
+extern bool partitions_are_ordered(PartitionBoundInfo boundinfo,
+								   Bitmapset *live_parts);
 extern void check_new_partition_bound(char *relname, Relation parent,
 									  PartitionBoundSpec *spec,
 									  ParseState *pstate);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 06f44287bc..2d49e765de 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2180,6 +2180,8 @@ explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
          Index Cond: (a < 20)
 (9 rows)
 
+set enable_bitmapscan to off;
+set enable_sort to off;
 create table mclparted (a int) partition by list(a);
 create table mclparted1 partition of mclparted for values in(1);
 create table mclparted2 partition of mclparted for values in(2);
@@ -2208,7 +2210,113 @@ explain (costs off) select * from mclparted order by a;
    ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
 (6 rows)
 
+explain (costs off) select * from mclparted where a in(3,4,5) order by a;
+                                 QUERY PLAN                                 
+----------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5 mclparted_1
+         Index Cond: (a = ANY ('{3,4,5}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_2
+         Index Cond: (a = ANY ('{3,4,5}'::integer[]))
+(6 rows)
+
+-- Introduce a NULL and DEFAULT partition so we can test more complex cases
+create table mclparted_null partition of mclparted for values in(null);
+create table mclparted_def partition of mclparted default;
+-- Append can be used providing we don't scan the interleaved partition
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+(7 rows)
+
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted_null_a_idx on mclparted_null mclparted_4
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+(9 rows)
+
+-- Test a more complex case where the NULL partition allows some other value
+drop table mclparted_null;
+create table mclparted_0_null partition of mclparted for values in(0,null);
+-- Ensure MergeAppend is used since 0 and NULLs are in the same partition.
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted_0_null_a_idx on mclparted_0_null mclparted_1
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_2
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_3
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+(10 rows)
+
+explain (costs off) select * from mclparted where a in(0,1,2,4) order by a;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted_0_null_a_idx on mclparted_0_null mclparted_1
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_2
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_3
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+(10 rows)
+
+-- Ensure Append is used when the null partition is pruned
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+(7 rows)
+
+-- Ensure MergeAppend is used when the default partition is not pruned
+explain (costs off) select * from mclparted where a in(1,2,4,100) order by a;
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted_def_a_idx on mclparted_def mclparted_4
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+(10 rows)
+
 drop table mclparted;
+reset enable_sort;
+reset enable_bitmapscan;
 -- Ensure subplans which don't have a path with the correct pathkeys get
 -- sorted correctly.
 drop index mcrparted_a_abs_c_idx;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 64173a8738..195aedb5ff 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -775,6 +775,8 @@ explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
 -- during planning.
 explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
 
+set enable_bitmapscan to off;
+set enable_sort to off;
 create table mclparted (a int) partition by list(a);
 create table mclparted1 partition of mclparted for values in(1);
 create table mclparted2 partition of mclparted for values in(2);
@@ -789,8 +791,33 @@ create table mclparted3_5 partition of mclparted for values in(3,5);
 create table mclparted4 partition of mclparted for values in(4);
 
 explain (costs off) select * from mclparted order by a;
+explain (costs off) select * from mclparted where a in(3,4,5) order by a;
+
+-- Introduce a NULL and DEFAULT partition so we can test more complex cases
+create table mclparted_null partition of mclparted for values in(null);
+create table mclparted_def partition of mclparted default;
+
+-- Append can be used providing we don't scan the interleaved partition
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+
+-- Test a more complex case where the NULL partition allows some other value
+drop table mclparted_null;
+create table mclparted_0_null partition of mclparted for values in(0,null);
+
+-- Ensure MergeAppend is used since 0 and NULLs are in the same partition.
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+explain (costs off) select * from mclparted where a in(0,1,2,4) order by a;
+
+-- Ensure Append is used when the null partition is pruned
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+
+-- Ensure MergeAppend is used when the default partition is not pruned
+explain (costs off) select * from mclparted where a in(1,2,4,100) order by a;
 
 drop table mclparted;
+reset enable_sort;
+reset enable_bitmapscan;
 
 -- Ensure subplans which don't have a path with the correct pathkeys get
 -- sorted correctly.
-- 
2.30.2

v3-0001-Track-non-pruned-partitions-in-RelOptInfo.patchapplication/octet-stream; name=v3-0001-Track-non-pruned-partitions-in-RelOptInfo.patchDownload
From 6b2cfdfc2ea4ee5da3991acdf410c20d9b509e9c Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sun, 23 May 2021 23:06:38 +1200
Subject: [PATCH v3 1/3] Track non-pruned partitions in RelOptInfo

For partitioned tables with large numbers of partitions where queries are
able to prune all but a very small number of partitions, as is common in
partitioned tables used in OLTP workloads, the time spent in the planner
looping over RelOptInfo.part_rels checking for non-NULL RelOptInfos could
become a large portion of the overall planning time.

Here we add a Bitmapset which records the non-pruned partitions.  This
allows us to more efficiently skip the pruned partitions by looping over
the Bitmapset.

This will very slightly slow down cases where no or not many partitions
could be pruned, however, those cases are already slow to plan anyway and
the overhead of looping over the Bitmapset would be unmeasurable when
compared with the other duties the planner must perform.
---
 src/backend/nodes/outfuncs.c          |  1 +
 src/backend/optimizer/path/joinrels.c |  1 +
 src/backend/optimizer/plan/planner.c  | 31 ++++++++++++++++-----------
 src/backend/optimizer/util/inherit.c  |  2 +-
 src/backend/optimizer/util/relnode.c  |  3 +++
 src/backend/partitioning/partprune.c  |  7 +++---
 src/include/nodes/pathnodes.h         |  2 ++
 7 files changed, 30 insertions(+), 17 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index e09e4f77fe..8e20bcd6ec 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2386,6 +2386,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_BOOL_FIELD(consider_partitionwise_join);
 	WRITE_BITMAPSET_FIELD(top_parent_relids);
 	WRITE_BOOL_FIELD(partbounds_merged);
+	WRITE_BITMAPSET_FIELD(live_parts);
 	WRITE_BITMAPSET_FIELD(all_partrels);
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..8b69870cf4 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1539,6 +1539,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 												 child_sjinfo,
 												 child_sjinfo->jointype);
 			joinrel->part_rels[cnt_parts] = child_joinrel;
+			joinrel->live_parts = bms_add_member(joinrel->live_parts, cnt_parts);
 			joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
 													child_joinrel->relids);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 1868c4eff4..344f5d58cd 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6989,19 +6989,22 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	if (rel_is_partitioned)
 	{
 		List	   *live_children = NIL;
-		int			partition_idx;
+		int			i;
 
 		/* Adjust each partition. */
-		for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+		i = -1;
+		while ((i = bms_next_member(rel->live_parts, i)) >= 0)
 		{
-			RelOptInfo *child_rel = rel->part_rels[partition_idx];
+			RelOptInfo *child_rel = rel->part_rels[i];
 			AppendRelInfo **appinfos;
 			int			nappinfos;
 			List	   *child_scanjoin_targets = NIL;
 			ListCell   *lc;
 
-			/* Pruned or dummy children can be ignored. */
-			if (child_rel == NULL || IS_DUMMY_REL(child_rel))
+			Assert(child_rel != NULL);
+
+			/* Dummy children can be ignored. */
+			if (IS_DUMMY_REL(child_rel))
 				continue;
 
 			/* Translate scan/join targets for this child. */
@@ -7082,32 +7085,36 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 									PartitionwiseAggregateType patype,
 									GroupPathExtraData *extra)
 {
-	int			nparts = input_rel->nparts;
-	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
 	PathTarget *target = grouped_rel->reltarget;
 	bool		partial_grouping_valid = true;
+	int			i;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
 	/* Add paths for partitionwise aggregation/grouping. */
-	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+	i = -1;
+	while ((i = bms_next_member(input_rel->live_parts, i)) >= 0)
 	{
-		RelOptInfo *child_input_rel = input_rel->part_rels[cnt_parts];
-		PathTarget *child_target = copy_pathtarget(target);
+		RelOptInfo *child_input_rel = input_rel->part_rels[i];
+		PathTarget *child_target;
 		AppendRelInfo **appinfos;
 		int			nappinfos;
 		GroupPathExtraData child_extra;
 		RelOptInfo *child_grouped_rel;
 		RelOptInfo *child_partially_grouped_rel;
 
-		/* Pruned or dummy children can be ignored. */
-		if (child_input_rel == NULL || IS_DUMMY_REL(child_input_rel))
+		Assert(child_input_rel != NULL);
+
+		/* Dummy children can be ignored. */
+		if (IS_DUMMY_REL(child_input_rel))
 			continue;
 
+		child_target = copy_pathtarget(target);
+
 		/*
 		 * Copy the given "extra" structure as is and then override the
 		 * members specific to this child.
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 992ef87b9d..c758172efa 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -348,7 +348,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
 	 * that survive pruning.  Below, we will initialize child objects for the
 	 * surviving partitions.
 	 */
-	live_parts = prune_append_rel_partitions(relinfo);
+	relinfo->live_parts = live_parts = prune_append_rel_partitions(relinfo);
 
 	/* Expand simple_rel_array and friends to hold child objects. */
 	num_live_parts = bms_num_members(live_parts);
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..47769cea45 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -255,6 +255,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
 	rel->partbounds_merged = false;
 	rel->partition_qual = NIL;
 	rel->part_rels = NULL;
+	rel->live_parts = NULL;
 	rel->all_partrels = NULL;
 	rel->partexprs = NULL;
 	rel->nullable_partexprs = NULL;
@@ -669,6 +670,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
@@ -847,6 +849,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index c79374265c..e00edbe5c8 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -654,15 +654,14 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		relid_map = (Oid *) palloc0(nparts * sizeof(Oid));
 		present_parts = NULL;
 
-		for (i = 0; i < nparts; i++)
+		i = -1;
+		while ((i = bms_next_member(subpart->live_parts, i)) >= 0)
 		{
 			RelOptInfo *partrel = subpart->part_rels[i];
 			int			subplanidx;
 			int			subpartidx;
 
-			/* Skip processing pruned partitions. */
-			if (partrel == NULL)
-				continue;
+			Assert(partrel != NULL);
 
 			subplan_map[i] = subplanidx = relid_subplan_map[partrel->relid] - 1;
 			subpart_map[i] = subpartidx = relid_subpart_map[partrel->relid] - 1;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index bebf774818..553108ddbc 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -762,6 +762,8 @@ typedef struct RelOptInfo
 	List	   *partition_qual; /* Partition constraint, if not the root */
 	struct RelOptInfo **part_rels;	/* Array of RelOptInfos of partitions,
 									 * stored in the same order as bounds */
+	Bitmapset  *live_parts;		/* Bitmap with members to indicate which
+								 * partitions survived partition pruning. */
 	Relids		all_partrels;	/* Relids set of all partition relids */
 	List	  **partexprs;		/* Non-nullable partition key expressions */
 	List	  **nullable_partexprs; /* Nullable partition key expressions */
-- 
2.30.2

#5Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#4)
Re: Record a Bitmapset of non-pruned partitions

On Mon, Jul 12, 2021 at 11:47 AM David Rowley <dgrowleyml@gmail.com> wrote:

v3 patches

0001 looks mostly fine, except I thought the following could be worded
to say that the bitmap members are offsets into the part_rels array.
To avoid someone confusing them with RT indexes, for example.

+   Bitmapset  *live_parts;     /* Bitmap with members to indicate which
+                                * partitions survived partition pruning. */

On 0002:

interleaved_parts idea looks clever. I wonder if you decided that
it's maybe not worth setting that field in the joinrel's
PartitionBoundInfos? For example, adding the code that you added in
create_list_bounds() also in merge_list_bounds().

...  The definition of interleaved
+ * is any partition that can contain multiple different values where exists at
+ * least one other partition which could contain a value which is between the
+ * multiple possible values in the other partition.

The sentence sounds a bit off starting at "...where exists". How about:

"A partition is considered interleaved if it contains multiple values
such that there exists at least one other partition containing a value
that lies between those values [ in terms of partitioning-defined
ordering ]."

Looks fine otherwise.

--
Amit Langote
EDB: http://www.enterprisedb.com

#6David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#5)
2 attachment(s)
Re: Record a Bitmapset of non-pruned partitions

On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote:

0001 looks mostly fine, except I thought the following could be worded
to say that the bitmap members are offsets into the part_rels array.
To avoid someone confusing them with RT indexes, for example.

+   Bitmapset  *live_parts;     /* Bitmap with members to indicate which
+                                * partitions survived partition pruning. */

Yeah, agreed. I've adjusted that.

On 0002:

interleaved_parts idea looks clever. I wonder if you decided that
it's maybe not worth setting that field in the joinrel's
PartitionBoundInfos? For example, adding the code that you added in
create_list_bounds() also in merge_list_bounds().

Currently generate_orderedappend_paths() only checks
partitions_are_ordered() for base and other member rels, so setting
the field for join rels would be a waste of effort given that it's not
used for anything.

I've not really looked into the possibility of enabling this
optimization for partition-wise joined rels. I know that there's a bit
more complexity now due to c8434d64c. I'm not really all that clear on
which cases could be allowed here and which couldn't. It would require
more analysis and I'd say that's a different patch.

...  The definition of interleaved
+ * is any partition that can contain multiple different values where exists at
+ * least one other partition which could contain a value which is between the
+ * multiple possible values in the other partition.

The sentence sounds a bit off starting at "...where exists". How about:

I must have spent too long writing SQL queries.

"A partition is considered interleaved if it contains multiple values
such that there exists at least one other partition containing a value
that lies between those values [ in terms of partitioning-defined
ordering ]."

That looks better. I took that with some small adjustments.

Looks fine otherwise.

Thanks for the review.

I had another self review of these and I'm pretty happy with them. I'm
quite glad to see the performance of querying a single partition of a
table with large numbers of partitions no longer tails off as much as
it used to.

David

Attachments:

v4-0001-Track-non-pruned-partitions-in-RelOptInfo.patchapplication/octet-stream; name=v4-0001-Track-non-pruned-partitions-in-RelOptInfo.patchDownload
From 94b6631986fad7d953f7ae9c14eabe077fe2be7d Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sun, 23 May 2021 23:06:38 +1200
Subject: [PATCH v4 1/2] Track non-pruned partitions in RelOptInfo

For partitioned tables with large numbers of partitions where queries are
able to prune all but a very small number of partitions, as is common in
partitioned tables used in OLTP workloads, the time spent in the planner
looping over RelOptInfo.part_rels checking for non-NULL RelOptInfos could
become a large portion of the overall planning time.

Here we add a Bitmapset which records the non-pruned partitions.  This
allows us to more efficiently skip the pruned partitions by looping over
the Bitmapset.

This will very slightly slow down cases where no or not many partitions
could be pruned, however, those cases are already slow to plan anyway and
the overhead of looping over the Bitmapset would be unmeasurable when
compared with the other duties the planner must perform.
---
 src/backend/nodes/outfuncs.c          |  1 +
 src/backend/optimizer/path/joinrels.c |  1 +
 src/backend/optimizer/plan/planner.c  | 31 ++++++++++++++++-----------
 src/backend/optimizer/util/inherit.c  |  2 +-
 src/backend/optimizer/util/relnode.c  |  3 +++
 src/backend/partitioning/partprune.c  |  7 +++---
 src/include/nodes/pathnodes.h         |  3 +++
 7 files changed, 31 insertions(+), 17 deletions(-)

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 48202d2232..87561cbb6f 100644
--- a/src/backend/nodes/outfuncs.c
+++ b/src/backend/nodes/outfuncs.c
@@ -2386,6 +2386,7 @@ _outRelOptInfo(StringInfo str, const RelOptInfo *node)
 	WRITE_BOOL_FIELD(consider_partitionwise_join);
 	WRITE_BITMAPSET_FIELD(top_parent_relids);
 	WRITE_BOOL_FIELD(partbounds_merged);
+	WRITE_BITMAPSET_FIELD(live_parts);
 	WRITE_BITMAPSET_FIELD(all_partrels);
 }
 
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 0dbe2ac726..8b69870cf4 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1539,6 +1539,7 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 												 child_sjinfo,
 												 child_sjinfo->jointype);
 			joinrel->part_rels[cnt_parts] = child_joinrel;
+			joinrel->live_parts = bms_add_member(joinrel->live_parts, cnt_parts);
 			joinrel->all_partrels = bms_add_members(joinrel->all_partrels,
 													child_joinrel->relids);
 		}
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 86816ffe19..2cd691191c 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -6989,19 +6989,22 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
 	if (rel_is_partitioned)
 	{
 		List	   *live_children = NIL;
-		int			partition_idx;
+		int			i;
 
 		/* Adjust each partition. */
-		for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+		i = -1;
+		while ((i = bms_next_member(rel->live_parts, i)) >= 0)
 		{
-			RelOptInfo *child_rel = rel->part_rels[partition_idx];
+			RelOptInfo *child_rel = rel->part_rels[i];
 			AppendRelInfo **appinfos;
 			int			nappinfos;
 			List	   *child_scanjoin_targets = NIL;
 			ListCell   *lc;
 
-			/* Pruned or dummy children can be ignored. */
-			if (child_rel == NULL || IS_DUMMY_REL(child_rel))
+			Assert(child_rel != NULL);
+
+			/* Dummy children can be ignored. */
+			if (IS_DUMMY_REL(child_rel))
 				continue;
 
 			/* Translate scan/join targets for this child. */
@@ -7082,32 +7085,36 @@ create_partitionwise_grouping_paths(PlannerInfo *root,
 									PartitionwiseAggregateType patype,
 									GroupPathExtraData *extra)
 {
-	int			nparts = input_rel->nparts;
-	int			cnt_parts;
 	List	   *grouped_live_children = NIL;
 	List	   *partially_grouped_live_children = NIL;
 	PathTarget *target = grouped_rel->reltarget;
 	bool		partial_grouping_valid = true;
+	int			i;
 
 	Assert(patype != PARTITIONWISE_AGGREGATE_NONE);
 	Assert(patype != PARTITIONWISE_AGGREGATE_PARTIAL ||
 		   partially_grouped_rel != NULL);
 
 	/* Add paths for partitionwise aggregation/grouping. */
-	for (cnt_parts = 0; cnt_parts < nparts; cnt_parts++)
+	i = -1;
+	while ((i = bms_next_member(input_rel->live_parts, i)) >= 0)
 	{
-		RelOptInfo *child_input_rel = input_rel->part_rels[cnt_parts];
-		PathTarget *child_target = copy_pathtarget(target);
+		RelOptInfo *child_input_rel = input_rel->part_rels[i];
+		PathTarget *child_target;
 		AppendRelInfo **appinfos;
 		int			nappinfos;
 		GroupPathExtraData child_extra;
 		RelOptInfo *child_grouped_rel;
 		RelOptInfo *child_partially_grouped_rel;
 
-		/* Pruned or dummy children can be ignored. */
-		if (child_input_rel == NULL || IS_DUMMY_REL(child_input_rel))
+		Assert(child_input_rel != NULL);
+
+		/* Dummy children can be ignored. */
+		if (IS_DUMMY_REL(child_input_rel))
 			continue;
 
+		child_target = copy_pathtarget(target);
+
 		/*
 		 * Copy the given "extra" structure as is and then override the
 		 * members specific to this child.
diff --git a/src/backend/optimizer/util/inherit.c b/src/backend/optimizer/util/inherit.c
index 992ef87b9d..c758172efa 100644
--- a/src/backend/optimizer/util/inherit.c
+++ b/src/backend/optimizer/util/inherit.c
@@ -348,7 +348,7 @@ expand_partitioned_rtentry(PlannerInfo *root, RelOptInfo *relinfo,
 	 * that survive pruning.  Below, we will initialize child objects for the
 	 * surviving partitions.
 	 */
-	live_parts = prune_append_rel_partitions(relinfo);
+	relinfo->live_parts = live_parts = prune_append_rel_partitions(relinfo);
 
 	/* Expand simple_rel_array and friends to hold child objects. */
 	num_live_parts = bms_num_members(live_parts);
diff --git a/src/backend/optimizer/util/relnode.c b/src/backend/optimizer/util/relnode.c
index e105a4d5f1..47769cea45 100644
--- a/src/backend/optimizer/util/relnode.c
+++ b/src/backend/optimizer/util/relnode.c
@@ -255,6 +255,7 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo *parent)
 	rel->partbounds_merged = false;
 	rel->partition_qual = NIL;
 	rel->part_rels = NULL;
+	rel->live_parts = NULL;
 	rel->all_partrels = NULL;
 	rel->partexprs = NULL;
 	rel->nullable_partexprs = NULL;
@@ -669,6 +670,7 @@ build_join_rel(PlannerInfo *root,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
@@ -847,6 +849,7 @@ build_child_join_rel(PlannerInfo *root, RelOptInfo *outer_rel,
 	joinrel->partbounds_merged = false;
 	joinrel->partition_qual = NIL;
 	joinrel->part_rels = NULL;
+	joinrel->live_parts = NULL;
 	joinrel->all_partrels = NULL;
 	joinrel->partexprs = NULL;
 	joinrel->nullable_partexprs = NULL;
diff --git a/src/backend/partitioning/partprune.c b/src/backend/partitioning/partprune.c
index c79374265c..e00edbe5c8 100644
--- a/src/backend/partitioning/partprune.c
+++ b/src/backend/partitioning/partprune.c
@@ -654,15 +654,14 @@ make_partitionedrel_pruneinfo(PlannerInfo *root, RelOptInfo *parentrel,
 		relid_map = (Oid *) palloc0(nparts * sizeof(Oid));
 		present_parts = NULL;
 
-		for (i = 0; i < nparts; i++)
+		i = -1;
+		while ((i = bms_next_member(subpart->live_parts, i)) >= 0)
 		{
 			RelOptInfo *partrel = subpart->part_rels[i];
 			int			subplanidx;
 			int			subpartidx;
 
-			/* Skip processing pruned partitions. */
-			if (partrel == NULL)
-				continue;
+			Assert(partrel != NULL);
 
 			subplan_map[i] = subplanidx = relid_subplan_map[partrel->relid] - 1;
 			subpart_map[i] = subpartidx = relid_subpart_map[partrel->relid] - 1;
diff --git a/src/include/nodes/pathnodes.h b/src/include/nodes/pathnodes.h
index e20c245f98..ce7908c326 100644
--- a/src/include/nodes/pathnodes.h
+++ b/src/include/nodes/pathnodes.h
@@ -762,6 +762,9 @@ typedef struct RelOptInfo
 	List	   *partition_qual; /* Partition constraint, if not the root */
 	struct RelOptInfo **part_rels;	/* Array of RelOptInfos of partitions,
 									 * stored in the same order as bounds */
+	Bitmapset  *live_parts;		/* Bitmap with members acting as indexes into
+								 * the part_rels[] array to indicate which
+								 * partitions survived partition pruning. */
 	Relids		all_partrels;	/* Relids set of all partition relids */
 	List	  **partexprs;		/* Non-nullable partition key expressions */
 	List	  **nullable_partexprs; /* Nullable partition key expressions */
-- 
2.30.2

v4-0002-Allow-ordered-partition-scans-in-more-cases.patchapplication/octet-stream; name=v4-0002-Allow-ordered-partition-scans-in-more-cases.patchDownload
From 9d6d654bb6838fe2effb2ddaeef5b3f79ecb346d Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Sat, 10 Jul 2021 00:55:00 +1200
Subject: [PATCH v4 2/2] Allow ordered partition scans in more cases

Previously we couldn't allow ordered partition scans for LIST partitioned
table when there were any partitions that allowed multiple Datums.  This
was a very cheap check to make and we could likely have done a little
better by checking if there were interleaved partitions, but at the time
we didn't have visibility about which partitions were pruned, so we still
may have disallowed cases when the interleaved partitions were pruned.

Since we now have knowledge of pruned partitions, we can do a much better
job inside partitions_are_ordered().

Here we pass which partitioned survived partition pruning into
partitions_are_ordered() and, for LIST partitioning, have it check to see
if any live partitions exist in the new "interleaved_parts" field defined
in PartitionBoundInfo.

For RANGE partitioning we can relax the code which caused the partitions
to be unordered if a DEFAULT partition existed.  Since we now know which
partitions were pruned, partitions_are_ordered() now returns true when the
DEFAULT partition was pruned.
---
 src/backend/optimizer/path/allpaths.c |   2 +-
 src/backend/optimizer/path/pathkeys.c |   2 +-
 src/backend/partitioning/partbounds.c | 105 +++++++++++++++++++------
 src/include/partitioning/partbounds.h |  18 ++++-
 src/test/regress/expected/inherit.out | 108 ++++++++++++++++++++++++++
 src/test/regress/sql/inherit.sql      |  27 +++++++
 6 files changed, 235 insertions(+), 27 deletions(-)

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 671117314a..296dd75c1b 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1689,7 +1689,7 @@ generate_orderedappend_paths(PlannerInfo *root, RelOptInfo *rel,
 	 * for both forward and reverse scans.
 	 */
 	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
-		partitions_are_ordered(rel->boundinfo, rel->nparts))
+		partitions_are_ordered(rel->boundinfo, rel->live_parts))
 	{
 		partition_pathkeys = build_partition_pathkeys(root, rel,
 													  ForwardScanDirection,
diff --git a/src/backend/optimizer/path/pathkeys.c b/src/backend/optimizer/path/pathkeys.c
index bd9a176d7d..216dd26385 100644
--- a/src/backend/optimizer/path/pathkeys.c
+++ b/src/backend/optimizer/path/pathkeys.c
@@ -704,7 +704,7 @@ build_partition_pathkeys(PlannerInfo *root, RelOptInfo *partrel,
 	int			i;
 
 	Assert(partscheme != NULL);
-	Assert(partitions_are_ordered(partrel->boundinfo, partrel->nparts));
+	Assert(partitions_are_ordered(partrel->boundinfo, partrel->live_parts));
 	/* For now, we can only cope with baserels */
 	Assert(IS_SIMPLE_REL(partrel));
 
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 25018b1a8b..4fdcf656cd 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -395,6 +395,7 @@ create_hash_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->ndatums = nparts;
 	boundinfo->datums = (Datum **) palloc0(nparts * sizeof(Datum *));
 	boundinfo->kind = NULL;
+	boundinfo->interleaved_parts = NULL;
 	boundinfo->nindexes = greatest_modulus;
 	boundinfo->indexes = (int *) palloc(greatest_modulus * sizeof(int));
 	for (i = 0; i < greatest_modulus; i++)
@@ -543,6 +544,7 @@ create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->ndatums = ndatums;
 	boundinfo->datums = (Datum **) palloc0(ndatums * sizeof(Datum *));
 	boundinfo->kind = NULL;
+	boundinfo->interleaved_parts = NULL;
 	boundinfo->nindexes = ndatums;
 	boundinfo->indexes = (int *) palloc(ndatums * sizeof(int));
 
@@ -607,6 +609,69 @@ create_list_bounds(PartitionBoundSpec **boundspecs, int nparts,
 		boundinfo->default_index = (*mapping)[default_index];
 	}
 
+	/*
+	 * Calculate interleaved partitions.  Here we look for partitions which
+	 * might be interleaved with other partitions and set the
+	 * interleaved_parts field with the partition indexes of any partitions
+	 * which may be interleaved with another partition.
+	 */
+
+	/*
+	 * There must be multiple partitions to have any interleaved partitions,
+	 * otherwise there's nothing to interleave with.
+	 */
+	if (nparts > 1)
+	{
+		/*
+		 * Short-circuit check to see if only 1 Datum is allowed per
+		 * partition.  When this is true there's no need to do the more
+		 * expensive checks to look for interleaved values.
+		 */
+		if (boundinfo->ndatums +
+			partition_bound_accepts_nulls(boundinfo) +
+			partition_bound_has_default(boundinfo) != nparts)
+		{
+			int			last_index = -1;
+
+			/*
+			 * Since the indexes array is sorted in Datum order, if any
+			 * partitions are interleaved then it will show up by the
+			 * partition indexes not being in ascending order.  Here we check
+			 * for that and record all partitions that are out of order.
+			 */
+			for (i = 0; i < boundinfo->nindexes; i++)
+			{
+				int			index = boundinfo->indexes[i];
+
+				if (index < last_index)
+					boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+																  index);
+
+				/*
+				 * Mark the NULL partition as interleaved if we find that it
+				 * allows some other non-NULL Datum.
+				 */
+				if (partition_bound_accepts_nulls(boundinfo) &&
+					index == boundinfo->null_index)
+					boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+																  boundinfo->null_index);
+
+				last_index = index;
+			}
+		}
+
+		/*
+		 * The DEFAULT partition is the "catch all" partition that can contain
+		 * anything that does not belong to any other partition.  If there are
+		 * any other partitions then the DEFAULT partition must be marked as
+		 * interleaved.
+		 */
+		if (partition_bound_has_default(boundinfo))
+			boundinfo->interleaved_parts = bms_add_member(boundinfo->interleaved_parts,
+														  boundinfo->default_index);
+	}
+
+
 	/* All partitions must now have been assigned canonical indexes. */
 	Assert(next_index == nparts);
 	return boundinfo;
@@ -750,6 +815,7 @@ create_range_bounds(PartitionBoundSpec **boundspecs, int nparts,
 	boundinfo->kind = (PartitionRangeDatumKind **)
 		palloc(ndatums *
 			   sizeof(PartitionRangeDatumKind *));
+	boundinfo->interleaved_parts = NULL;
 
 	/*
 	 * For range partitioning, an additional value of -1 is stored as the last
@@ -993,6 +1059,9 @@ partition_bounds_copy(PartitionBoundInfo src,
 	else
 		dest->kind = NULL;
 
+	/* copy interleaved partitions for LIST partitioned tables */
+	dest->interleaved_parts = bms_copy(src->interleaved_parts);
+
 	/*
 	 * For hash partitioning, datums array will have two elements - modulus
 	 * and remainder.
@@ -2780,13 +2849,15 @@ add_merged_range_bounds(int partnatts, FmgrInfo *partsupfuncs,
  *		that is partitions appearing earlier in the PartitionDesc sequence
  *		contain partition keys strictly less than those appearing later.
  *		Also, if NULL values are possible, they must come in the last
- *		partition defined in the PartitionDesc.
+ *		partition defined in the PartitionDesc.  'live_parts' marks which
+ *		partitions we should include when checking the ordering.  Partitions
+ *		that do not appear in 'live_parts' are ignored.
  *
  * If out of order, or there is insufficient info to know the order,
  * then we return false.
  */
 bool
-partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts)
+partitions_are_ordered(PartitionBoundInfo boundinfo, Bitmapset *live_parts)
 {
 	Assert(boundinfo != NULL);
 
@@ -2798,38 +2869,24 @@ partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts)
 			 * RANGE-type partitioning guarantees that the partitions can be
 			 * scanned in the order that they're defined in the PartitionDesc
 			 * to provide sequential, non-overlapping ranges of tuples.
-			 * However, if a DEFAULT partition exists then it doesn't work, as
-			 * that could contain tuples from either below or above the
-			 * defined range, or tuples belonging to gaps between partitions.
+			 * However, if a DEFAULT partition exists and it's contained
+			 * within live_parts, then the partitions are not ordered.
 			 */
-			if (!partition_bound_has_default(boundinfo))
+			if (!partition_bound_has_default(boundinfo) ||
+				!bms_is_member(boundinfo->default_index, live_parts))
 				return true;
 			break;
 
 		case PARTITION_STRATEGY_LIST:
 
 			/*
-			 * LIST partitioning can also guarantee ordering, but only if the
-			 * partitions don't accept interleaved values.  We could likely
-			 * check for this by looping over the PartitionBound's indexes
-			 * array to check that the indexes are in order.  For now, let's
-			 * just keep it simple and just accept LIST partitioning when
-			 * there's no DEFAULT partition, exactly one value per partition,
-			 * and optionally a NULL partition that does not accept any other
-			 * values.  Such a NULL partition will come last in the
-			 * PartitionDesc, and the other partitions will be properly
-			 * ordered.  This is a cheap test to make as it does not require
-			 * any per-partition processing.  Maybe we'd like to handle more
-			 * complex cases in the future.
+			 * LIST partitioned are ordered providing none of live_parts
+			 * overlap with the partitioned table's interleaved partitions.
 			 */
-			if (partition_bound_has_default(boundinfo))
-				return false;
-
-			if (boundinfo->ndatums + partition_bound_accepts_nulls(boundinfo)
-				== nparts)
+			if (!bms_overlap(live_parts, boundinfo->interleaved_parts))
 				return true;
-			break;
 
+			break;
 		default:
 			/* HASH, or some other strategy */
 			break;
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 2f00f9aa3d..9db546def6 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -61,6 +61,18 @@ struct RelOptInfo;				/* avoid including pathnodes.h here */
  * The indexes array is indexed according to the hash key's remainder modulo
  * the greatest modulus, and it contains either the partition index accepting
  * that remainder, or -1 if there is no partition for that remainder.
+ *
+ * For LIST partitioned tables, we track the partition indexes of partitions
+ * which are possibly "interleaved" partitions.  A partition is considered
+ * interleaved if it allows multiple values and there exists at least one
+ * other partition which could contain a value that lies between those values.
+ * For example, if a partition exists FOR VALUES IN(3,5) and another partition
+ * exists FOR VALUES IN (4), then the IN(3,5) partition is an interleaved
+ * partition.  The same is possible with DEFAULT partitions since they can
+ * contain any value that does not belong in another partition.  This field
+ * only serves as proof that a particular partition is not interleaved, not
+ * proof that it is interleaved.  When we're uncertain, we marked the
+ * partition as interleaved.
  */
 typedef struct PartitionBoundInfoData
 {
@@ -70,6 +82,9 @@ typedef struct PartitionBoundInfoData
 	PartitionRangeDatumKind **kind; /* The kind of each range bound datum;
 									 * NULL for hash and list partitioned
 									 * tables */
+	Bitmapset  *interleaved_parts;	/* Partition indexes of partitions which
+									 * may be interleaved. See above. This is
+									 * only set for LIST partitioned tables */
 	int			nindexes;		/* Length of the indexes[] array */
 	int		   *indexes;		/* Partition indexes */
 	int			null_index;		/* Index of the null-accepting partition; -1
@@ -102,7 +117,8 @@ extern PartitionBoundInfo partition_bounds_merge(int partnatts,
 												 JoinType jointype,
 												 List **outer_parts,
 												 List **inner_parts);
-extern bool partitions_are_ordered(PartitionBoundInfo boundinfo, int nparts);
+extern bool partitions_are_ordered(PartitionBoundInfo boundinfo,
+								   Bitmapset *live_parts);
 extern void check_new_partition_bound(char *relname, Relation parent,
 									  PartitionBoundSpec *spec,
 									  ParseState *pstate);
diff --git a/src/test/regress/expected/inherit.out b/src/test/regress/expected/inherit.out
index 06f44287bc..2d49e765de 100644
--- a/src/test/regress/expected/inherit.out
+++ b/src/test/regress/expected/inherit.out
@@ -2180,6 +2180,8 @@ explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
          Index Cond: (a < 20)
 (9 rows)
 
+set enable_bitmapscan to off;
+set enable_sort to off;
 create table mclparted (a int) partition by list(a);
 create table mclparted1 partition of mclparted for values in(1);
 create table mclparted2 partition of mclparted for values in(2);
@@ -2208,7 +2210,113 @@ explain (costs off) select * from mclparted order by a;
    ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
 (6 rows)
 
+explain (costs off) select * from mclparted where a in(3,4,5) order by a;
+                                 QUERY PLAN                                 
+----------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted3_5_a_idx on mclparted3_5 mclparted_1
+         Index Cond: (a = ANY ('{3,4,5}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_2
+         Index Cond: (a = ANY ('{3,4,5}'::integer[]))
+(6 rows)
+
+-- Introduce a NULL and DEFAULT partition so we can test more complex cases
+create table mclparted_null partition of mclparted for values in(null);
+create table mclparted_def partition of mclparted default;
+-- Append can be used providing we don't scan the interleaved partition
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+(7 rows)
+
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+                                   QUERY PLAN                                   
+--------------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted_null_a_idx on mclparted_null mclparted_4
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+(9 rows)
+
+-- Test a more complex case where the NULL partition allows some other value
+drop table mclparted_null;
+create table mclparted_0_null partition of mclparted for values in(0,null);
+-- Ensure MergeAppend is used since 0 and NULLs are in the same partition.
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted_0_null_a_idx on mclparted_0_null mclparted_1
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_2
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_3
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
+         Filter: ((a = ANY ('{1,2,4}'::integer[])) OR (a IS NULL))
+(10 rows)
+
+explain (costs off) select * from mclparted where a in(0,1,2,4) order by a;
+                                     QUERY PLAN                                     
+------------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted_0_null_a_idx on mclparted_0_null mclparted_1
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_2
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_3
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_4
+         Index Cond: (a = ANY ('{0,1,2,4}'::integer[]))
+(10 rows)
+
+-- Ensure Append is used when the null partition is pruned
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+                               QUERY PLAN                               
+------------------------------------------------------------------------
+ Append
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4}'::integer[]))
+(7 rows)
+
+-- Ensure MergeAppend is used when the default partition is not pruned
+explain (costs off) select * from mclparted where a in(1,2,4,100) order by a;
+                                  QUERY PLAN                                  
+------------------------------------------------------------------------------
+ Merge Append
+   Sort Key: mclparted.a
+   ->  Index Only Scan using mclparted1_a_idx on mclparted1 mclparted_1
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted2_a_idx on mclparted2 mclparted_2
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted4_a_idx on mclparted4 mclparted_3
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+   ->  Index Only Scan using mclparted_def_a_idx on mclparted_def mclparted_4
+         Index Cond: (a = ANY ('{1,2,4,100}'::integer[]))
+(10 rows)
+
 drop table mclparted;
+reset enable_sort;
+reset enable_bitmapscan;
 -- Ensure subplans which don't have a path with the correct pathkeys get
 -- sorted correctly.
 drop index mcrparted_a_abs_c_idx;
diff --git a/src/test/regress/sql/inherit.sql b/src/test/regress/sql/inherit.sql
index 64173a8738..195aedb5ff 100644
--- a/src/test/regress/sql/inherit.sql
+++ b/src/test/regress/sql/inherit.sql
@@ -775,6 +775,8 @@ explain (costs off) select a, abs(b) from mcrparted order by a, abs(b), c;
 -- during planning.
 explain (costs off) select * from mcrparted where a < 20 order by a, abs(b), c;
 
+set enable_bitmapscan to off;
+set enable_sort to off;
 create table mclparted (a int) partition by list(a);
 create table mclparted1 partition of mclparted for values in(1);
 create table mclparted2 partition of mclparted for values in(2);
@@ -789,8 +791,33 @@ create table mclparted3_5 partition of mclparted for values in(3,5);
 create table mclparted4 partition of mclparted for values in(4);
 
 explain (costs off) select * from mclparted order by a;
+explain (costs off) select * from mclparted where a in(3,4,5) order by a;
+
+-- Introduce a NULL and DEFAULT partition so we can test more complex cases
+create table mclparted_null partition of mclparted for values in(null);
+create table mclparted_def partition of mclparted default;
+
+-- Append can be used providing we don't scan the interleaved partition
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+
+-- Test a more complex case where the NULL partition allows some other value
+drop table mclparted_null;
+create table mclparted_0_null partition of mclparted for values in(0,null);
+
+-- Ensure MergeAppend is used since 0 and NULLs are in the same partition.
+explain (costs off) select * from mclparted where a in(1,2,4) or a is null order by a;
+explain (costs off) select * from mclparted where a in(0,1,2,4) order by a;
+
+-- Ensure Append is used when the null partition is pruned
+explain (costs off) select * from mclparted where a in(1,2,4) order by a;
+
+-- Ensure MergeAppend is used when the default partition is not pruned
+explain (costs off) select * from mclparted where a in(1,2,4,100) order by a;
 
 drop table mclparted;
+reset enable_sort;
+reset enable_bitmapscan;
 
 -- Ensure subplans which don't have a path with the correct pathkeys get
 -- sorted correctly.
-- 
2.30.2

#7Zhihong Yu
zyu@yugabyte.com
In reply to: David Rowley (#6)
Re: Record a Bitmapset of non-pruned partitions

On Sun, Aug 1, 2021 at 5:31 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com>
wrote:

0001 looks mostly fine, except I thought the following could be worded
to say that the bitmap members are offsets into the part_rels array.
To avoid someone confusing them with RT indexes, for example.

+   Bitmapset  *live_parts;     /* Bitmap with members to indicate which
+                                * partitions survived partition

pruning. */

Yeah, agreed. I've adjusted that.

On 0002:

interleaved_parts idea looks clever. I wonder if you decided that
it's maybe not worth setting that field in the joinrel's
PartitionBoundInfos? For example, adding the code that you added in
create_list_bounds() also in merge_list_bounds().

Currently generate_orderedappend_paths() only checks
partitions_are_ordered() for base and other member rels, so setting
the field for join rels would be a waste of effort given that it's not
used for anything.

I've not really looked into the possibility of enabling this
optimization for partition-wise joined rels. I know that there's a bit
more complexity now due to c8434d64c. I'm not really all that clear on
which cases could be allowed here and which couldn't. It would require
more analysis and I'd say that's a different patch.

... The definition of interleaved
+ * is any partition that can contain multiple different values where

exists at

+ * least one other partition which could contain a value which is

between the

+ * multiple possible values in the other partition.

The sentence sounds a bit off starting at "...where exists". How about:

I must have spent too long writing SQL queries.

"A partition is considered interleaved if it contains multiple values
such that there exists at least one other partition containing a value
that lies between those values [ in terms of partitioning-defined
ordering ]."

That looks better. I took that with some small adjustments.

Looks fine otherwise.

Thanks for the review.

I had another self review of these and I'm pretty happy with them. I'm
quite glad to see the performance of querying a single partition of a
table with large numbers of partitions no longer tails off as much as
it used to.

David

Hi,
Some minor comment.

bq. Here we pass which partitioned

partitioned -> partitions

Here we look for partitions which
+    * might be interleaved with other partitions and set the
+    * interleaved_parts field with the partition indexes of any partitions
+    * which may be interleaved with another partition.

The above seems a little bit repetitive. It can be shortened to remove
repetition.

Cheers

#8David Rowley
dgrowleyml@gmail.com
In reply to: Zhihong Yu (#7)
Re: Record a Bitmapset of non-pruned partitions

Thanks for having a look at this.

On Mon, 2 Aug 2021 at 02:33, Zhihong Yu <zyu@yugabyte.com> wrote:

Here we look for partitions which
+    * might be interleaved with other partitions and set the
+    * interleaved_parts field with the partition indexes of any partitions
+    * which may be interleaved with another partition.

The above seems a little bit repetitive. It can be shortened to remove repetition.

I agree that the word "partition" is mentioned quite a few times. The
only one I can see that could be removed is the "partition indexes"
one. Likely the details about which bit we set can be left up to the
struct field comment in partbounds.h

I've adjusted this to become:

/*
* Calculate interleaved partitions. Here we look for partitions which
* might be interleaved with other partitions and set a bit in
* interleaved_parts for any partitions which may be interleaved with
* another partition.
*/

David

#9Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#6)
Re: Record a Bitmapset of non-pruned partitions

On Sun, Aug 1, 2021 at 9:31 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote:

interleaved_parts idea looks clever. I wonder if you decided that
it's maybe not worth setting that field in the joinrel's
PartitionBoundInfos? For example, adding the code that you added in
create_list_bounds() also in merge_list_bounds().

Currently generate_orderedappend_paths() only checks
partitions_are_ordered() for base and other member rels, so setting
the field for join rels would be a waste of effort given that it's not
used for anything.

I've not really looked into the possibility of enabling this
optimization for partition-wise joined rels. I know that there's a bit
more complexity now due to c8434d64c. I'm not really all that clear on
which cases could be allowed here and which couldn't. It would require
more analysis and I'd say that's a different patch.

Yeah, that makes sense.

...  The definition of interleaved
+ * is any partition that can contain multiple different values where exists at
+ * least one other partition which could contain a value which is between the
+ * multiple possible values in the other partition.

The sentence sounds a bit off starting at "...where exists". How about:

I must have spent too long writing SQL queries.

Hah.

I had another self review of these and I'm pretty happy with them. I'm
quite glad to see the performance of querying a single partition of a
table with large numbers of partitions no longer tails off as much as
it used to.

Nice, glad to see the apply_scanjoin_target_to_paths() loop taken care of.

Thank you.

--
Amit Langote
EDB: http://www.enterprisedb.com

#10David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#6)
Re: Record a Bitmapset of non-pruned partitions

On Mon, 2 Aug 2021 at 00:31, David Rowley <dgrowleyml@gmail.com> wrote:

I had another self review of these and I'm pretty happy with them. I'm
quite glad to see the performance of querying a single partition of a
table with large numbers of partitions no longer tails off as much as
it used to.

I did some profiling and benchmarking on master and with the v4 patch.
With a hash partitioned table containing 8192 partitions I see the
following when running a query that selects a value from a single
partition:

19.39% postgres [.] apply_scanjoin_target_to_paths
5.35% postgres [.] base_yyparse
4.71% postgres [.] AllocSetAlloc
2.86% libc-2.33.so [.] __memset_avx2_unaligned_erms
2.17% postgres [.] SearchCatCacheInternal

With the patched version, I see:

5.89% postgres [.] AllocSetAlloc
3.97% postgres [.] base_yyparse
3.87% libc-2.33.so [.] __memset_avx2_unaligned_erms
2.44% postgres [.] SearchCatCacheInternal
1.29% postgres [.] hash_search_with_hash_value

I'm getting:
master: 16613 tps
patched: 22078 tps

So there's about 32% performance improvement with this number of
partitions. These results are not the same as my original email here
as I've only recently discovered that I really need to pin pgbench and
the postgres backend to the same CPU core to get good and stable
performance from a single threaded pgbench job.

FWIW, the next thing there on the profile the following line in
expand_partitioned_rtentry()

relinfo->part_rels = (RelOptInfo **) palloc0(relinfo->nparts *
sizeof(RelOptInfo *));

If anyone has any objections to both the v4 0001 and 0002 patch, can
they let me know soon. I'm currently seeing no reason that they can't
go in.

David

#11David Rowley
dgrowleyml@gmail.com
In reply to: David Rowley (#10)
Re: Record a Bitmapset of non-pruned partitions

On Mon, 2 Aug 2021 at 20:16, David Rowley <dgrowleyml@gmail.com> wrote:

If anyone has any objections to both the v4 0001 and 0002 patch, can
they let me know soon. I'm currently seeing no reason that they can't
go in.

I've now pushed both of these. Thanks for the reviews.

David

#12Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#9)
1 attachment(s)
Re: Record a Bitmapset of non-pruned partitions

Hi David,

On Mon, Aug 2, 2021 at 11:59 AM Amit Langote <amitlangote09@gmail.com> wrote:

On Sun, Aug 1, 2021 at 9:31 PM David Rowley <dgrowleyml@gmail.com> wrote:

On Fri, 30 Jul 2021 at 19:10, Amit Langote <amitlangote09@gmail.com> wrote:

interleaved_parts idea looks clever. I wonder if you decided that
it's maybe not worth setting that field in the joinrel's
PartitionBoundInfos? For example, adding the code that you added in
create_list_bounds() also in merge_list_bounds().

Currently generate_orderedappend_paths() only checks
partitions_are_ordered() for base and other member rels, so setting
the field for join rels would be a waste of effort given that it's not
used for anything.

I've not really looked into the possibility of enabling this
optimization for partition-wise joined rels. I know that there's a bit
more complexity now due to c8434d64c. I'm not really all that clear on
which cases could be allowed here and which couldn't. It would require
more analysis and I'd say that's a different patch.

Yeah, that makes sense.

Related to the above, I noticed while looking at
build_merged_partition_bounds() that db632fbca3 missed adding a line
to that function to set interleaved_parts to NULL. Because the
PartitionBoundInfo is only palloc'd (not palloc0'd), interleaved_parts
of a "merged" bounds struct ends up pointing to garbage, so let's fix
that. Attached a patch.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

initialize-inteleaved_parts.patchapplication/octet-stream; name=initialize-inteleaved_parts.patchDownload
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index fdfe712f91..d286ef91e1 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -2564,6 +2564,9 @@ build_merged_partition_bounds(char strategy, List *merged_datums,
 		merged_bounds->kind = NULL;
 	}
 
+	/* We don't currently care to set this accurately for merged bounds. */
+	merged_bounds->interleaved_parts = NULL;
+
 	Assert(list_length(merged_indexes) == ndatums);
 	merged_bounds->nindexes = ndatums;
 	merged_bounds->indexes = (int *) palloc(sizeof(int) * ndatums);
#13David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#12)
1 attachment(s)
Re: Record a Bitmapset of non-pruned partitions

On Thu, 30 Sept 2021 at 20:25, Amit Langote <amitlangote09@gmail.com> wrote:

Related to the above, I noticed while looking at
build_merged_partition_bounds() that db632fbca3 missed adding a line
to that function to set interleaved_parts to NULL. Because the
PartitionBoundInfo is only palloc'd (not palloc0'd), interleaved_parts
of a "merged" bounds struct ends up pointing to garbage, so let's fix
that. Attached a patch.

Thanks for the patch.

I think we also need to document that interleaved_parts is not set for
join relations, otherwise someone may in the future try to use that
field for an optimisation for join relations. At the moment, per
generate_orderedappend_paths, we only handle IS_SIMPLE_REL type
relations.

I've attached a patch that updates the comments to mention this.

David

Attachments:

initialize-inteleaved_parts_v2.patchapplication/octet-stream; name=initialize-inteleaved_parts_v2.patchDownload
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index fdfe712f91..95798f4f66 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -2564,6 +2564,9 @@ build_merged_partition_bounds(char strategy, List *merged_datums,
 		merged_bounds->kind = NULL;
 	}
 
+	/* interleaved_parts is always NULL for join relations. */
+	merged_bounds->interleaved_parts = NULL;
+
 	Assert(list_length(merged_indexes) == ndatums);
 	merged_bounds->nindexes = ndatums;
 	merged_bounds->indexes = (int *) palloc(sizeof(int) * ndatums);
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index 9db546def6..7138cb1f2a 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -72,7 +72,9 @@ struct RelOptInfo;				/* avoid including pathnodes.h here */
  * contain any value that does not belong in another partition.  This field
  * only serves as proof that a particular partition is not interleaved, not
  * proof that it is interleaved.  When we're uncertain, we marked the
- * partition as interleaved.
+ * partition as interleaved.  The interleaved_parts field is only ever set for
+ * RELOPT_BASEREL and RELOPT_OTHER_MEMBER_REL, it is always left NULL for join
+ * relations.
  */
 typedef struct PartitionBoundInfoData
 {
#14Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#13)
Re: Record a Bitmapset of non-pruned partitions

On Fri, Oct 1, 2021 at 7:07 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 30 Sept 2021 at 20:25, Amit Langote <amitlangote09@gmail.com> wrote:

Related to the above, I noticed while looking at
build_merged_partition_bounds() that db632fbca3 missed adding a line
to that function to set interleaved_parts to NULL. Because the
PartitionBoundInfo is only palloc'd (not palloc0'd), interleaved_parts
of a "merged" bounds struct ends up pointing to garbage, so let's fix
that. Attached a patch.

Thanks for the patch.

I think we also need to document that interleaved_parts is not set for
join relations, otherwise someone may in the future try to use that
field for an optimisation for join relations. At the moment, per
generate_orderedappend_paths, we only handle IS_SIMPLE_REL type
relations.

I've attached a patch that updates the comments to mention this.

Looks good to me. Thanks.

--
Amit Langote
EDB: http://www.enterprisedb.com

#15David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#14)
Re: Record a Bitmapset of non-pruned partitions

On Fri, 1 Oct 2021 at 13:37, Amit Langote <amitlangote09@gmail.com> wrote:

I've attached a patch that updates the comments to mention this.

Looks good to me. Thanks.

Thanks. Pushed.

David