Speeding up INSERTs and UPDATEs to partitioned tables

Started by David Rowley over 7 years ago · 77 messages
#1 David Rowley
david.rowley@2ndquadrant.com
2 attachment(s)

Hi,

As part of my efforts to make partitioning scale better for larger
numbers of partitions, I've been looking at primarily INSERT VALUES
performance. Here the overheads are almost completely in the
executor. Planning of this type of statement is very simple since
there is no FROM clause to process.

My benchmarks have been around a RANGE partitioned table with 10k leaf
partitions and no sub-partitioned tables. The partition key is a
timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed
and there are a number of things slow about it. The biggest culprit
for the slowness is the locking of each partition inside of
find_all_inheritors(). For now, this needs to remain as we must hold
locks on each partition while performing RelationBuildPartitionDesc(),
otherwise, one of the partitions may get dropped out from under us.
There might be other valid reasons too, but please see my note at the
bottom of this email.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo uses a List and lappend() must
perform a palloc() each time a partition is added to the list (see the
sketch after this list).
2. A foreach loop is performed over leaf_parts to search for subplans
belonging to this partition. This seems pointless to do for INSERTs as
there are never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions
array. If a single tuple was inserted then all but one of the elements
will be NULL.
4. Tuple conversion map allocates an empty array thinking there might
be something to put into it. This is costly when the array is large
and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is
called to determine if the partition is a partitioned table or a leaf
partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the
indexes array with a sequence of numbers from 0 to nparts - 1 if there
are no sub-partitioned tables. Doing this is slow when there are many
partitions.
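
To give item 1 above something concrete, here's a minimal sketch (the
helper names are hypothetical, not code from the patch): lappend_oid()
allocates a new ListCell for every element appended, whereas filling an
array of known size costs a single palloc():

/* Sketch only: List accumulation; one ListCell allocation per element */
static List *
collect_oids_list(Oid *oids, int nparts)
{
	List	   *result = NIL;
	int			i;

	for (i = 0; i < nparts; i++)
		result = lappend_oid(result, oids[i]);
	return result;
}

/* Sketch only: array accumulation; a single palloc() for everything */
static Oid *
collect_oids_array(Oid *oids, int nparts)
{
	Oid		   *result = (Oid *) palloc(nparts * sizeof(Oid));

	memcpy(result, oids, nparts * sizeof(Oid));
	return result;
}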

Besides the locking, the only thing that remains slow now is the
palloc0() for the 'partitions' array. In my test, it takes 0.6% of
execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and clean up just those
ones. Make this array small and increase it only when we need more
space (see the sketch below).
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

The only questionable thing I see is what I did for 6. In partcache.c
I'm basically building an array of nparts elements containing 0 to
nparts - 1. It seems a bit pointless, so perhaps there's a better way. I
was also a bit too tight to memcpy() that out of relcache, and just
pointed directly to it. That might be a no-go area.
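
In code form, the patterns for items 3 and 4 boil down to the following
(simplified excerpts of what 0001 does in ExecGetPartitionInfo and
ExecInitRoutingInfo; see the attached patch for the real thing):

/* Item 3: gapless tracking array which doubles only when it fills up,
 * capped at the total number of partitions */
if (proute->num_partitions_init == proute->partitions_init_size)
{
	proute->partitions_init_size =
		Min(proute->partitions_init_size * 2, proute->num_partitions);
	proute->partitions_init = (ResultRelInfo **)
		repalloc(proute->partitions_init,
				 proute->partitions_init_size * sizeof(ResultRelInfo *));
}
proute->partitions_init[proute->num_partitions_init++] = result;

/* Item 4: allocate the (possibly large) maps array only when the first
 * non-NULL conversion map must be stored; it remains NULL otherwise */
if (map)
{
	if (!proute->parent_child_tupconv_maps)
		proute->parent_child_tupconv_maps = (TupleConversionMap **)
			palloc0(proute->num_partitions * sizeof(TupleConversionMap *));
	proute->parent_child_tupconv_maps[partidx] = map;
}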

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It's just a demo of where we could get
the performance if we were smarter about locking partitions. I've just
included this to show 0001's worth.

Performance

AWS: m5d.large fsync=off

Unpatched:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 2836
latency average = 21.162 ms
tps = 47.254409 (including connections establishing)
tps = 47.255756 (excluding connections establishing)

(yes, it's bad)

0001:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 3235
latency average = 18.548 ms
tps = 53.913121 (including connections establishing)
tps = 53.914629 (excluding connections establishing)

(a small improvement from 0001)

0001+0002:

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 660079
latency average = 0.091 ms
tps = 11001.303764 (including connections establishing)
tps = 11001.602377 (excluding connections establishing)

(something to aspire towards)

0002 (only):

$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 27682
latency average = 2.168 ms
tps = 461.350885 (including connections establishing)
tps = 461.363327 (excluding connections establishing)

(shows that doing 0002 alone does not fix all our problems)

Unpartitioned table (control test):

$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 801260
latency average = 0.075 ms
tps = 13354.311397 (including connections establishing)
tps = 13354.656163 (excluding connections establishing)

Test setup:

CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);
CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_Series(0,9999) x;
\gexec
\o

partbench_insert.sql contains:
insert into partbench values('2018-04-26 15:00:00',1,2,3,4,5);

partbench__insert.sql contains:
insert into partbench_ values('2018-04-26 15:00:00',1,2,3,4,5);
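
(For reference, '2018-04-26 15:00:00' lies 9999 hours, i.e. 416 days
and 15 hours, after the '2017-03-06' origin, so every transaction in
partbench_insert.sql routes its tuple to the last partition,
partbench9999.)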

I don't want to discuss the locking on this thread. That discussion
will detract from discussing what I'm proposing here... Which is not
to change anything relating to locks. I'm still working on that and
will post elsewhere. Please start another thread if you'd like to
discuss that in the meantime. Feel free to link it in here so others
can follow.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From cb81abc25bcfcc146c5f0c46e5d1345790bd3d8e Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 22 Jun 2018 15:05:42 +1200
Subject: [PATCH v1 1/2] Speed up INSERT and UPDATE on partitioned tables

Various changes have been made here to reduce the overhead of executor init
of INSERT and UPDATE plans which perform the operation on a partitioned
table.  Tests done against partitioned tables with many partitions show
that there are a number of bottlenecks in the
ExecSetupPartitionTupleRouting code.  Namely, locking all the partitions
when we may require inserting into just one partition is quite a costly
overhead.  This commit does not change anything relating to the locks; it
does, however, remove all the other bottlenecks.  Lock reduction will need
to be left for another day.

This commit also moves some of the work being done in
ExecSetupPartitionTupleRouting and the functions which it calls so that the
setup work is pre-calculated by the relcache code.  Particular care has
been taken in get_partition_dispatch_recurse to speed up the code.
Dereferencing the input parameters once per call, rather than once per
loop iteration, made for a noticeable increase in performance.  Also,
changing the
leaf_part_oids List into an array speeds things up considerably both
because that's the final form we need that data in, and also because it
saves constant palloc() calls which are made in lappend.

Initialization of the parent/child translation maps array is now only
performed when we need to store the first translation map.  If the column
order between the parent and its child is the same, then no map ever needs
to be stored, so previously this (possibly large) array served no purpose.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to walk an array which often contained mostly NULLs,
all of which had to be skipped.  To speed this up we now keep track of
exactly which ResultRelInfos have been initialized.  These are stored in a
new array which we expand on demand.  Technically we could initialize the
full array size on the first allocation, but profiles indicated a higher
overhead when that memory context was destroyed, presumably due to some
extra malloc/free calls which had resulted due to the large array
allocation.
---
 src/backend/commands/copy.c            |  17 +-
 src/backend/executor/execPartition.c   | 370 ++++++++++++++++++++++-----------
 src/backend/executor/nodeModifyTable.c |  16 +-
 src/backend/utils/cache/partcache.c    |  32 ++-
 src/include/catalog/partition.h        |  13 +-
 src/include/executor/execPartition.h   |  17 +-
 6 files changed, 319 insertions(+), 146 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..25bec76c1d 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2644,15 +2644,10 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = proute->partitions[leaf_part_index];
-			if (resultRelInfo == NULL)
-			{
-				resultRelInfo = ExecInitPartitionInfo(mtstate,
-													  saved_resultRelInfo,
-													  proute, estate,
-													  leaf_part_index);
-				Assert(resultRelInfo != NULL);
-			}
+			resultRelInfo = ExecGetPartitionInfo(mtstate,
+												 saved_resultRelInfo,
+												 proute, estate,
+												 leaf_part_index);
 
 			/*
 			 * For ExecInsertIndexTuples() to work on the partition's indexes
@@ -2693,7 +2688,9 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+												proute->parent_child_tupconv_maps[leaf_part_index] :
+												NULL,
 											  tuple,
 											  proute->partition_tuple_slot,
 											  &slot);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7a4665cc4e..1a3a67dd0d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,17 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
-
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *resultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate, int partidx);
 static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
+								 int *num_parted, Oid **leaf_part_oids,
+								 int *n_leaf_part_oids);
 static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+							   List **pds, Oid **leaf_part_oids,
+							   int *n_leaf_part_oids,
+							   int *leaf_part_oid_size);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -65,22 +71,18 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * While we allocate the arrays of pointers of ResultRelInfo and
  * TupleConversionMap for all partitions here, actual objects themselves are
  * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * see ExecInitPartitionInfo.  However, if the function is invoked for UPDATE
+ * tuple routing, the caller will have already initialized ResultRelInfo's for
+ * each partition present in the ModifyTable's subplans. These are reused and
+ * assigned to their respective slot in the aforementioned array.  For such
+ * partitions, we delay setting up objects such as TupleConversionMap until
+ * those are actually chosen as the partitions to route tuples to.  See
+ * ExecPrepareTupleRouting.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
 	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
 	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
@@ -90,32 +92,36 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 * partitions.
 	 */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
 	proute->partition_dispatch_info =
 		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
+										 &proute->partition_oids, &nparts);
+
+	proute->num_partitions = nparts;
 	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
+		(ResultRelInfo **) palloc0(nparts * sizeof(ResultRelInfo *));
 
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
+	/*
+	 * Allocate an array to store ResultRelInfos that we'll later allocate.
+	 * It is common that not all partitions will have tuples routed to them,
+	 * so we'll refrain from allocating enough space for all partitions here.
+	 * Let's just start with something small and make it bigger only when
+	 * needed.  Storing these separately rather than relying on the
+	 *'partitions' array allows us to quickly identify which ResultRelInfos we
+	 * must teardown at the end.
+	 */
+	proute->partitions_init_size = Min(nparts, 8);
+
+	proute->partitions_init = (ResultRelInfo **)
+		palloc(proute->partitions_init_size * sizeof(ResultRelInfo *));
+
+	proute->num_partitions_init = 0;
+
+	/* We only allocate this when we need to store the first non-NULL map */
+	proute->parent_child_tupconv_maps = NULL;
+
+	proute->child_parent_tupconv_maps = NULL;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
@@ -125,50 +131,70 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/* Set up details specific to the type of tuple routing we are doing. */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
+		ResultRelInfo *update_rri = NULL;
+		int			num_update_rri = 0,
+					update_rri_index = 0;
 
-		proute->partition_oids[i] = leaf_oid;
+		update_rri = mtstate->resultRelInfo;
+		num_update_rri = list_length(node->plans);
+		proute->subplan_partition_offsets =
+			palloc(num_update_rri * sizeof(int));
+		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+
+		for (i = 0; i < nparts; i++)
 		{
-			leaf_part_rri = &update_rri[update_rri_index];
+			Oid			leaf_oid = proute->partition_oids[i];
 
 			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
+			 * If the leaf partition is already present in the per-subplan
+			 * result rels, we re-use that rather than initialize a new result
+			 * rel. The per-subplan resultrels and the resultrels of the leaf
+			 * partitions are both in the same canonical order. So while going
+			 * through the leaf partition oids, we need to keep track of the
+			 * next per-subplan result rel to be looked for in the leaf
+			 * partition resultrels.
 			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+			if (update_rri_index < num_update_rri &&
+				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
+			{
+				ResultRelInfo *leaf_part_rri;
+
+				leaf_part_rri = &update_rri[update_rri_index];
+
+				/*
+				 * This is required in order to convert the partition's tuple
+				 * to be compatible with the root partitioned table's tuple
+				 * descriptor.  When generating the per-subplan result rels,
+				 * this was not set.
+				 */
+				leaf_part_rri->ri_PartitionRoot = rel;
+
+				/* Remember the subplan offset for this ResultRelInfo */
+				proute->subplan_partition_offsets[update_rri_index] = i;
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+				update_rri_index++;
 
-			update_rri_index++;
+				proute->partitions[i] = leaf_part_rri;
+			}
 		}
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		/*
+		 * We should have found all the per-subplan resultrels in the leaf
+		 * partitions.
+		 */
+		Assert(update_rri_index == num_update_rri);
+	}
+	else
+	{
+		proute->root_tuple_slot = NULL;
+		proute->subplan_partition_offsets = NULL;
+		proute->num_subplan_partition_offsets = 0;
 	}
-
-	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
-	 */
-	Assert(update_rri_index == num_update_rri);
 
 	return proute;
 }
@@ -291,13 +317,61 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	return result;
 }
 
+/*
+ * ExecGetPartitionInfo
+ *		Fetch ResultRelInfo for partidx
+ *
+ * Sets up ResultRelInfo, if not done already.
+ */
+ResultRelInfo *
+ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx)
+{
+	ResultRelInfo *result = proute->partitions[partidx];
+
+	if (result)
+		return result;
+
+	result = ExecInitPartitionInfo(mtstate,
+								   resultRelInfo,
+								   proute,
+								   estate,
+								   partidx);
+	Assert(result);
+
+	proute->partitions[partidx] = result;
+
+	/*
+	 * Record the ones setup so far in setup order.  This makes the cleanup
+	 * operation more efficient when very few have been setup.
+	 */
+	if (proute->num_partitions_init == proute->partitions_init_size)
+	{
+		/* First allocate more space if the array is not large enough */
+		proute->partitions_init_size =
+			Min(proute->partitions_init_size * 2, proute->num_partitions);
+
+		proute->partitions_init = (ResultRelInfo **)
+				repalloc(proute->partitions_init,
+				proute->partitions_init_size * sizeof(ResultRelInfo *));
+	}
+
+	proute->partitions_init[proute->num_partitions_init++] = result;
+
+	Assert(proute->num_partitions_init <= proute->num_partitions);
+
+	return result;
+}
+
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
  *
  * Returns the ResultRelInfo
  */
-ResultRelInfo *
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *resultRelInfo,
 					  PartitionTupleRouting *proute,
@@ -500,7 +574,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -550,6 +623,11 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = proute->parent_child_tupconv_maps ?
+				proute->parent_child_tupconv_maps[partidx] : NULL;
+
 			Assert(node->onConflictSet != NIL);
 			Assert(resultRelInfo->ri_onConflict != NULL);
 
@@ -671,6 +749,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -681,10 +760,19 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (!proute->parent_child_tupconv_maps)
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(proute->num_partitions * sizeof(TupleConversionMap *));
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -805,7 +893,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -822,13 +909,9 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
-	for (i = 0; i < proute->num_partitions; i++)
+	for (i = 0; i < proute->num_partitions_init; i++)
 	{
-		ResultRelInfo *resultRelInfo = proute->partitions[i];
-
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
+		ResultRelInfo *resultRelInfo = proute->partitions_init[i];
 
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
@@ -837,24 +920,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
 														   resultRelInfo);
 
-		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
-		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
-		{
-			subplan_index++;
-			continue;
-		}
-
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
@@ -868,31 +933,36 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 
 /*
  * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
+ *		Returns an array of PartitionDispatch as is required for routing
+ *		tuples to the correct partition.
  *
+ * 'num_parted' is set to the size of the returned array and the
+ *'leaf_part_oids' array is allocated and populated with each leaf partition
+ * Oid in the hierarchy. 'n_leaf_part_oids' is set to the size of that array.
  * All the relations in the partition tree (including 'rel') must have been
  * locked (using at least the AccessShareLock) by the caller.
  */
 static PartitionDispatch *
 RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
+								 int *num_parted, Oid **leaf_part_oids,
+								 int *n_leaf_part_oids)
 {
 	List	   *pdlist = NIL;
 	PartitionDispatchData **pd;
 	ListCell   *lc;
 	int			i;
+	int			leaf_part_oid_size;
 
 	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
 
 	*num_parted = 0;
-	*leaf_part_oids = NIL;
+	*n_leaf_part_oids = 0;
+
+	leaf_part_oid_size = 0;
+	*leaf_part_oids = NULL;
 
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
+	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids,
+								   n_leaf_part_oids, &leaf_part_oid_size);
 	*num_parted = list_length(pdlist);
 	pd = (PartitionDispatchData **) palloc(*num_parted *
 										   sizeof(PartitionDispatchData *));
@@ -909,9 +979,9 @@ RelationGetPartitionDispatchInfo(Relation rel,
  * get_partition_dispatch_recurse
  *		Recursively expand partition tree rooted at rel
  *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
+ * As the partition tree is expanded in a depth-first manner, we populate
+ * '*pds' with PartitionDispatch objects of each partitioned table we find,
+ * and populate leaf_part_oids with each leaf partition OID found.
  *
  * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
  * the order in which the planner's expand_partitioned_rtentry() processes
@@ -920,16 +990,27 @@ RelationGetPartitionDispatchInfo(Relation rel,
  * planner side, whereas we'll always have the complete list; but unpruned
  * partitions will appear in the same order in the plan as they are returned
  * here.
+ *
+ * Note: Callers must not attempt to pfree the 'leaf_part_oids' array.
  */
 static void
 get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
+							   List **pds, Oid **leaf_part_oids,
+							   int *n_leaf_part_oids,
+							   int *leaf_part_oid_size)
 {
 	TupleDesc	tupdesc = RelationGetDescr(rel);
 	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
 	PartitionKey partkey = RelationGetPartitionKey(rel);
 	PartitionDispatch pd;
 	int			i;
+	int			nparts;
+	int			oid_array_used;
+	int			oid_array_size;
+	Oid		   *oid_array;
+	Oid		   *partdesc_oids;
+	bool	   *partdesc_subpartitions;
+	int		   *indexes;
 
 	check_stack_depth();
 
@@ -960,6 +1041,21 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
 		/* Not required for the root partitioned table */
 		pd->tupslot = NULL;
 		pd->tupmap = NULL;
+
+		/*
+		 * If the parent has no sub partitions then we can skip calculating
+		 * all the leaf partitions and just return all the oids at this level.
+		 * In this case, the indexes were also pre-calculated for us by the
+		 * syscache code.
+		 */
+		if (!partdesc->hassubpart)
+		{
+			*leaf_part_oids = partdesc->oids;
+			/* XXX or should we memcpy this out of syscache? */
+			pd->indexes = partdesc->indexes;
+			*n_leaf_part_oids = partdesc->nparts;
+			return;
+		}
 	}
 
 	/*
@@ -980,15 +1076,38 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
 	 * corresponding sub-partition; otherwise, we've identified the correct
 	 * partition.
 	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
+	oid_array_used = *n_leaf_part_oids;
+	oid_array_size = *leaf_part_oid_size;
+	oid_array = *leaf_part_oids;
+	nparts = partdesc->nparts;
+
+	if (!oid_array)
+	{
+		oid_array_size = *leaf_part_oid_size = nparts;
+		*leaf_part_oids = (Oid *) palloc(sizeof(Oid) * nparts);
+		oid_array = *leaf_part_oids;
+	}
+
+	partdesc_oids = partdesc->oids;
+	partdesc_subpartitions = partdesc->subpartitions;
+
+	pd->indexes = indexes = (int *) palloc(nparts * sizeof(int));
+
+	for (i = 0; i < nparts; i++)
 	{
-		Oid			partrelid = partdesc->oids[i];
+		Oid			partrelid = partdesc_oids[i];
 
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
+		if (!partdesc_subpartitions[i])
 		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
+			if (oid_array_size <= oid_array_used)
+			{
+				oid_array_size *= 2;
+				oid_array = (Oid *) repalloc(oid_array,
+											 sizeof(Oid) * oid_array_size);
+			}
+
+			oid_array[oid_array_used] = partrelid;
+			indexes[i] = oid_array_used++;
 		}
 		else
 		{
@@ -998,10 +1117,23 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
 			 */
 			Relation	partrel = heap_open(partrelid, NoLock);
 
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
+			*n_leaf_part_oids = oid_array_used;
+			*leaf_part_oid_size = oid_array_size;
+			*leaf_part_oids = oid_array;
+
+			indexes[i] = -list_length(*pds);
+			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids,
+										   n_leaf_part_oids, leaf_part_oid_size);
+
+			oid_array_used = *n_leaf_part_oids;
+			oid_array_size = *leaf_part_oid_size;
+			oid_array = *leaf_part_oids;
 		}
 	}
+
+	*n_leaf_part_oids = oid_array_used;
+	*leaf_part_oid_size = oid_array_size;
+	*leaf_part_oids = oid_array;
 }
 
 /* ----------------
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867971..8f62f35cd2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1682,15 +1682,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 								estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
-	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	/* Get the ResultRelInfo corresponding to the selected partition. */
+	partrel = ExecGetPartitionInfo(mtstate, targetRelInfo, proute, estate,
+								   partidx);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1756,7 +1750,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+								proute->parent_child_tupconv_maps[partidx] :
+								NULL,
 							  tuple,
 							  proute->partition_tuple_slot,
 							  &slot);
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..b36b7366e5 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->subpartitions = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -774,6 +775,7 @@ RelationBuildPartitionDesc(Relation rel)
 		}
 
 		result->boundinfo = boundinfo;
+		result->hassubpart = false; /* unless we discover otherwise below */
 
 		/*
 		 * Now assign OIDs from the original array into mapped indexes of the
@@ -782,7 +784,35 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+			bool		subpart;
+
+			result->oids[index] = oids[i];
+
+			subpart = (get_rel_relkind(oids[i]) == RELKIND_PARTITIONED_TABLE);
+			/* Record if the partition is a subpartitioned table */
+			result->subpartitions[index] = subpart;
+			result->hassubpart |= subpart;
+		}
+
+		/*
+		 * If there are no subpartitions then we can pre-calculate the
+		 * PartitionDispatch->indexes array.  Doing this here saves quite a
+		 * bit of overhead on simple queries which perform INSERTs or UPDATEs
+		 * on partitioned tables with many partitions.  The pre-calculation is
+		 * very simple.  All we need to store is a sequence of numbers from 0
+		 * to nparts - 1.
+		 */
+		if (!result->hassubpart)
+		{
+			result->indexes = (int *) palloc(nparts * sizeof(int));
+			for (i = 0; i < nparts; i++)
+				result->indexes[i] = i;
+		}
+		else
+			result->indexes = NULL;
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..a8c69ff224 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,18 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* OIDs array of 'nparts' of partitions in
+								 * partbound order */
+	int		   *indexes;		/* Stores index for corresponding 'oids'
+								 * element for use in tuple routing, or NULL
+								 * if hassubpart is true.
+								 */
+	bool	   *subpartitions;	/* Array of 'nparts' set to true if the
+								 * corresponding 'oids' element belongs to a
+								 * sub-partitioned table.
+								 */
+	bool		hassubpart;		/* true if any oid belongs to a
+								 * sub-partitioned table */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..822f66f5e2 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -65,13 +65,17 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  * partitions					Array of ResultRelInfo* objects with one entry
  *								for every leaf partition in the partition tree,
  *								initialized lazily by ExecInitPartitionInfo.
+ * partitions_init				Array of ResultRelInfo* objects in the order
+ *								that they were lazily initialized.
  * num_partitions				Number of leaf partitions in the partition tree
  *								(= 'partitions_oid'/'partitions' array length)
+ * num_partitions_init			Number of leaf partition lazily setup so far.
+ * partitions_init_size			Size of partitions_init array.
  * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert tuple from the root table's rowtype to
  *								a leaf partition's rowtype after tuple routing
- *								is done)
+ *								is done). Remains NULL if no maps to store.
  * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
  *								entry for every leaf partition (required to
  *								convert an updated tuple from the leaf
@@ -102,7 +106,10 @@ typedef struct PartitionTupleRouting
 	int			num_dispatch;
 	Oid		   *partition_oids;
 	ResultRelInfo **partitions;
+	ResultRelInfo **partitions_init;
 	int			num_partitions;
+	int			num_partitions_init;
+	int			partitions_init_size;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
@@ -190,10 +197,10 @@ extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
 				  PartitionDispatch *pd,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
-- 
2.16.2.windows.1

v1-0002-Unsafe-locking-reduction-for-partitioned-INSERT-U.patch (application/octet-stream)
From 42e975fcf9e1c2c7920544721965c641ce6bb1a1 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 22 Jun 2018 15:40:46 +1200
Subject: [PATCH v1 2/2] Unsafe locking reduction for partitioned
 INSERT/UPDATEs

For performance demonstration purposes only.
---
 src/backend/executor/execPartition.c | 20 ++------------------
 1 file changed, 2 insertions(+), 18 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a3a67dd0d..cb6a4c3ff0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -65,9 +65,6 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * tuple routing for partitioned tables, encapsulates it in
  * PartitionTupleRouting, and returns it.
  *
- * Note that all the relations in the partition tree are locked using the
- * RowExclusiveLock mode upon return from this function.
- *
  * While we allocate the arrays of pointers of ResultRelInfo and
  * TupleConversionMap for all partitions here, actual objects themselves are
  * lazily allocated for a given partition if a tuple is actually routed to it;
@@ -87,11 +84,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
 	proute->partition_dispatch_info =
 		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
@@ -386,11 +378,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
 
-	/*
-	 * We locked all the partitions in ExecSetupPartitionTupleRouting
-	 * including the leaf partitions.
-	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(proute->partition_oids[partidx], RowExclusiveLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -1111,11 +1099,7 @@ get_partition_dispatch_recurse(Relation rel, Relation parent,
 		}
 		else
 		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
+			Relation	partrel = heap_open(partrelid, RowExclusiveLock);
 
 			*n_leaf_part_oids = oid_array_used;
 			*leaf_part_oid_size = oid_array_size;
-- 
2.16.2.windows.1

#2 David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#1)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 22 June 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and clean up just those
ones. Make this array small and increase it only when we need more
space (see the sketch below).
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#3 Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: David Rowley (#2)
RE: Speeding up INSERTs and UPDATEs to partitioned tables

Hi,

I tried to benchmark with v1-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch, but when I create the second partition, the server process gets a segmentation fault.

I don't know the cause, but it seems that an incorrect value is set in partdesc->boundinfo.

(gdb) p partdesc->boundinfo[0]
$6 = {strategy = 0 '\000', ndatums = 2139062142, datums = 0x7f7f7f7f7f7f7f7f, kind = 0x7f7f7f7f7f7f7f7f, indexes = 0x7f7f7f7f7f7f7f7f, null_index = 2139062143, default_index = 2139062143}
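
(The 0x7f7f7f7f byte pattern is what assert-enabled builds write over
pfree'd memory via CLOBBER_FREED_MEMORY, which suggests boundinfo is
pointing into memory that has already been freed.)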

$ psql postgres
psql (11beta2)
Type "help" for help.

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
!>

2018-07-05 14:02:52.405 JST [60250] LOG: server process (PID 60272) was terminated by signal 11: Segmentation fault
2018-07-05 14:02:52.405 JST [60250] DETAIL: Failed process was running: create table a_2 partition of a for values from(200) to (400);

(gdb) bt
#0 0x0000000000596e52 in get_default_oid_from_partdesc (partdesc=0x259e928) at partition.c:269
#1 0x0000000000677355 in DefineRelation (stmt=0x259e610, relkind=114 'r', ownerId=10, typaddress=0x0, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);") at tablecmds.c:832
#2 0x00000000008b6893 in ProcessUtilitySlow (pstate=0x259e4f8, pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL,
params=0x0, queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:1000
#3 0x00000000008b66c2 in standard_ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:920
#4 0x00000000008b583b in ProcessUtility (pstmt=0x24d67d8, queryString=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0,
dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at utility.c:360
#5 0x00000000008b482c in PortalRunUtility (portal=0x253af38, pstmt=0x24d67d8, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:1178
#6 0x00000000008b4a45 in PortalRunMulti (portal=0x253af38, isTopLevel=true, setHoldSnapshot=false, dest=0x24d6ac8, altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:1324
#7 0x00000000008b3f7d in PortalRun (portal=0x253af38, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x24d6ac8, altdest=0x24d6ac8, completionTag=0x7ffc05932330 "") at pquery.c:799
#8 0x00000000008adf16 in exec_simple_query (query_string=0x24d58b8 "create table a_2 partition of a for values from(200) to (400);") at postgres.c:1122
#9 0x00000000008b21a5 in PostgresMain (argc=1, argv=0x24ff5b0, dbname=0x24ff410 "postgres", username=0x24d2358 "symfo") at postgres.c:4153
#10 0x00000000008113f4 in BackendRun (port=0x24f73f0) at postmaster.c:4361
#11 0x0000000000810b67 in BackendStartup (port=0x24f73f0) at postmaster.c:4033
#12 0x000000000080d0ed in ServerLoop () at postmaster.c:1706
#13 0x000000000080c9a3 in PostmasterMain (argc=1, argv=0x24d0310) at postmaster.c:1379
#14 0x0000000000737392 in main (argc=1, argv=0x24d0310) at main.c:228

(gdb) disassemble
Dump of assembler code for function get_default_oid_from_partdesc:
0x0000000000596e0a <+0>: push %rbp
0x0000000000596e0b <+1>: mov %rsp,%rbp
0x0000000000596e0e <+4>: mov %rdi,-0x8(%rbp)
0x0000000000596e12 <+8>: cmpq $0x0,-0x8(%rbp)
0x0000000000596e17 <+13>: je 0x596e56 <get_default_oid_from_partdesc+76>
0x0000000000596e19 <+15>: mov -0x8(%rbp),%rax
0x0000000000596e1d <+19>: mov 0x10(%rax),%rax
0x0000000000596e21 <+23>: test %rax,%rax
0x0000000000596e24 <+26>: je 0x596e56 <get_default_oid_from_partdesc+76>
0x0000000000596e26 <+28>: mov -0x8(%rbp),%rax
0x0000000000596e2a <+32>: mov 0x10(%rax),%rax
0x0000000000596e2e <+36>: mov 0x24(%rax),%eax
0x0000000000596e31 <+39>: cmp $0xffffffff,%eax
0x0000000000596e34 <+42>: je 0x596e56 <get_default_oid_from_partdesc+76>
0x0000000000596e36 <+44>: mov -0x8(%rbp),%rax
0x0000000000596e3a <+48>: mov 0x8(%rax),%rdx
0x0000000000596e3e <+52>: mov -0x8(%rbp),%rax
0x0000000000596e42 <+56>: mov 0x10(%rax),%rax
0x0000000000596e46 <+60>: mov 0x24(%rax),%eax
0x0000000000596e49 <+63>: cltq
0x0000000000596e4b <+65>: shl $0x2,%rax
0x0000000000596e4f <+69>: add %rdx,%rax
=> 0x0000000000596e52 <+72>: mov (%rax),%eax
0x0000000000596e54 <+74>: jmp 0x596e5b <get_default_oid_from_partdesc+81>
0x0000000000596e56 <+76>: mov $0x0,%eax
0x0000000000596e5b <+81>: pop %rbp
0x0000000000596e5c <+82>: retq
End of assembler dump.

(gdb) i r
rax 0x20057e77c 8595695484
rbx 0x72 114
rcx 0x7f50ce90e0e8 139985039712488
rdx 0x259e980 39446912
rsi 0x7f50ce90e0a8 139985039712424
rdi 0x259e928 39446824
rbp 0x7ffc05931890 0x7ffc05931890
rsp 0x7ffc05931890 0x7ffc05931890
r8 0x7ffc059317bf 140720402012095
r9 0x0 0
r10 0x6b 107
r11 0x7f50cdbc3f10 139985025777424
r12 0x70 112
r13 0x0 0
r14 0x0 0
r15 0x0 0
rip 0x596e52 0x596e52 <get_default_oid_from_partdesc+72>
eflags 0x10202 [ IF RF ]
cs 0x33 51
ss 0x2b 43
ds 0x0 0
es 0x0 0
fs 0x0 0
gs 0x0 0

(gdb) list *0x596e52
0x596e52 is in get_default_oid_from_partdesc (partition.c:269).
264 Oid
265 get_default_oid_from_partdesc(PartitionDesc partdesc)
266 {
267 if (partdesc && partdesc->boundinfo &&
268 partition_bound_has_default(partdesc->boundinfo))
269 return partdesc->oids[partdesc->boundinfo->default_index];
270
271 return InvalidOid;
272 }
273

regards,
-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Sent: Saturday, June 23, 2018 7:19 AM
To: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 22 June 2018 at 18:28, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and clean up just those
ones. Make this array small and increase it only when we need more
space (see the sketch below).
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

I've added this to the July 'fest:

https://commitfest.postgresql.org/18/1690/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#4 David Rowley
david.rowley@2ndquadrant.com
In reply to: Kato, Sho (#3)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 5 July 2018 at 18:39, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as
of f61988d16.

Did you try make clean then building again? The 0001 patch does
change PartitionDescData, so if you've not rebuilt all .c files which
use that then that might explain your crash.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#5 Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: David Rowley (#1)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi David,

On 06/22/2018 02:28 AM, David Rowley wrote:

I've attached 2 patches:

0001: implements items 1-6
0002: Is not intended for commit. It's just a demo of where we could get
the performance if we were smarter about locking partitions. I've just
included this to show 0001's worth.

I did some tests with a 64 hash partition setup, and see a speedup for
INSERT / UPDATE scenarios.

I don't want to discuss the locking on this thread. That discussion
will detract from discussing what I'm proposing here... Which is not
to change anything relating to locks. I'm still working on that and
will post elsewhere.

With 0002 INSERTs get close to the TPS of the non-partitioned case.
However, UPDATEs don't see the same speedup. But, as you said, a
discussion for another thread.

Best regards,
Jesper

#6 David Rowley
david.rowley@2ndquadrant.com
In reply to: Jesper Pedersen (#5)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 6 July 2018 at 01:18, Jesper Pedersen <jesper.pedersen@redhat.com> wrote:

With 0002 INSERTs get close to the TPS of the non-partitioned case. However,
UPDATEs don't see the same speedup. But, as you said, a discussion for
another thread.

Hi Jesper,

Thanks for testing this out. It was only really the locking I didn't
want to discuss here, due to the risk that discussion of removing the
other overheads would get lost in discussions about locking.

It's most likely that the UPDATE is slower due to the planner still
being quite slow when dealing with partitioned tables. It still builds
RangeTblEntry and RelOptInfo objects for each partition even if the
partition is pruned. With INSERT with a VALUES clause, the planner
does not build these objects; in fact, the planner barely does any
work at all, so this can be sped up just by removing the executor
overheads.

(I do have other WIP patches to speed up the planner. After delaying
building the RelOptInfo and RangeTblEntry, with my 10k partition
setup, the planner SELECT became 1600 times faster. UPDATE did not
finish in the unpatched version, so gains there are harder to measure.
There's still much work to do on these patches, and there's still more
performance to squeeze out too. Hopefully, I'll be discussing this on
another thread soon.)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#7 Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: David Rowley (#4)
RE: Speeding up INSERTs and UPDATEs to partitioned tables

Thanks David!

I benchmarked with pgbench and see a speedup for INSERT / UPDATE scenarios.
I used range partitioning.

Benchmark results are as follows.

1. 11beta2 result

 part_num |   tps_ex   | latency_avg | update_latency | select_latency | insert_latency
----------+------------+-------------+----------------+----------------+----------------
      100 | 479.456278 |       2.086 |          1.382 |          0.365 |          0.168
      200 | 169.155411 |       5.912 |          4.628 |          0.737 |          0.299
      400 |  24.857495 |       40.23 |         36.606 |          2.252 |          0.881
      800 |   6.718104 |     148.853 |        141.471 |          5.253 |          1.433
     1600 |   1.934908 |     516.825 |        489.982 |         21.102 |          3.701
     3200 |   0.456967 |    2188.362 |       2101.247 |         72.784 |          8.833
     6400 |   0.116643 |    8573.224 |        8286.79 |        257.904 |         14.949

2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264
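
That is, compared with the 11beta2 numbers above, tps_ex improves by
roughly 2.5x at 100 partitions and roughly 12.5x at 6400 partitions
with the two patches applied.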

benchmark script:

\set aid random(1, 100 * 1)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM test.accounts WHERE aid = :aid;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;

partition key is aid.

-----Original Message-----
From: David Rowley [mailto:david.rowley@2ndquadrant.com]
Sent: Thursday, July 05, 2018 6:19 PM
To: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>
Cc: PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 5 July 2018 at 18:39, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

postgres=# create table a(i int) partition by range(i);
CREATE TABLE
postgres=# create table a_1 partition of a for values from(1) to (200);
CREATE TABLE
postgres=# create table a_2 partition of a for values from(200) to (400);
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

Hi,

Thanks for testing. I'm unable to reproduce this on beta2 or master as of f61988d16.

Did you try make clean then building again? The 0001 patch does change PartitionDescData, so if you've not rebuilt all .c files which use that then that might explain your crash.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#8 David Rowley
david.rowley@2ndquadrant.com
In reply to: Kato, Sho (#7)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Hi,

Thanks a lot for benchmarking this.

Just a note to say that the "Allow direct lookups of AppendRelInfo by
child relid" patch is already in master. It's much more relevant to be
testing with master than pg11. This patch is not intended for pg11.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#8)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Jul-11, David Rowley wrote:

On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Just a note to say that the "Allow direct lookups of AppendRelInfo by
child relid" patch is already in master. It's much more relevant to be
testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2. So we still
don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#1)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi David.

On 2018/06/22 15:28, David Rowley wrote:

Hi,

As part of my efforts to make partitioning scale better for larger
numbers of partitions, I've been looking at primarily INSERT VALUES
performance. Here the overheads are almost completely in the
executor. Planning of this type of statement is very simple since
there is no FROM clause to process.

Thanks for this effort.

My benchmarks have been around a RANGE partitioned table with 10k leaf
partitions and no sub-partitioned tables. The partition key is a
timestamp column.

I've found that ExecSetupPartitionTupleRouting() is very slow indeed
and there are a number of things slow about it. The biggest culprit
for the slowness is the locking of each partition inside of
find_all_inheritors().

Yes. :-(

For now, this needs to remain as we must hold
locks on each partition while performing RelationBuildPartitionDesc(),
otherwise, one of the partitions may get dropped out from under us.

We lock all partitions using find_all_inheritors to keep locking order
consistent with other sites that may want to lock tables in the same
partition tree but with a possibly conflicting lock mode. If we remove
the find_all_inheritors call in ExecSetupPartitionPruneState (like your
0002 does), we may end up locking partitions in arbitrary order in a given
transaction, because input tuples will be routed to various partitions in
an order that's not predetermined.

But maybe it's not necessary to be that paranoid.  If we've locked the
parent, any concurrent lockers would have to wait for the lock on the
parent anyway, so it doesn't matter in which order tuple routing locks
the partitions.

The locking is not the only slow thing. I found the following to also be slow:

1. RelationGetPartitionDispatchInfo uses a List and lappend() must
perform a palloc() each time a partition is added to the list.
2. A foreach loop is performed over leaf_parts to search for subplans
belonging to this partition. This seems pointless to do for INSERTs as
there's never any to find.
3. ExecCleanupTupleRouting() loops through the entire partitions
array. If a single tuple was inserted then all but one of the elements
will be NULL.
4. Tuple conversion map allocates an empty array thinking there might
be something to put into it. This is costly when the array is large
and pointless when there are no maps to store.
5. During get_partition_dispatch_recurse(), get_rel_relkind() is
called to determine if the partition is a partitioned table or a leaf
partition. This results in a slow relcache hashtable lookup.
6. get_partition_dispatch_recurse() also ends up just building the
indexes array with a sequence of numbers from 0 to nparts - 1 if there
are no sub-partitioned tables. Doing this is slow when there are many
partitions.

Besides the locking, the only thing that remains slow now is the
palloc0() for the 'partitions' array. In my test, it takes 0.6% of
execution time. I don't see any pretty ways to fix that.

I've written fixes for items 1-6 above.

I did:

1. Use an array instead of a List.
2. Don't do this loop. palloc0() the partitions array instead. Let
UPDATE add whatever subplans exist to the zeroed array.
3. Track what we initialize in a gapless array and cleanup just those
ones. Make this array small and increase it only when we need more
space.
4. Only allocate the map array when we need to store a map.
5. Work that out in relcache beforehand.
6. ditto

The issues you list all seem legitimate to me, as do your proposed fixes
for each, except I think we could go a bit further.

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?
ISTM, there is no need to determine the exact number of leaf partitions
and partitioned tables in the partition tree and allocate the arrays in
PartitionTupleRouting based on that. I know that the indexes array in
PartitionDispatchData contains mapping from local partition indexes (0 to
partdesc->nparts - 1) to those that span *all* leaf partitions and *all*
partitioned tables (0 to proute->num_partitions or proute->num_dispatch)
in a partition tree, but we can change that.

The idea I had was inspired by looking at the partitions_init stuff in your
patch. We could allocate proute->partition_dispatch_info and
proute->partitions arrays to be of a predetermined size, which doesn't
require us to calculate the exact number of leaf partitions and
partitioned tables beforehand. So, RelationGetPartitionDispatchInfo need
not recursively go over all of the partition tree. Instead we create just
one PartitionDispatch object of the root parent table, whose indexes array
is initialized to -1, meaning the corresponding partition has not been
encountered yet.  In ExecFindPartition, once tuple routing chooses a
partition, we create either a ResultRelInfo (if leaf) or a
PartitionDispatch for it and store it in the next free slot of
proute->partitions or proute->partition_dispatch_info, respectively.
Also, we update the indexes array in the parent's PartitionDispatch,
replacing the -1 with that slot's index, so that future tuples routed
to that partition don't allocate it again.  The process is repeated if
the tuple needs to be
routed one more level down. If the query needs to allocate more
ResultRelInfos and/or PartitionDispatch objects than we initially
allocated space for, we expand those arrays. Finally, during
ExecCleanupTupleRouting, we only "clean up" the partitions that we
allocated ResultRelInfos and PartitionDispatch objects for, which is very
similar to the partitions_init idea in your patch.

I implemented that idea in the attached patch, which applies on top of
your 0001 patch, but I'd say it's too big to be just called a delta. I
was able to get the following performance numbers using this pgbench
test:

$ pgbench -n -T 180 -f insert-ht.sql
$ cat insert-ht.sql
\set b random(1, 1000)
\set a random(1, 1000)
insert into ht values (:b, :a);
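
The definition of ht isn't shown here, so the following is only a sketch
of a setup consistent with the "2500 hash parts (no subpart)" case; the
column order is taken from the INSERT above and everything else is
assumed.

-- Hypothetical schema; not the actual test script.
CREATE TABLE ht (b int, a int) PARTITION BY HASH (b);

DO $$
BEGIN
  FOR i IN 0..2499 LOOP
    EXECUTE format('CREATE TABLE ht_%s PARTITION OF ht
                    FOR VALUES WITH (MODULUS 2500, REMAINDER %s)', i, i);
  END LOOP;
END $$;

For the "4 hash subparts each" runs, each ht_N would instead be created
with PARTITION BY HASH (a) and given four hash sub-partitions of its own.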

Note that pgbench is run 3 times and every tps result is listed below.

HEAD - 0 parts (unpartitioned table)
tps = 2519.603076 (including connections establishing)
tps = 2486.903189 (including connections establishing)
tps = 2518.771182 (including connections establishing)

HEAD - 2500 hash parts (no subpart)
tps = 13.158224 (including connections establishing)
tps = 12.940713 (including connections establishing)
tps = 12.882465 (including connections establishing)

David - 2500 hash parts (no subpart)
tps = 18.717628 (including connections establishing)
tps = 18.602280 (including connections establishing)
tps = 18.945527 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 18.576858 (including connections establishing)
tps = 18.431902 (including connections establishing)
tps = 18.797023 (including connections establishing)

HEAD - 2500 hash parts (4 hash subparts each)
tps = 2.339332 (including connections establishing)
tps = 2.339582 (including connections establishing)
tps = 2.317037 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 3.225044 (including connections establishing)
tps = 3.214053 (including connections establishing)
tps = 3.239591 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 3.321918 (including connections establishing)
tps = 3.305952 (including connections establishing)
tps = 3.301036 (including connections establishing)

Applying the lazy locking patch on top of David's and my patch,
respectively, produces the following results.

David - 2500 hash parts (no subpart)
tps = 1577.854360 (including connections establishing)
tps = 1532.681499 (including connections establishing)
tps = 1464.254096 (including connections establishing)

Amit - 2500 hash parts (no subpart)
tps = 1532.475751 (including connections establishing)
tps = 1534.650325 (including connections establishing)
tps = 1527.840837 (including connections establishing)

David - 2500 hash parts (4 hash subparts each)
tps = 78.845916 (including connections establishing)
tps = 79.167079 (including connections establishing)
tps = 79.621686 (including connections establishing)

Amit - 2500 hash parts (4 hash subparts each)
tps = 329.887008 (including connections establishing)
tps = 327.428103 (including connections establishing)
tps = 326.863248 (including connections establishing)

About the last two results: after getting rid of the time-hogging
find_all_inheritors() call in ExecSetupPartitionTupleRouting that locks
all the partitions, it seems that without my patch we'll end up spending
most of the time in RelationGetPartitionDispatchInfo(), because it will
call get_partition_dispatch_recurse() for each of the 2500 first-level
partitions that are themselves partitioned.  With my patch, we won't do
that, and so won't end up generating 2499 PartitionDispatch objects that
are not needed for a single-row insert statement.

Thanks,
Amit

Attachments:

david-0001-delta.patch (text/plain)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 25bec76c1d..44cf3bba12 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2621,10 +2621,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2644,10 +2642,8 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = ExecGetPartitionInfo(mtstate,
-												 saved_resultRelInfo,
-												 proute, estate,
-												 leaf_part_index);
+			Assert(proute->partitions[leaf_part_index] != NULL);
+			resultRelInfo = proute->partitions[leaf_part_index];
 
 			/*
 			 * For ExecInsertIndexTuples() to work on the partition's indexes
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a3a67dd0d..250c2cd53e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,17 +31,19 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
-static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+#define PARTITION_ROUTING_INITSIZE	8
+#define PARTITION_ROUTING_MAXSIZE		65536
+
+static void ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
+								 PartitionTupleRouting *proute,
+								 PartitionDispatch pd);
+static void ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *resultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, Oid **leaf_part_oids,
-								 int *n_leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, Oid **leaf_part_oids,
-							   int *n_leaf_part_oids,
-							   int *leaf_part_oid_size);
+					  EState *estate, Oid partoid,
+					  int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+						Oid partoid, Relation parent, int dispatchidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -68,127 +70,58 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for UPDATE
- * tuple routing, the caller will have already initialized ResultRelInfo's for
- * each partition present in the ModifyTable's subplans. These are reused and
- * assigned to their respective slot in the aforementioned array.  For such
- * partitions, we delay setting up objects such as TupleConversionMap until
- * those are actually chosen as the partitions to route tuples to.  See
- * ExecPrepareTupleRouting.
+ * This is called during the initialization of a COPY FROM command or of a
+ * INSERT/UPDATE query.  We provisionally allocate space to hold
+ * PARTITION_ROUTING_INITSIZE number of PartitionDispatch and ResultRelInfo
+ * pointers in their respective arrays.  The arrays will be doubled in
+ * size via repalloc (subject to the limit of PARTITION_ROUTING_MAXSIZE
+ * entries at most) if and when we run out of space, as more partitions need
+ * to be added.  Since we already have the root parent open, its
+ * PartitionDispatch is created here.
+ *
+ * PartitionDispatch object of a non-root partitioned table or ResultRelInfo
+ * of a leaf partition is allocated and added to the respective array when
+ * it is encountered for the first time in ExecFindPartition.  As mentioned
+ * above, we might need to expand the respective array before storing it.
+ *
+ * Tuple conversion maps (either child to parent and/or vice versa) and the
+ * array(s) to hold them are allocated only if needed.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	int			i;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &proute->partition_oids, &nparts);
 
-	proute->num_partitions = nparts;
-	proute->partitions =
-		(ResultRelInfo **) palloc0(nparts * sizeof(ResultRelInfo *));
+	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+			palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+
+	/* Initialize this table's PartitionDispatch object. */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
+	proute->num_dispatch = 1;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+	proute->partitions = (ResultRelInfo **)
+			palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
 
 	/*
-	 * Allocate an array to store ResultRelInfos that we'll later allocate.
-	 * It is common that not all partitions will have tuples routed to them,
-	 * so we'll refrain from allocating enough space for all partitions here.
-	 * Let's just start with something small and make it bigger only when
-	 * needed.  Storing these separately rather than relying on the
-	 *'partitions' array allows us to quickly identify which ResultRelInfos we
-	 * must teardown at the end.
+	 * Check if we can use ResultRelInfos set up by ExecInitModifyTable as
+	 * target result rels of an UPDATE as also the target result rels of tuple
+	 * routing.  Note that we consider for now only the root parent's own leaf
+	 * partitions, that is, leaf partitions of level 1 and none of the leaf
+	 * partitions of the levels below.
 	 */
-	proute->partitions_init_size = Min(nparts, 8);
-
-	proute->partitions_init = (ResultRelInfo **)
-		palloc(proute->partitions_init_size * sizeof(ResultRelInfo *));
-
-	proute->num_partitions_init = 0;
-
-	/* We only allocate this when we need to store the first non-NULL map */
-	proute->parent_child_tupconv_maps = NULL;
-
-	proute->child_parent_tupconv_maps = NULL;
-
-
-	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
-	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
-
-	/* Set up details specific to the type of tuple routing we are doing. */
 	if (node && node->operation == CMD_UPDATE)
-	{
-		ResultRelInfo *update_rri = NULL;
-		int			num_update_rri = 0,
-					update_rri_index = 0;
-
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-
-		for (i = 0; i < nparts; i++)
-		{
-			Oid			leaf_oid = proute->partition_oids[i];
-
-			/*
-			 * If the leaf partition is already present in the per-subplan
-			 * result rels, we re-use that rather than initialize a new result
-			 * rel. The per-subplan resultrels and the resultrels of the leaf
-			 * partitions are both in the same canonical order. So while going
-			 * through the leaf partition oids, we need to keep track of the
-			 * next per-subplan result rel to be looked for in the leaf
-			 * partition resultrels.
-			 */
-			if (update_rri_index < num_update_rri &&
-				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-			{
-				ResultRelInfo *leaf_part_rri;
-
-				leaf_part_rri = &update_rri[update_rri_index];
-
-				/*
-				 * This is required in order to convert the partition's tuple
-				 * to be compatible with the root partitioned table's tuple
-				 * descriptor.  When generating the per-subplan result rels,
-				 * this was not set.
-				 */
-				leaf_part_rri->ri_PartitionRoot = rel;
-
-				/* Remember the subplan offset for this ResultRelInfo */
-				proute->subplan_partition_offsets[update_rri_index] = i;
-
-				update_rri_index++;
-
-				proute->partitions[i] = leaf_part_rri;
-			}
-		}
-
-		/*
-		 * We should have found all the per-subplan resultrels in the leaf
-		 * partitions.
-		 */
-		Assert(update_rri_index == num_update_rri);
-	}
+		ExecUseUpdateResultRelForRouting(mtstate,
+										 proute,
+										 proute->partition_dispatch_info[0]);
 	else
 	{
 		proute->root_tuple_slot = NULL;
@@ -196,26 +129,38 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 		proute->num_subplan_partition_offsets = 0;
 	}
 
+	/* We only allocate this when we need to store the first non-NULL map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+
+	/*
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
+	 */
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
+	int			result = -1;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
@@ -272,10 +217,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		 * partitions to begin with.
 		 */
 		if (partdesc->nparts == 0)
-		{
-			result = -1;
 			break;
-		}
 
 		cur_index = get_partition_for_tuple(rel, values, isnull);
 
@@ -285,17 +227,71 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		 * next parent to find a partition of.
 		 */
 		if (cur_index < 0)
-		{
-			result = -1;
 			break;
-		}
-		else if (parent->indexes[cur_index] >= 0)
+
+		if (partdesc->is_leaf[cur_index])
 		{
-			result = parent->indexes[cur_index];
+			/* Get the ResultRelInfo of this leaf partition. */
+			if (parent->indexes[cur_index] >= 0)
+			{
+				/*
+				 * Already assigned (either created fresh or reused from the
+				 * set of UPDATE result rels.)
+				 */
+				Assert(parent->indexes[cur_index] < proute->num_partitions);
+				result = parent->indexes[cur_index];
+			}
+			else if (node && node->operation == CMD_UPDATE &&
+					 !parent->scanned_update_result_rels)
+			{
+				/* Try to assign UPDATE result rels for tuple routing. */
+				ExecUseUpdateResultRelForRouting(mtstate, proute, parent);
+
+				/* Check if we really found one. */
+				if (parent->indexes[cur_index] >= 0)
+				{
+					Assert(parent->indexes[cur_index] < proute->num_partitions);
+					result = parent->indexes[cur_index];
+				}
+			}
+
+			/* We need to create one afresh. */
+			if (result < 0)
+			{
+				result = proute->num_partitions++;
+				parent->indexes[cur_index] = result;
+				if (parent->indexes[cur_index] >= PARTITION_ROUTING_MAXSIZE)
+					elog(ERROR, "invalid partition index: %u",
+						 parent->indexes[cur_index]);
+				ExecInitPartitionInfo(mtstate, resultRelInfo,
+									  proute, estate,
+									  partdesc->oids[cur_index], result);
+			}
 			break;
 		}
 		else
-			parent = pd[-parent->indexes[cur_index]];
+		{
+			/* Get the PartitionDispatch of this parent. */
+			if (parent->indexes[cur_index] >= 0)
+			{
+				/* Already allocated. */
+				Assert(parent->indexes[cur_index] < proute->num_dispatch);
+				parent = pd[parent->indexes[cur_index]];
+			}
+			else
+			{
+				/* Not yet, allocate one. */
+				parent->indexes[cur_index] = proute->num_dispatch++;
+				if (parent->indexes[cur_index] >= PARTITION_ROUTING_MAXSIZE)
+					elog(ERROR, "invalid partition index: %u",
+						 parent->indexes[cur_index]);
+				parent =
+					ExecInitPartitionDispatchInfo(proute,
+												  partdesc->oids[cur_index],
+												  rel,
+												  parent->indexes[cur_index]);
+			}
+		}
 	}
 
 	/* A partition was not found. */
@@ -318,64 +314,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
- * ExecGetPartitionInfo
- *		Fetch ResultRelInfo for partidx
- *
- * Sets up ResultRelInfo, if not done already.
+ * ExecUseUpdateResultRelForRouting
+ *		Checks if any of the ResultRelInfo's created by ExecInitModifyTable
+ *		as target result rels for an UPDATE belong to a given parent table's
+ *		partitions, and if so, stores their pointers in proute so that they
+ *		can be used hereon as targets of tuple routing
  */
-ResultRelInfo *
-ExecGetPartitionInfo(ModifyTableState *mtstate,
-					 ResultRelInfo *resultRelInfo,
-					 PartitionTupleRouting *proute,
-					 EState *estate, int partidx)
+static void
+ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
+								 PartitionTupleRouting *proute,
+								 PartitionDispatch pd)
 {
-	ResultRelInfo *result = proute->partitions[partidx];
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	Relation		rootrel  = proute->partition_root;
+	PartitionDesc	partdesc = RelationGetPartitionDesc(pd->reldesc);
+	ResultRelInfo  *update_rri = NULL;
+	int				num_update_rri = 0,
+					my_part_index = 0;
+	int				i;
 
-	if (result)
-		return result;
+	/* We should be here only once for a given parent table. */
+	Assert(!pd->scanned_update_result_rels);
 
-	result = ExecInitPartitionInfo(mtstate,
-								   resultRelInfo,
-								   proute,
-								   estate,
-								   partidx);
-	Assert(result);
+	update_rri = mtstate->resultRelInfo;
+	num_update_rri = list_length(node->plans);
 
-	proute->partitions[partidx] = result;
-
-	/*
-	 * Record the ones setup so far in setup order.  This makes the cleanup
-	 * operation more efficient when very few have been setup.
-	 */
-	if (proute->num_partitions_init == proute->partitions_init_size)
+	/* If here for the first time, initialize necessary data structures. */
+	if (proute->subplan_partition_offsets == NULL)
 	{
-		/* First allocate more space if the array is not large enough */
-		proute->partitions_init_size =
-			Min(proute->partitions_init_size * 2, proute->num_partitions);
-
-		proute->partitions_init = (ResultRelInfo **)
-				repalloc(proute->partitions_init,
-				proute->partitions_init_size * sizeof(ResultRelInfo *));
+		proute->subplan_partition_offsets = palloc(num_update_rri * sizeof(int));
+		memset(proute->subplan_partition_offsets, -1,
+			   num_update_rri * sizeof(int));
+		proute->num_subplan_partition_offsets = num_update_rri;
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
 	}
 
-	proute->partitions_init[proute->num_partitions_init++] = result;
+	/*
+	 * Go through UPDATE result rels and note down those that belong to
+	 * this table's partitions.
+	 */
+	for (i = 0; i < num_update_rri; i++)
+	{
+		Relation	update_rel = update_rri[i].ri_RelationDesc;
+		Oid			leaf_oid = partdesc->oids[my_part_index];
 
-	Assert(proute->num_partitions_init <= proute->num_partitions);
+		/*
+		 * Skip UPDATE result rels that correspond to leaf partitions of lower
+		 * levels.  They will be acquired via PartitionDispatch of their own
+		 * parents, if needed.
+		 */
+		while (RelationGetRelid(update_rel) != leaf_oid &&
+			   my_part_index < partdesc->nparts)
+			leaf_oid = partdesc->oids[++my_part_index];
 
-	return result;
+		if (RelationGetRelid(update_rel) == leaf_oid)
+		{
+			ResultRelInfo *leaf_part_rri;
+
+			leaf_part_rri = &update_rri[i];
+
+			/*
+			 * This is required in order to convert the partition's tuple
+			 * to be compatible with the root partitioned table's tuple
+			 * descriptor.  When generating the per-subplan result rels,
+			 * this was not set.
+			 */
+			leaf_part_rri->ri_PartitionRoot = rootrel;
+
+			/*
+			 * Remember the index of this UPDATE result rel in the tuple
+			 * routing partition array.
+			 */
+			proute->subplan_partition_offsets[i] = proute->num_partitions;
+
+			/*
+			 * Also, record in PartitionDispatch that we have a valid
+			 * ResultRelInfo for this partition.
+			 */
+
+			Assert(pd->indexes[my_part_index] == -1);
+			pd->indexes[my_part_index] = proute->num_partitions;
+			proute->partitions[proute->num_partitions++] = leaf_part_rri;
+			my_part_index++;
+		}
+
+		if (my_part_index >= partdesc->nparts)
+			break;
+	}
+
+	/*
+	 * Set that we have checked and reused all UPDATE result rels that we
+	 * found for this parent.
+	 */
+	pd->scanned_update_result_rels = true;
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
  *
- * Returns the ResultRelInfo
+ * This also stores it in the proute->partitions array at the specified index
+ * ('partidx'), possibly expanding the array if there isn't space left in it.
  */
-static ResultRelInfo *
+static void
 ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *resultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate, Oid partoid,
+					  int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
 	Relation	rootrel = resultRelInfo->ri_RelationDesc,
@@ -390,7 +436,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(partoid, NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -729,12 +775,20 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
+	if (partidx >= proute->partitions_allocsize)
+	{
+		/* Expand allocated place. */
+		proute->partitions_allocsize =
+			Min(proute->partitions_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+		proute->partitions = (ResultRelInfo **)
+			repalloc(proute->partitions,
+					 sizeof(ResultRelInfo *) * proute->partitions_allocsize);
+	}
+
+	/* Save here for later use. */
 	proute->partitions[partidx] = leaf_part_rri;
 
 	MemoryContextSwitchTo(oldContext);
-
-	return leaf_part_rri;
 }
 
 /*
@@ -766,10 +820,26 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	if (map)
 	{
+		int		new_size;
+
 		/* Allocate parent child map array only if we need to store a map */
-		if (!proute->parent_child_tupconv_maps)
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			proute->parent_child_tupconv_maps_allocsize = new_size =
+				PARTITION_ROUTING_INITSIZE;
 			proute->parent_child_tupconv_maps = (TupleConversionMap **)
-				palloc0(proute->num_partitions * sizeof(TupleConversionMap *));
+				palloc0(sizeof(TupleConversionMap *) * new_size);
+		}
+		/* We may have run out of the initially allocated space. */
+		else if (partidx >= proute->parent_child_tupconv_maps_allocsize)
+		{
+			proute->parent_child_tupconv_maps_allocsize = new_size =
+				Min(proute->parent_child_tupconv_maps_allocsize * 2,
+					PARTITION_ROUTING_MAXSIZE);
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				repalloc(proute->parent_child_tupconv_maps,
+						 sizeof(TupleConversionMap *) * new_size);
+		}
 
 		proute->parent_child_tupconv_maps[partidx] = map;
 	}
@@ -788,6 +858,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid,
+							  Relation parent,
+							  int dispatchidx)
+{
+	Relation	rel;
+	TupleDesc	tupdesc;
+	PartitionDesc partdesc;
+	PartitionKey partkey;
+	PartitionDispatch pd;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	tupdesc = RelationGetDescr(rel);
+	partdesc = RelationGetPartitionDesc(rel);
+	partkey = RelationGetPartitionKey(rel);
+
+	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+	pd->reldesc = rel;
+	pd->key = partkey;
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent != NULL)
+	{
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
+											tupdesc,
+											gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
+
+	pd->indexes = (int *) palloc(sizeof(int) * partdesc->nparts);
+
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+	pd->scanned_update_result_rels = false;
+
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize =
+			Min(proute->dispatch_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
+
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	return pd;
+}
+
+/*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
  *
@@ -805,13 +957,14 @@ ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
+	proute->child_parent_tupconv_maps_allocsize = PARTITION_ROUTING_INITSIZE;
 	proute->child_parent_tupconv_maps =
 		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+										PARTITION_ROUTING_INITSIZE);
 
 	/* Same is the case for this array. All the values are set to false */
 	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+		(bool *) palloc0(sizeof(bool) * PARTITION_ROUTING_INITSIZE);
 }
 
 /*
@@ -826,8 +979,9 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/* If nobody else set up the per-leaf maps array, do so ourselves. */
+	if (proute->child_parent_tupconv_maps == NULL)
+		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
 	if (proute->child_parent_map_not_required[leaf_index])
@@ -846,6 +1000,30 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 							   gettext_noop("could not convert row type"));
 
 	/* If it turns out no map is needed, remember for next time. */
+
+	/* We may have run out of the initially allocated space. */
+	if (leaf_index >= proute->child_parent_tupconv_maps_allocsize)
+	{
+		int		new_size,
+				old_size;
+
+		old_size = proute->child_parent_tupconv_maps_allocsize;
+		proute->child_parent_tupconv_maps_allocsize = new_size =
+			Min(old_size * 2,
+				PARTITION_ROUTING_MAXSIZE);
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(proute->child_parent_tupconv_maps + old_size, 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(proute->child_parent_map_not_required + old_size, false,
+			   sizeof(bool) * (new_size - old_size));
+	}
+
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
@@ -909,9 +1087,9 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
-	for (i = 0; i < proute->num_partitions_init; i++)
+	for (i = 0; i < proute->num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = proute->partitions_init[i];
+		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
@@ -920,6 +1098,28 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
 														   resultRelInfo);
 
+		/*
+		 * Check if this result rel is one of the UPDATE subplan result rels;
+		 * if so, let ExecEndPlan() close it.
+		 */
+		if (proute->subplan_partition_offsets)
+		{
+			int		j;
+			bool	found = false;
+
+			for (j = 0; j < proute->num_subplan_partition_offsets; j++)
+			{
+				if (proute->subplan_partition_offsets[j] == i)
+				{
+					found = true;
+					break;
+				}
+			}
+
+			if (found)
+				continue;
+		}
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
@@ -931,211 +1131,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns an array of PartitionDispatch as is required for routing
- *		tuples to the correct partition.
- *
- * 'num_parted' is set to the size of the returned array and the
- *'leaf_part_oids' array is allocated and populated with each leaf partition
- * Oid in the hierarchy. 'n_leaf_part_oids' is set to the size of that array.
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, Oid **leaf_part_oids,
-								 int *n_leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-	int			leaf_part_oid_size;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*n_leaf_part_oids = 0;
-
-	leaf_part_oid_size = 0;
-	*leaf_part_oids = NULL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids,
-								   n_leaf_part_oids, &leaf_part_oid_size);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we populate
- * '*pds' with PartitionDispatch objects of each partitioned table we find,
- * and populate leaf_part_oids with each leaf partition OID found.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- *
- * Note: Callers must not attempt to pfree the 'leaf_part_oids' array.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, Oid **leaf_part_oids,
-							   int *n_leaf_part_oids,
-							   int *leaf_part_oid_size)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-	int			nparts;
-	int			oid_array_used;
-	int			oid_array_size;
-	Oid		   *oid_array;
-	Oid		   *partdesc_oids;
-	bool	   *partdesc_subpartitions;
-	int		   *indexes;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-
-		/*
-		 * If the parent has no sub partitions then we can skip calculating
-		 * all the leaf partitions and just return all the oids at this level.
-		 * In this case, the indexes were also pre-calculated for us by the
-		 * syscache code.
-		 */
-		if (!partdesc->hassubpart)
-		{
-			*leaf_part_oids = partdesc->oids;
-			/* XXX or should we memcpy this out of syscache? */
-			pd->indexes = partdesc->indexes;
-			*n_leaf_part_oids = partdesc->nparts;
-			return;
-		}
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	oid_array_used = *n_leaf_part_oids;
-	oid_array_size = *leaf_part_oid_size;
-	oid_array = *leaf_part_oids;
-	nparts = partdesc->nparts;
-
-	if (!oid_array)
-	{
-		oid_array_size = *leaf_part_oid_size = nparts;
-		*leaf_part_oids = (Oid *) palloc(sizeof(Oid) * nparts);
-		oid_array = *leaf_part_oids;
-	}
-
-	partdesc_oids = partdesc->oids;
-	partdesc_subpartitions = partdesc->subpartitions;
-
-	pd->indexes = indexes = (int *) palloc(nparts * sizeof(int));
-
-	for (i = 0; i < nparts; i++)
-	{
-		Oid			partrelid = partdesc_oids[i];
-
-		if (!partdesc_subpartitions[i])
-		{
-			if (oid_array_size <= oid_array_used)
-			{
-				oid_array_size *= 2;
-				oid_array = (Oid *) repalloc(oid_array,
-											 sizeof(Oid) * oid_array_size);
-			}
-
-			oid_array[oid_array_used] = partrelid;
-			indexes[i] = oid_array_used++;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			*n_leaf_part_oids = oid_array_used;
-			*leaf_part_oid_size = oid_array_size;
-			*leaf_part_oids = oid_array;
-
-			indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids,
-										   n_leaf_part_oids, leaf_part_oid_size);
-
-			oid_array_used = *n_leaf_part_oids;
-			oid_array_size = *leaf_part_oid_size;
-			oid_array = *leaf_part_oids;
-		}
-	}
-
-	*n_leaf_part_oids = oid_array_used;
-	*leaf_part_oid_size = oid_array_size;
-	*leaf_part_oids = oid_array;
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 07b5f968aa..8b671c6426 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1666,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,15 +1708,12 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
 	/* Get the ResultRelInfo corresponding to the selected partition. */
-	partrel = ExecGetPartitionInfo(mtstate, targetRelInfo, proute, estate,
-								   partidx);
+	Assert(proute->partitions[partidx] != NULL);
+	partrel = proute->partitions[partidx];
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1825,17 +1821,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			i;
 
 	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
-	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
 	 * conversion is necessary, which is hopefully a common case.
@@ -1857,78 +1842,17 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 }
 
 /*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
-/*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
-
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index b36b7366e5..aa82aa52eb 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,7 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
-		result->subpartitions = (bool *) palloc(nparts * sizeof(bool));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -775,7 +775,6 @@ RelationBuildPartitionDesc(Relation rel)
 		}
 
 		result->boundinfo = boundinfo;
-		result->hassubpart = false; /* unless we discover otherwise below */
 
 		/*
 		 * Now assign OIDs from the original array into mapped indexes of the
@@ -786,33 +785,13 @@ RelationBuildPartitionDesc(Relation rel)
 		for (i = 0; i < nparts; i++)
 		{
 			int			index = mapping[i];
-			bool		subpart;
 
 			result->oids[index] = oids[i];
-
-			subpart = (get_rel_relkind(oids[i]) == RELKIND_PARTITIONED_TABLE);
 			/* Record if the partition is a subpartitioned table */
-			result->subpartitions[index] = subpart;
-			result->hassubpart |= subpart;
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
 		}
 
-		/*
-		 * If there are no subpartitions then we can pre-calculate the
-		 * PartitionDispatch->indexes array.  Doing this here saves quite a
-		 * bit of overhead on simple queries which perform INSERTs or UPDATEs
-		 * on partitioned tables with many partitions.  The pre-calculation is
-		 * very simple.  All we need to store is a sequence of numbers from 0
-		 * to nparts - 1.
-		 */
-		if (!result->hassubpart)
-		{
-			result->indexes = (int *) palloc(nparts * sizeof(int));
-			for (i = 0; i < nparts; i++)
-				result->indexes[i] = i;
-		}
-		else
-			result->indexes = NULL;
-
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a8c69ff224..8d20469c98 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,18 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs array of 'nparts' of partitions in
-								 * partbound order */
-	int		   *indexes;		/* Stores index for corresponding 'oids'
-								 * element for use in tuple routing, or NULL
-								 * if hassubpart is true.
-								 */
-	bool	   *subpartitions;	/* Array of 'nparts' set to true if the
-								 * corresponding 'oids' element belongs to a
-								 * sub-partitioned table.
-								 */
-	bool		hassubpart;		/* true if any oid belongs to a
-								 * sub-partitioned table */
+	Oid		   *oids;			/* Array of length 'nparts' containing
+								 * partition OIDs in order of their
+								 * bounds */
+	bool	   *is_leaf;		/* Array of length 'nparts' containing whether
+								 * a partition is a leaf partition */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 822f66f5e2..f284bc2d81 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -45,77 +45,98 @@ typedef struct PartitionDispatchData
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
 	int		   *indexes;
+	bool		scanned_update_result_rels;
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * partitions_init				Array of ResultRelInfo* objects in the order
- *								that they were lazily initialized.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * num_partitions_init			Number of leaf partition lazily setup so far.
- * partitions_init_size			Size of partitions_init array.
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done). Remains NULL if no maps to store.
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	/*
+	 * Root table, that is, the table mentioned in the INSERT/UPDATE query or
+	 * COPY FROM command.
+	 */
+	Relation	partition_root;
+
+	/*
+	 * Contains PartitionDispatch objects for every partitioned table touched
+	 * by tuple routing.  The entry for the root partitioned table is *always*
+	 * present as the first entry of this array.  'num_dispatch' is the
+	 * number of existing entries and also serves as the index of the next
+	 * entry to be allocated.  'dispatch_allocsize' (>= 'num_dispatch') is the
+	 * number of entries that can be stored in the array, before needing to
+	 * reallocate more space.  See ExecInitPartitionDispatchInfo().
+	 */
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
+
+	/*
+	 * Contains pointers to the ResultRelInfos of all leaf partitions touched by
+	 * tuple routing.  Some of these are pointers to "reused" ResultRelInfos,
+	 * that is, those that are created and destroyed outside execPartition.c,
+	 * for example, when tuple routing is used for UPDATE queries that modify
+	 * partition key; see ExecUseUpdateResultRelForRouting().  The rest are
+	 * pointers to ResultRelInfos managed by execPartition.c itself; see
+	 * ExecInitPartitionInfo() and ExecCleanupTupleRouting().
+	 */
 	ResultRelInfo **partitions;
-	ResultRelInfo **partitions_init;
 	int			num_partitions;
-	int			num_partitions_init;
-	int			partitions_init_size;
+	int			partitions_allocsize;
+
+	/*
+	 * Contains information to convert tuples of the root parent's rowtype to
+	 * those of the leaf partitions' rowtype, but only for those partitions
+	 * whose TupleDescs are physically different from the root parent's.  If
+	 * none of the partitions has such a differing TupleDesc, then the
+	 * following array is NULL.  If non-NULL, it is of the same size as the
+	 * 'partitions' array above, to be able to use the same array index.
+	 * Also, there need not be more of these maps than there are partitions
+	 * touched.
+	 */
 	TupleConversionMap **parent_child_tupconv_maps;
+	int			parent_child_tupconv_maps_allocsize;
+
+	/*
+	 * This is a tuple slot used for a partition after tuple routing.
+	 * Maintained separately because partitions may have different rowtypes.
+	 */
+	TupleTableSlot *partition_tuple_slot;
+
+	/*
+	 * Note: The following fields are used only when UPDATE ends up needing to
+	 * do tuple routing.
+	 */
+
+	/*
+	 * Information to convert tuples of the leaf partitions' rowtype to the
+	 * root parent's rowtype.  The transition table machinery needs these
+	 * when storing tuples of a partition's rowtype into a transition
+	 * table that can only store tuples of the root parent's rowtype.
+	 */
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
+	int			child_parent_tupconv_maps_allocsize;
+
+	/*
+	 * The following maps indexes of UPDATE result rels in the per-subplan
+	 * array to indexes of their pointers in the 'partitions' array above.
+	 */
 	int		   *subplan_partition_offsets;
 	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+
+	/*
+	 * During UPDATE tuple routing, this tuple slot is used to transiently
+	 * store a tuple using the root table's rowtype after converting it from
+	 * the tuple's source leaf partition's rowtype.  That is, if leaf
+	 * the tuple's source leaf partition's rowtype.  That is, if the leaf
+	 */
 	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
@@ -193,8 +214,9 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
 extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
#11David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#10)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 13 July 2018 at 20:20, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Why don't we abandon the notion altogether that
ExecSetupPartitionTupleRouting *has to* process the whole partition tree?

[...]

I implemented that idea in the attached patch, which applies on top of
your 0001 patch, but I'd say it's too big to be just called a delta. I
was able to get following performance numbers using the following pgbench
test:

Thanks for looking at this. I like that your idea gets rid of the
indexes cache in syscache. I was not very happy with that part.

I've looked over the code and the ExecUseUpdateResultRelForRouting()
function is broken. Your while loop only skips partitions for the
current partitioned table; it does not skip ModifyTable subnodes that
belong to other partitioned tables.

You can use the following. The code does not find the t1_a2 subnode.

create table t1 (a int, b int) partition by list(a);
create table t1_a1 partition of t1 for values in(1) partition by list(b);
create table t1_a2 partition of t1 for values in(2);
create table t1_a1_b1 partition of t1_a1 for values in(1);
create table t1_a1_b2 partition of t1_a1 for values in(2);
insert into t1 values(2,2);

update t1 set a = a;

I think there might not be enough information to make this work
correctly, because if you change the loop to skip subnodes, then it
won't work in cases where partition[0] was pruned.

I've another patch sitting here, partly done, that changes
pg_class.relispartition into pg_class.relpartitionparent. If we had
that then we could code your loop to work correctly. Alternatively,
I guess we could just ignore the UPDATE's ResultRelInfos and just
build new ones. Unsure if there's actually a reason we need to reuse
the existing ones, is there?

I think you'd need to know the owning partition and skip subnodes that
don't belong to pd->reldesc. Alternatively, a hashtable could be built
with all the oids belonging to pd->reldesc, then we could loop over
the update_rris finding subnodes that can be found in the hashtable.
Likely this will be much slower than the sort of merge lookup that the
previous code did.
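
For illustration, an untested sketch of that hashtable variant (the
variable names are made up, and it assumes the usual dynahash setup
from utils/hsearch.h):

HASHCTL		ctl;
HTAB	   *partoids;
int			i;

/* Collect the OIDs of pd->reldesc's partitions into a hash table. */
memset(&ctl, 0, sizeof(ctl));
ctl.keysize = sizeof(Oid);
ctl.entrysize = sizeof(Oid);
partoids = hash_create("partition oids", pd->partdesc->nparts, &ctl,
					   HASH_ELEM | HASH_BLOBS);
for (i = 0; i < pd->partdesc->nparts; i++)
	(void) hash_search(partoids, &pd->partdesc->oids[i], HASH_ENTER, NULL);

/* Then a single pass over the UPDATE result rels, keeping the members. */
for (i = 0; i < num_update_rris; i++)
{
	Oid		relid = RelationGetRelid(update_rris[i].ri_RelationDesc);

	if (hash_search(partoids, &relid, HASH_FIND, NULL) != NULL)
	{
		/* update_rris[i] is a partition of pd->reldesc; reuse it */
	}
}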

Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
The code seems to assume that there can only be at the most 65536
partitions, but I don't think there's any code which restricts us to
that. There is code in the planner that will bork when trying to
create a RangeTblEntry up that high, but as far as I know that won't
be noticed on the INSERT path. I don't think this code has any
business knowing what the special varnos are set to either. It would
be better to just remove the limit and suffer the small wasted array
space. I understand you've probably coded it like this due to the
similar code that was in my patch, but with mine I knew the total
number of partitions. Your patch does not.
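
For example (sketch only, reusing the patch's own growth pattern, just
without the cap), the growth step could simply be:

if (part_result_rel_index >= proute->partitions_allocsize)
{
	proute->partitions_allocsize *= 2;
	proute->partitions = (ResultRelInfo **)
		repalloc(proute->partitions,
				 sizeof(ResultRelInfo *) * proute->partitions_allocsize);
}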

Other thoughts on the patch:

I wonder if it's worth having syscache keep a count on the number of
sub-partitioned tables a partition has. If there are none in the root
partition then the partition_dispatch_info can be initialized with
just 1 element to store the root details. Although, maybe it's not
worth it to reduce the array size by 7 elements.
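
If it were done, the setup could be as simple as this hypothetical
sketch ('nsubparts' is not an existing field; it stands for the cached
count suggested above):

/* 'nsubparts' is the hypothetical cached count of sub-partitioned tables */
proute->dispatch_allocsize = (partdesc->nsubparts == 0) ?
	1 : PARTITION_ROUTING_INITSIZE;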

Also, I'm a bit confused why you change the comments in
execPartition.h for PartitionTupleRouting to be inline again. I
brought those out of line as I thought the complexity of the code
warranted that. You're inlining them again goes against what all the
other structs do in that file.

Apart from that, I think the idea is promising. We'll just need to
find a way to make ExecUseUpdateResultRelForRouting work correctly.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#12Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#11)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi David,

Thanks for taking a look.

On 2018/07/15 17:34, David Rowley wrote:

I've looked over the code and the ExecUseUpdateResultRelForRouting()
function is broken. Your while loop only skips partitions for the
current partitioned table; it does not skip ModifyTable subnodes that
belong to other partitioned tables.

You can use the following. The code does not find the t1_a2 subnode.

create table t1 (a int, b int) partition by list(a);
create table t1_a1 partition of t1 for values in(1) partition by list(b);
create table t1_a2 partition of t1 for values in(2);
create table t1_a1_b1 partition of t1_a1 for values in(1);
create table t1_a1_b2 partition of t1_a1 for values in(2);
insert into t1 values(2,2);

update t1 set a = a;

Hmm, it indeed is broken.

I think there might not be enough information to make this work
correctly, because if you change the loop to skip subnodes, then it
won't work in cases where partition[0] was pruned.

I've another patch sitting here, partly done, that changes
pg_class.relispartition into pg_class.relpartitionparent. If we had
that then we could code your loop to work correctly. Alternatively,
I guess we could just ignore the UPDATE's ResultRelInfos and just
build new ones. Unsure if there's actually a reason we need to reuse
the existing ones, is there?

We try to reuse the existing ones because, back when the patch was
written (not by me though), we thought that redoing all the work
InitResultRelInfo does for each partition, where an existing
ResultRelInfo could have been used instead, would cumulatively end up
being more expensive than identifying the reusable ones by a linear
scan of the partition and result rel arrays in parallel. I don't
remember seeing a benchmark to demonstrate the benefit of doing this
though. Maybe it was posted, but I don't remember having looked at it
closely.

I think you'd need to know the owning partition and skip subnodes that
don't belong to pd->reldesc. Alternatively, a hashtable could be built
with all the oids belonging to pd->reldesc, then we could loop over
the update_rris finding subnodes that can be found in the hashtable.
Likely this will be much slower than the sort of merge lookup that the
previous code did.

I think one option is to simply give up on the idea of matching *all*
UPDATE result rels that belong to a given partitioned table (pd->reldesc)
in one call of ExecUseUpdateResultRelForRouting. Instead, pass the index
of the partition (in pd->partdesc->oids) to find the ResultRelInfo for,
loop over all UPDATE result rels looking for one, and return immediately
on finding one after having stored its pointer in proute->partitions. In
the worst case, we'll end up scanning the UPDATE result rels array for
every partition that gets touched, but maybe such an UPDATE query is
less common, and even if such a query occurs, tuple routing might be
the last of its bottlenecks.

I have implemented that approach in the updated patch.
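
In condensed form (array growth and the sanity checks elided), the new
ExecUseUpdateResultRelForRouting loop now does roughly:

for (i = 0; i < num_update_result_rels; i++)
{
	ResultRelInfo *rri = &update_result_rels[i];

	if (partoid != RelationGetRelid(rri->ri_RelationDesc))
		continue;

	/* Found the partition's subplan result rel; reuse it. */
	rri->ri_PartitionRoot = proute->partition_root;
	proute->subplan_partition_offsets[i] = proute->num_partitions;
	pd->indexes[partidx] = proute->num_partitions++;
	proute->partitions[pd->indexes[partidx]] = rri;
	break;
}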

That means I also needed to change things so that
ExecUseUpdateResultRelForRouting is now only called by
ExecFindPartition, because with the new arrangement it's useless to
call it from ExecSetupPartitionTupleRouting. Moreover, an UPDATE may
not use tuple routing at all, even if the fact that the partition key
is being updated results in ExecSetupPartitionTupleRouting being
called.

Another thing that I don't like is the PARTITION_ROUTING_MAXSIZE code.
The code seems to assume that there can only be at the most 65536
partitions, but I don't think there's any code which restricts us to
that. There is code in the planner that will bork when trying to
create a RangeTblEntry up that high, but as far as I know that won't
be noticed on the INSERT path. I don't think this code has any
business knowing what the special varnos are set to either. It would
be better to just remove the limit and suffer the small wasted array
space. I understand you've probably coded it like this due to the
similar code that was in my patch, but with mine I knew the total
number of partitions. Your patch does not.

OK, I changed it to UINT_MAX.

Other thoughts on the patch:

I wonder if it's worth having syscache keep a count on the number of
sub-partitioned tables a partition has. If there are none in the root
partition then the partition_dispatch_info can be initialized with
just 1 element to store the root details. Although, maybe it's not
worth it to reduce the array size by 7 elements.

Hmm yes. Allocating space for 8 pointers when we really need 1 is not too
bad, if the alternative is to modify partcache.c.

Also, I'm a bit confused why you change the comments in
execPartition.h for PartitionTupleRouting to be inline again. I
brought those out of line as I thought the complexity of the code
warranted that. Your inlining them again goes against what all the
other structs do in that file.

The comments were out-of-line to begin with, but that started to become
distracting when updating them. But I agree about being consistent and
hence I have moved them back to where they were. I have significantly
rewritten those comments though to be clearer.

Apart from that, I think the idea is promising. We'll just need to
find a way to make ExecUseUpdateResultRelForRouting work correctly.

Let me know what you think of the code in the updated patch.

Thanks,
Amit

Attachments:

david-0001-delta-v2.patch (text/plain; charset=UTF-8)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 25bec76c1d..44cf3bba12 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2621,10 +2621,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2644,10 +2642,8 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
-			resultRelInfo = ExecGetPartitionInfo(mtstate,
-												 saved_resultRelInfo,
-												 proute, estate,
-												 leaf_part_index);
+			Assert(proute->partitions[leaf_part_index] != NULL);
+			resultRelInfo = proute->partitions[leaf_part_index];
 
 			/*
 			 * For ExecInsertIndexTuples() to work on the partition's indexes
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a3a67dd0d..23c766b5fc 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,17 +31,19 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
-static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+#define PARTITION_ROUTING_INITSIZE	8
+#define PARTITION_ROUTING_MAXSIZE	UINT_MAX
+
+static int ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
+								 PartitionTupleRouting *proute,
+								 PartitionDispatch pd, int partidx);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *resultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, Oid **leaf_part_oids,
-								 int *n_leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, Oid **leaf_part_oids,
-							   int *n_leaf_part_oids,
-							   int *leaf_part_oid_size);
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+						Oid partoid, PartitionDispatch parent_pd, int part_index);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -68,127 +70,61 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for UPDATE
- * tuple routing, the caller will have already initialized ResultRelInfo's for
- * each partition present in the ModifyTable's subplans. These are reused and
- * assigned to their respective slot in the aforementioned array.  For such
- * partitions, we delay setting up objects such as TupleConversionMap until
- * those are actually chosen as the partitions to route tuples to.  See
- * ExecPrepareTupleRouting.
+ * This is called during the initialization of a COPY FROM command or of an
+ * INSERT/UPDATE query.  We provisionally allocate space to hold
+ * PARTITION_ROUTING_INITSIZE number of PartitionDispatch and ResultRelInfo
+ * pointers in their respective arrays.  The arrays will be doubled in
+ * size via repalloc (subject to the limit of PARTITION_ROUTING_MAXSIZE
+ * entries at most) if and when we run out of space, as more partitions need
+ * to be added.  Since we already have the root parent open, its
+ * PartitionDispatch is created here.
+ *
+ * The PartitionDispatch object of a non-root partitioned table or the
+ * ResultRelInfo of a leaf partition is allocated and added to the respective
+ * array when it is first encountered in ExecFindPartition.  As mentioned
+ * above, we might need to expand the respective array before storing it.
+ *
+ * Tuple conversion maps (either child to parent and/or vice versa) and the
+ * array(s) to hold them are allocated only if needed.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	int			i;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &proute->partition_oids, &nparts);
 
-	proute->num_partitions = nparts;
-	proute->partitions =
-		(ResultRelInfo **) palloc0(nparts * sizeof(ResultRelInfo *));
+	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+			palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
 
 	/*
-	 * Allocate an array to store ResultRelInfos that we'll later allocate.
-	 * It is common that not all partitions will have tuples routed to them,
-	 * so we'll refrain from allocating enough space for all partitions here.
-	 * Let's just start with something small and make it bigger only when
-	 * needed.  Storing these separately rather than relying on the
-	 *'partitions' array allows us to quickly identify which ResultRelInfos we
-	 * must teardown at the end.
+	 * Initialize this table's PartitionDispatch object.  Since the root
+	 * parent doesn't itself have any parent, the last two parameters are
+	 * not used.
 	 */
-	proute->partitions_init_size = Min(nparts, 8);
-
-	proute->partitions_init = (ResultRelInfo **)
-		palloc(proute->partitions_init_size * sizeof(ResultRelInfo *));
-
-	proute->num_partitions_init = 0;
-
-	/* We only allocate this when we need to store the first non-NULL map */
-	proute->parent_child_tupconv_maps = NULL;
-
-	proute->child_parent_tupconv_maps = NULL;
-
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
+	proute->num_dispatch = 1;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+	proute->partitions = (ResultRelInfo **)
+			palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * If UPDATE needs to do tuple routing, we'll need a slot that will
+	 * transiently store the tuple being routed using the root parent's
+	 * rowtype.  We must set up at least this slot, because it's needed even
+	 * before tuple routing begins.  Other necessary information is
+	 * initialized when the tuple routing code calls
+	 * ExecUseUpdateResultRelForRouting.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
-
-	/* Set up details specific to the type of tuple routing we are doing. */
 	if (node && node->operation == CMD_UPDATE)
-	{
-		ResultRelInfo *update_rri = NULL;
-		int			num_update_rri = 0,
-					update_rri_index = 0;
-
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
 		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-
-		for (i = 0; i < nparts; i++)
-		{
-			Oid			leaf_oid = proute->partition_oids[i];
-
-			/*
-			 * If the leaf partition is already present in the per-subplan
-			 * result rels, we re-use that rather than initialize a new result
-			 * rel. The per-subplan resultrels and the resultrels of the leaf
-			 * partitions are both in the same canonical order. So while going
-			 * through the leaf partition oids, we need to keep track of the
-			 * next per-subplan result rel to be looked for in the leaf
-			 * partition resultrels.
-			 */
-			if (update_rri_index < num_update_rri &&
-				RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-			{
-				ResultRelInfo *leaf_part_rri;
-
-				leaf_part_rri = &update_rri[update_rri_index];
-
-				/*
-				 * This is required in order to convert the partition's tuple
-				 * to be compatible with the root partitioned table's tuple
-				 * descriptor.  When generating the per-subplan result rels,
-				 * this was not set.
-				 */
-				leaf_part_rri->ri_PartitionRoot = rel;
-
-				/* Remember the subplan offset for this ResultRelInfo */
-				proute->subplan_partition_offsets[update_rri_index] = i;
-
-				update_rri_index++;
-
-				proute->partitions[i] = leaf_part_rri;
-			}
-		}
-
-		/*
-		 * We should have found all the per-subplan resultrels in the leaf
-		 * partitions.
-		 */
-		Assert(update_rri_index == num_update_rri);
-	}
 	else
 	{
 		proute->root_tuple_slot = NULL;
@@ -196,26 +132,38 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 		proute->num_subplan_partition_offsets = 0;
 	}
 
+	/* We only allocate this when we need to store the first non-NULL map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+
+	/*
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
+	 */
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
+	int			result = -1;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
@@ -272,10 +220,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		 * partitions to begin with.
 		 */
 		if (partdesc->nparts == 0)
-		{
-			result = -1;
 			break;
-		}
 
 		cur_index = get_partition_for_tuple(rel, values, isnull);
 
@@ -285,17 +230,64 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		 * next parent to find a partition of.
 		 */
 		if (cur_index < 0)
-		{
-			result = -1;
 			break;
-		}
-		else if (parent->indexes[cur_index] >= 0)
+
+		if (partdesc->is_leaf[cur_index])
 		{
-			result = parent->indexes[cur_index];
+			/* Get the ResultRelInfo of this leaf partition. */
+			if (parent->indexes[cur_index] >= 0)
+			{
+				/*
+				 * Already assigned (either created fresh or reused from the
+				 * set of UPDATE result rels.)
+				 */
+				Assert(parent->indexes[cur_index] < proute->num_partitions);
+				result = parent->indexes[cur_index];
+			}
+			else if (node && node->operation == CMD_UPDATE)
+			{
+				/* Try to assign an existing result rel for tuple routing. */
+				result = ExecUseUpdateResultRelForRouting(mtstate, proute,
+														  parent, cur_index);
+
+				/* We may not really have found one. */
+				Assert(result < 0 ||
+					   parent->indexes[cur_index] < proute->num_partitions);
+			}
+
+			/* We need to create one afresh. */
+			if (result < 0)
+			{
+				result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+											   proute, estate,
+											   parent, cur_index);
+				Assert(result >= 0 && result < proute->num_partitions);
+			}
 			break;
 		}
 		else
-			parent = pd[-parent->indexes[cur_index]];
+		{
+			/* Get the PartitionDispatch of this parent. */
+			if (parent->indexes[cur_index] >= 0)
+			{
+				/* Already allocated. */
+				Assert(parent->indexes[cur_index] < proute->num_dispatch);
+				parent = pd[parent->indexes[cur_index]];
+			}
+			else
+			{
+				/* Not yet, allocate one. */
+				PartitionDispatch new_parent;
+
+				new_parent =
+					ExecInitPartitionDispatchInfo(proute,
+												  partdesc->oids[cur_index],
+												  parent, cur_index);
+				Assert(parent->indexes[cur_index] >= 0 &&
+					   parent->indexes[cur_index] < proute->num_dispatch);
+				parent = new_parent;
+			}
+		}
 	}
 
 	/* A partition was not found. */
@@ -318,65 +310,110 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 }
 
 /*
- * ExecGetPartitionInfo
- *		Fetch ResultRelInfo for partidx
+ * ExecUseUpdateResultRelForRouting
+ *		Checks if any of the ResultRelInfos created by ExecInitModifyTable
+ *		belongs to the passed-in partition, and if so, stores its pointer
+ *		in proute so that it can be used as the target of tuple routing
  *
- * Sets up ResultRelInfo, if not done already.
+ * Return value is the index at which the found result rel is stored in proute
+ * or -1 if none found.
  */
-ResultRelInfo *
-ExecGetPartitionInfo(ModifyTableState *mtstate,
-					 ResultRelInfo *resultRelInfo,
-					 PartitionTupleRouting *proute,
-					 EState *estate, int partidx)
+static int
+ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
+								 PartitionTupleRouting *proute,
+								 PartitionDispatch pd,
+								 int partidx)
 {
-	ResultRelInfo *result = proute->partitions[partidx];
+	Oid				partoid = pd->partdesc->oids[partidx];
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo  *update_result_rels = NULL;
+	int				num_update_result_rels = 0;
+	int				i;
+	int				part_result_rel_index = -1;
 
-	if (result)
-		return result;
+	update_result_rels = mtstate->resultRelInfo;
+	num_update_result_rels = list_length(node->plans);
 
-	result = ExecInitPartitionInfo(mtstate,
-								   resultRelInfo,
-								   proute,
-								   estate,
-								   partidx);
-	Assert(result);
-
-	proute->partitions[partidx] = result;
-
-	/*
-	 * Record the ones setup so far in setup order.  This makes the cleanup
-	 * operation more efficient when very few have been setup.
-	 */
-	if (proute->num_partitions_init == proute->partitions_init_size)
+	/* If here for the first time, initialize necessary info in proute. */
+	if (proute->subplan_partition_offsets == NULL)
 	{
-		/* First allocate more space if the array is not large enough */
-		proute->partitions_init_size =
-			Min(proute->partitions_init_size * 2, proute->num_partitions);
-
-		proute->partitions_init = (ResultRelInfo **)
-				repalloc(proute->partitions_init,
-				proute->partitions_init_size * sizeof(ResultRelInfo *));
+		proute->subplan_partition_offsets =
+				palloc(num_update_result_rels * sizeof(int));
+		memset(proute->subplan_partition_offsets, -1,
+			   num_update_result_rels * sizeof(int));
+		proute->num_subplan_partition_offsets = num_update_result_rels;
 	}
 
-	proute->partitions_init[proute->num_partitions_init++] = result;
+	/*
+	 * Go through UPDATE result rels and save the pointers of those that
+	 * belong to this table's partitions in proute.
+	 */
+	for (i = 0; i < num_update_result_rels; i++)
+	{
+		ResultRelInfo *update_result_rel = &update_result_rels[i];
 
-	Assert(proute->num_partitions_init <= proute->num_partitions);
+		if (partoid != RelationGetRelid(update_result_rel->ri_RelationDesc))
+			continue;
 
-	return result;
+		/* Found it. */
+
+		/*
+		 * This is required in order to convert the partition's tuple
+		 * to be compatible with the root partitioned table's tuple
+		 * descriptor.  When generating the per-subplan result rels,
+		 * this was not set.
+		 */
+		update_result_rel->ri_PartitionRoot = proute->partition_root;
+
+		/*
+		 * Remember the index of this UPDATE result rel in the tuple
+		 * routing partition array.
+		 */
+		proute->subplan_partition_offsets[i] = proute->num_partitions;
+
+		/*
+		 * Also, record in PartitionDispatch that we have a valid
+		 * ResultRelInfo for this partition.
+		 */
+		Assert(pd->indexes[partidx] == -1);
+		part_result_rel_index = proute->num_partitions++;
+		if (part_result_rel_index >= PARTITION_ROUTING_MAXSIZE)
+			elog(ERROR, "invalid partition index: %u", part_result_rel_index);
+		pd->indexes[partidx] = part_result_rel_index;
+		if (part_result_rel_index >= proute->partitions_allocsize)
+		{
+			/* Expand allocated space. */
+			proute->partitions_allocsize =
+				Min(proute->partitions_allocsize * 2,
+					PARTITION_ROUTING_MAXSIZE);
+			proute->partitions = (ResultRelInfo **)
+				repalloc(proute->partitions,
+						 sizeof(ResultRelInfo *) *
+								proute->partitions_allocsize);
+		}
+		proute->partitions[part_result_rel_index] = update_result_rel;
+		break;
+	}
+
+	return part_result_rel_index;
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
  *
- * Returns the ResultRelInfo
+ * This also stores it in the proute->partitions array at the next
+ * available index, possibly expanding the array if there isn't any space
+ * left in it, and returns the index where it's stored.
  */
-static ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *resultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch parent, int partidx)
 {
+	Oid			partoid = parent->partdesc->oids[partidx];
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
 	Relation	rootrel = resultRelInfo->ri_RelationDesc,
 				partrel;
@@ -385,12 +422,13 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(partoid, NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -566,8 +604,23 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	if (part_result_rel_index >= PARTITION_ROUTING_MAXSIZE)
+		elog(ERROR, "invalid partition index: %u", part_result_rel_index);
+	parent->indexes[partidx] = part_result_rel_index;
+	if (part_result_rel_index >= proute->partitions_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->partitions_allocsize =
+			Min(proute->partitions_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+		proute->partitions = (ResultRelInfo **)
+			repalloc(proute->partitions,
+					 sizeof(ResultRelInfo *) * proute->partitions_allocsize);
+	}
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
@@ -626,7 +679,8 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			TupleConversionMap *map;
 
 			map = proute->parent_child_tupconv_maps ?
-				proute->parent_child_tupconv_maps[partidx] : NULL;
+				proute->parent_child_tupconv_maps[part_result_rel_index] :
+				NULL;
 
 			Assert(node->onConflictSet != NIL);
 			Assert(resultRelInfo->ri_onConflict != NULL);
@@ -729,12 +783,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
 
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -766,10 +820,26 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	if (map)
 	{
+		int		new_size;
+
 		/* Allocate parent child map array only if we need to store a map */
-		if (!proute->parent_child_tupconv_maps)
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			proute->parent_child_tupconv_maps_allocsize = new_size =
+				PARTITION_ROUTING_INITSIZE;
 			proute->parent_child_tupconv_maps = (TupleConversionMap **)
-				palloc0(proute->num_partitions * sizeof(TupleConversionMap *));
+				palloc0(sizeof(TupleConversionMap *) * new_size);
+		}
+		/* We may have run out of the initially allocated space. */
+		else if (partidx >= proute->parent_child_tupconv_maps_allocsize)
+		{
+			proute->parent_child_tupconv_maps_allocsize = new_size =
+				Min(proute->parent_child_tupconv_maps_allocsize * 2,
+					PARTITION_ROUTING_MAXSIZE);
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				repalloc(proute->parent_child_tupconv_maps,
+						 sizeof(TupleConversionMap *) * new_size);
+		}
 
 		proute->parent_child_tupconv_maps[partidx] = map;
 	}
@@ -788,6 +858,91 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * next available index, possibly expanding the array if there isn't any
+ * space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int part_index)
+{
+	Relation	rel;
+	TupleDesc	tupdesc;
+	PartitionDesc partdesc;
+	PartitionKey partkey;
+	PartitionDispatch pd;
+	int			dispatchidx;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	tupdesc = RelationGetDescr(rel);
+	partdesc = RelationGetPartitionDesc(rel);
+	partkey = RelationGetPartitionKey(rel);
+
+	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+	pd->reldesc = rel;
+	pd->key = partkey;
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+				convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+									   tupdesc,
+									   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
+
+	pd->indexes = (int *) palloc(sizeof(int) * partdesc->nparts);
+
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+	dispatchidx = proute->num_dispatch++;
+	if (dispatchidx >= PARTITION_ROUTING_MAXSIZE)
+		elog(ERROR, "invalid partition index: %u", dispatchidx);
+	if (parent_pd)
+		parent_pd->indexes[part_index] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize =
+			Min(proute->dispatch_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
+
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	return pd;
+}
+
+/*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
  *
@@ -805,13 +960,14 @@ ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
+	proute->child_parent_tupconv_maps_allocsize = PARTITION_ROUTING_INITSIZE;
 	proute->child_parent_tupconv_maps =
 		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+										PARTITION_ROUTING_INITSIZE);
 
 	/* Same is the case for this array. All the values are set to false */
 	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+		(bool *) palloc0(sizeof(bool) * PARTITION_ROUTING_INITSIZE);
 }
 
 /*
@@ -826,8 +982,9 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/* If nobody else set up the per-leaf maps array, do so ourselves. */
+	if (proute->child_parent_tupconv_maps == NULL)
+		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
 	if (proute->child_parent_map_not_required[leaf_index])
@@ -846,6 +1003,30 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 							   gettext_noop("could not convert row type"));
 
 	/* If it turns out no map is needed, remember for next time. */
+
+	/* We may have run out of the initially allocated space. */
+	if (leaf_index >= proute->child_parent_tupconv_maps_allocsize)
+	{
+		int		new_size,
+				old_size;
+
+		old_size = proute->child_parent_tupconv_maps_allocsize;
+		proute->child_parent_tupconv_maps_allocsize = new_size =
+			Min(old_size * 2,
+				PARTITION_ROUTING_MAXSIZE);
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(proute->child_parent_tupconv_maps + old_size, 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(proute->child_parent_map_not_required + old_size, false,
+			   sizeof(bool) * (new_size - old_size));
+	}
+
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
@@ -909,9 +1090,9 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
-	for (i = 0; i < proute->num_partitions_init; i++)
+	for (i = 0; i < proute->num_partitions; i++)
 	{
-		ResultRelInfo *resultRelInfo = proute->partitions_init[i];
+		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
@@ -920,6 +1101,28 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
 														   resultRelInfo);
 
+		/*
+		 * Check if this result rel is one of UPDATE subplan result rels,
+		 * which if so, let ExecEndPlan() close it.
+		 */
+		if (proute->subplan_partition_offsets)
+		{
+			int		j;
+			bool		found = false;
+
+			for (j = 0; j < proute->num_subplan_partition_offsets; j++)
+			{
+				if (proute->subplan_partition_offsets[j] == i)
+				{
+					found = true;
+					break;
+				}
+			}
+
+			if (found)
+				continue;
+		}
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
@@ -931,211 +1134,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns an array of PartitionDispatch as is required for routing
- *		tuples to the correct partition.
- *
- * 'num_parted' is set to the size of the returned array and the
- *'leaf_part_oids' array is allocated and populated with each leaf partition
- * Oid in the hierarchy. 'n_leaf_part_oids' is set to the size of that array.
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, Oid **leaf_part_oids,
-								 int *n_leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-	int			leaf_part_oid_size;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*n_leaf_part_oids = 0;
-
-	leaf_part_oid_size = 0;
-	*leaf_part_oids = NULL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids,
-								   n_leaf_part_oids, &leaf_part_oid_size);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we populate
- * '*pds' with PartitionDispatch objects of each partitioned table we find,
- * and populate leaf_part_oids with each leaf partition OID found.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- *
- * Note: Callers must not attempt to pfree the 'leaf_part_oids' array.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, Oid **leaf_part_oids,
-							   int *n_leaf_part_oids,
-							   int *leaf_part_oid_size)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-	int			nparts;
-	int			oid_array_used;
-	int			oid_array_size;
-	Oid		   *oid_array;
-	Oid		   *partdesc_oids;
-	bool	   *partdesc_subpartitions;
-	int		   *indexes;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-
-		/*
-		 * If the parent has no sub partitions then we can skip calculating
-		 * all the leaf partitions and just return all the oids at this level.
-		 * In this case, the indexes were also pre-calculated for us by the
-		 * syscache code.
-		 */
-		if (!partdesc->hassubpart)
-		{
-			*leaf_part_oids = partdesc->oids;
-			/* XXX or should we memcpy this out of syscache? */
-			pd->indexes = partdesc->indexes;
-			*n_leaf_part_oids = partdesc->nparts;
-			return;
-		}
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	oid_array_used = *n_leaf_part_oids;
-	oid_array_size = *leaf_part_oid_size;
-	oid_array = *leaf_part_oids;
-	nparts = partdesc->nparts;
-
-	if (!oid_array)
-	{
-		oid_array_size = *leaf_part_oid_size = nparts;
-		*leaf_part_oids = (Oid *) palloc(sizeof(Oid) * nparts);
-		oid_array = *leaf_part_oids;
-	}
-
-	partdesc_oids = partdesc->oids;
-	partdesc_subpartitions = partdesc->subpartitions;
-
-	pd->indexes = indexes = (int *) palloc(nparts * sizeof(int));
-
-	for (i = 0; i < nparts; i++)
-	{
-		Oid			partrelid = partdesc_oids[i];
-
-		if (!partdesc_subpartitions[i])
-		{
-			if (oid_array_size <= oid_array_used)
-			{
-				oid_array_size *= 2;
-				oid_array = (Oid *) repalloc(oid_array,
-											 sizeof(Oid) * oid_array_size);
-			}
-
-			oid_array[oid_array_used] = partrelid;
-			indexes[i] = oid_array_used++;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			*n_leaf_part_oids = oid_array_used;
-			*leaf_part_oid_size = oid_array_size;
-			*leaf_part_oids = oid_array;
-
-			indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids,
-										   n_leaf_part_oids, leaf_part_oid_size);
-
-			oid_array_used = *n_leaf_part_oids;
-			oid_array_size = *leaf_part_oid_size;
-			oid_array = *leaf_part_oids;
-		}
-	}
-
-	*n_leaf_part_oids = oid_array_used;
-	*leaf_part_oid_size = oid_array_size;
-	*leaf_part_oids = oid_array;
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 07b5f968aa..8b671c6426 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1666,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,15 +1708,12 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
 	/* Get the ResultRelInfo corresponding to the selected partition. */
-	partrel = ExecGetPartitionInfo(mtstate, targetRelInfo, proute, estate,
-								   partidx);
+	Assert(proute->partitions[partidx] != NULL);
+	partrel = proute->partitions[partidx];
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1825,17 +1821,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			i;
 
 	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
-	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
 	 * conversion is necessary, which is hopefully a common case.
@@ -1857,78 +1842,17 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 }
 
 /*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
-/*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
-
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index b36b7366e5..aa82aa52eb 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,7 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
-		result->subpartitions = (bool *) palloc(nparts * sizeof(bool));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -775,7 +775,6 @@ RelationBuildPartitionDesc(Relation rel)
 		}
 
 		result->boundinfo = boundinfo;
-		result->hassubpart = false; /* unless we discover otherwise below */
 
 		/*
 		 * Now assign OIDs from the original array into mapped indexes of the
@@ -786,33 +785,13 @@ RelationBuildPartitionDesc(Relation rel)
 		for (i = 0; i < nparts; i++)
 		{
 			int			index = mapping[i];
-			bool		subpart;
 
 			result->oids[index] = oids[i];
-
-			subpart = (get_rel_relkind(oids[i]) == RELKIND_PARTITIONED_TABLE);
 			/* Record whether the partition is a leaf partition */
-			result->subpartitions[index] = subpart;
-			result->hassubpart |= subpart;
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
 		}
 
-		/*
-		 * If there are no subpartitions then we can pre-calculate the
-		 * PartitionDispatch->indexes array.  Doing this here saves quite a
-		 * bit of overhead on simple queries which perform INSERTs or UPDATEs
-		 * on partitioned tables with many partitions.  The pre-calculation is
-		 * very simple.  All we need to store is a sequence of numbers from 0
-		 * to nparts - 1.
-		 */
-		if (!result->hassubpart)
-		{
-			result->indexes = (int *) palloc(nparts * sizeof(int));
-			for (i = 0; i < nparts; i++)
-				result->indexes[i] = i;
-		}
-		else
-			result->indexes = NULL;
-
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a8c69ff224..8d20469c98 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,18 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs array of 'nparts' of partitions in
-								 * partbound order */
-	int		   *indexes;		/* Stores index for corresponding 'oids'
-								 * element for use in tuple routing, or NULL
-								 * if hassubpart is true.
-								 */
-	bool	   *subpartitions;	/* Array of 'nparts' set to true if the
-								 * corresponding 'oids' element belongs to a
-								 * sub-partitioned table.
-								 */
-	bool		hassubpart;		/* true if any oid belongs to a
-								 * sub-partitioned table */
+	Oid		   *oids;			/* Array of length 'nparts' containing
+								 * partition OIDs in order of their
+								 * bounds */
+	bool	   *is_leaf;		/* Array of length 'nparts' containing whether
+								 * a partition is a leaf partition */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 822f66f5e2..91b840e12f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -50,72 +50,124 @@ typedef struct PartitionDispatchData
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
  *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * partitions_init				Array of ResultRelInfo* objects in the order
- *								that they were lazily initialized.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * num_partitions_init			Number of leaf partition lazily setup so far.
- * partitions_init_size			Size of partitions_init array.
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done). Remains NULL if no maps to store.
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ *	partition_root			Root table, that is, the table mentioned in the
+ *							INSERT or UPDATE query or COPY FROM command.
+ *
+ *	partition_dispatch_info	Contains PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the root partitioned table is *always*
+ *							present as the first entry of this array.
+ *
+ *	num_dispatch			The number of existing entries; also serves as
+ *							the index of the next entry to be allocated and
+ *							placed in 'partition_dispatch_info'.
+ *
+ *	dispatch_allocsize		(>= 'num_dispatch') is the number of entries that
+ *							can be stored in 'partition_dispatch_info' before
+ *							needing to reallocate more space.
+ *
+ *	partitions				Contains pointers to a ResultRelInfos of all leaf
+ *							partitions touched by tuple routing.  Some of
+ *							these are pointers to "reused" ResultRelInfos,
+ *							that is, those that are created and destroyed
+ *							outside execPartition.c, for example, when tuple
+ *							routing is used for UPDATE queries that modify
+ *							the partition key.  Rest of them are pointers to
+ *							ResultRelInfos managed by execPartition.c itself
+ *
+ *	num_partitions			The number of existing entries; also serves as
+ *							the index of the next entry to be allocated and
+ *							placed in 'partitions'
+ *
+ *	partitions_allocsize	(>= 'num_partitions') is the number of entries
+ *							that can be stored in 'partitions' before needing
+ *							to reallocate more space
+ *
+ *	parent_child_tupconv_maps	Contains information to convert tuples of the
+ *							root parent's rowtype to those of the leaf
+ *							partitions' rowtype, but only for those partitions
+ *							whose TupleDescs are physically different from the
+ *							root parent's.  If none of the partitions has such
+ *							a differing TupleDesc, then it's NULL.  If
+ *							non-NULL, is of the same size as 'partitions', to
+ *							be able to use the same array index.  Also, there
+ *							need not be more of these maps than there are
+ *							partitions that were touched.
+ *
+ *	parent_child_tupconv_maps_allocsize		The number of entries that can be
+ *							stored in 'parent_child_tupconv_maps' before
+ *							needing to reallocate more space
+ *
+ *	partition_tuple_slot	This is a tuple slot used to store a tuple using
+ *							the rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because
+ *							partitions may have different rowtypes.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ *	child_parent_tupconv_maps	Information to convert tuples of the leaf
+ *							partitions' rowtype to the root parent's
+ *							rowtype.  These are needed by the transition
+ *							table machinery when storing tuples of a
+ *							partition's rowtype into a transition table that
+ *							can only store tuples of the root parent's
+ *							rowtype.  Like 'parent_child_tupconv_maps' it
+ *							remains NULL if none of the partitions selected
+ *							by tuple routing needed a conversion map.  Also,
+ *							if non-NULL, is of the same size as 'partitions'.
+ *
+ *	child_parent_map_not_required	Records, per partition, that no
+ *							conversion map is needed, so that
+ *							TupConvMapForLeaf can return quickly when set
+ *
+ *	child_parent_tupconv_maps_allocsize		The number of entries that can be
+ *							stored in 'child_parent_tupconv_maps' before
+ *							needing to reallocate more space
+ *
+ *	subplan_partition_offsets	Maps indexes of UPDATE result rels in the
+ *							per-subplan array to the indexes of their
+ *							pointers in the 'partitions' array
+ *
+ *	num_subplan_partition_offsets	The number of entries in
+ *							'subplan_partition_offsets', which is the same
+ *							as the number of UPDATE result rels
+ *
+ *	root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, that
+ *							is, when the leaf partition's rowtype differs.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
+
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
+
 	ResultRelInfo **partitions;
-	ResultRelInfo **partitions_init;
 	int			num_partitions;
-	int			num_partitions_init;
-	int			partitions_init_size;
+	int			partitions_allocsize;
+
 	TupleConversionMap **parent_child_tupconv_maps;
+	int			parent_child_tupconv_maps_allocsize;
+
+	TupleTableSlot *partition_tuple_slot;
+
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
+	int			child_parent_tupconv_maps_allocsize;
+
 	int		   *subplan_partition_offsets;
 	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+
 	TupleTableSlot *root_tuple_slot;
 } PartitionTupleRouting;
 
@@ -193,8 +245,9 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
 extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
#13Kato, Sho
kato-sho@jp.fujitsu.com
In reply to: Alvaro Herrera (#9)
RE: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Jul-11, Alvaro Herrera wrote:

That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-)

Oops! I benchmarked with 11beta2 + 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch.
Results are as follows.

Performance seems to be improved.

 part_num | latency_avg |   tps_ex   | update_latency | select_latency | insert_latency
----------+-------------+------------+----------------+----------------+----------------
      100 |        2.09 | 478.379516 |          1.407 |           0.36 |          0.159
      200 |       5.871 | 170.322179 |          4.621 |          0.732 |          0.285
      400 |      39.029 |  25.622384 |         35.542 |          2.273 |          0.758
      800 |     142.624 |   7.011494 |        135.447 |           5.04 |          1.388
     1600 |     559.872 |   1.786138 |        534.301 |         20.318 |          3.122
     3200 |    2161.834 |   0.462574 |       2077.737 |         72.804 |          7.037
     6400 |     8282.38 |   0.120739 |       7996.212 |        259.406 |         14.514
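
For anyone wanting to reproduce this, a minimal sketch of a pgbench
custom script exercising the measured update/select/insert mix would be
along these lines (the table and column names here are placeholders,
not the ones from the actual test):

\set aid random(1, 100000)
UPDATE parttab SET val = val + 1 WHERE id = :aid;
SELECT val FROM parttab WHERE id = :aid;
INSERT INTO parttab (id, val) VALUES (:aid, 1);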

Thanks

Kato Sho
-----Original Message-----
From: Alvaro Herrera [mailto:alvherre@2ndquadrant.com]
Sent: Wednesday, July 11, 2018 10:30 PM
To: David Rowley <david.rowley@2ndquadrant.com>
Cc: Kato, Sho/加藤 翔 <kato-sho@jp.fujitsu.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Jul-11, David Rowley wrote:

On 6 July 2018 at 21:25, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

2. 11beta2 + patch1 + patch2

patch1: Allow direct lookups of AppendRelInfo by child relid
commit 7d872c91a3f9d49b56117557cdbb0c3d4c620687
patch2: 0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch

 part_num |   tps_ex    | latency_avg | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 | 1224.430344 |       0.817 |          0.551 |          0.085 |          0.048
      200 |  689.567511 |        1.45 |           1.12 |          0.119 |           0.05
      400 |  347.876616 |       2.875 |          2.419 |          0.185 |          0.052
      800 |  140.489269 |       7.118 |          6.393 |          0.329 |          0.059
     1600 |   29.681672 |      33.691 |         31.272 |          1.517 |          0.147
     3200 |    7.021957 |     142.412 |          136.4 |          4.033 |          0.214
     6400 |    1.462949 |     683.557 |        669.187 |          7.677 |          0.264

Just a note to say that the "Allow direct lookups of AppendRelInfo by
child relid" patch is already in master. It's much more relevant to be
testing with master than pg11. This patch is not intended for pg11.

That commit is also in pg11, though -- just not in beta2. So we still don't know how much of an improvement patch2 is by itself :-)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#14David Rowley
david.rowley@2ndquadrant.com
In reply to: Kato, Sho (#13)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 18 July 2018 at 21:44, Kato, Sho <kato-sho@jp.fujitsu.com> wrote:

part_num | latency_avg | tps_ex | update_latency | select_latency | insert_latency
----------+-------------+------------+----------------+----------------+----------------
100 | 2.09 | 478.379516 | 1.407 | 0.36 | 0.159
200 | 5.871 | 170.322179 | 4.621 | 0.732 | 0.285
400 | 39.029 | 25.622384 | 35.542 | 2.273 | 0.758
800 | 142.624 | 7.011494 | 135.447 | 5.04 | 1.388
1600 | 559.872 | 1.786138 | 534.301 | 20.318 | 3.122
3200 | 2161.834 | 0.462574 | 2077.737 | 72.804 | 7.037
6400 | 8282.38 | 0.120739 | 7996.212 | 259.406 | 14.514

Thanks for testing. It's fairly customary to include before/after,
unpatched/patched results. I don't think your patched results mean
much by themselves. It's pretty well known that adding more partitions
slows down the planner and, to a lesser extent, the executor. This
patch only aims to reduce some of the executor startup overheads for
INSERT and UPDATE.

Also, the 0001 patch is not really aiming to break any performance
records. I posted results already and there is only a very small
improvement. The main aim with the 0001 patch is to remove the
bottlenecks so that the performance drop between partitioned and
non-partitioned is primarily due to the partition locking. I'd like
to fix that too, but it's more work and I see no reason that we
shouldn't fix up the other slow parts first. I imagine this will
increase the motivation to resolve the locking all partitions issue
too.

I'd also recommend that if you're testing this, that you do so with a
recent master. The patch is not intended for pg11.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#15David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#12)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Let me know what you think of the code in the updated patch.

Thanks for sending the updated patch.

I looked over it tonight and made a number of changes:

1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
useless since the int would have wrapped long before it reached
UINT_MAX. There's no shortage of other code doubling the size of an
array by multiplying it by 2 unconditionally without considering
overflowing an int. Unsure why you considered this more risky.
2) Fixed a series of bugs regarding the size of the arrays in
PartitionTupleRouting. The map arrays and the partitions array could
differ in size despite your comment that claimed
child_parent_tupconv_maps was the same size as 'partitions' when
non-NULL. The map arrays being a different size than the partitions
array caused the following two cases to segfault. I've included two
cases as two separate bugs caused them.

-- case 1
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 partition of listp for values in (1);
create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 (b int, a int);

alter table listp attach partition listp9 for values in(9);

insert into listp select generate_series(1,9);

-- case 2
drop table listp;
create table listp (a int, b int) partition by list (a);
create table listp1 (b int, a int);

alter table listp attach partition listp1 for values in(1);

create table listp2 partition of listp for values in (2);
create table listp3 partition of listp for values in (3);
create table listp4 partition of listp for values in (4);
create table listp5 partition of listp for values in (5);
create table listp6 partition of listp for values in (6);
create table listp7 partition of listp for values in (7);
create table listp8 partition of listp for values in (8);
create table listp9 partition of listp for values in (9);

insert into listp select generate_series(1,9);

3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
this to remove references to UPDATE in order to make it more friendly
towards other possible future node types that it would get used for
(e.g. MERGE). In the end, I found that performance could regress in
cases like:

drop table listp;
create table listp (a int) partition by list(a);
\o /dev/null
\timing off
select 'create table listp'||x::Text||' partition of listp for values
in('||x::Text||');' from generate_series(1,1000) x;
\gexec
\o
insert into listp select x from generate_series(1,999) x;
\timing on
update listp set a = a+1;

It's true that planner performance for UPDATEs with a large number of
subplans is quite terrible today, but this code made the performance
of planning+execution a bit worse. If we get around to fixing the
inheritance planner then I think
ExecUseUpdateResultRelForRouting() could easily appear in profiles.

I ended up rewriting it to be called just once, building a hash table
keyed by Oid that stores each subplan's ResultRelInfo pointer (see the
sketch after this list). This also gets rid of the slow nested loop in
the cleanup operation inside ExecCleanupTupleRouting().

4) Did some tuning work in ExecFindPartition() getting rid of a
redundant check after the loop completion. Also added some likely()
and unlikely() decorations around some conditions.

5) Updated some newly out-dated comments since your patch in execPartition.h.

6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
palloc(), updating the few fields that were not initialised. This
might save a few TPS (at least once we get rid of the locking of all
partitions) in the single-row INSERT case, but I've not tested the
performance of this yet.

7) Also moved and edited some comments above
ExecSetupPartitionTupleRouting() that I felt explained a little too
much about some internal implementation details.
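
For reference, the hash table in (3) is just a dynahash keyed by the
partition Oid. Since dynahash requires the key to be the first field
of each entry, the entry struct wraps the Oid and the ResultRelInfo
pointer; roughly (sketch only, abbreviated from the attached patch):

typedef struct SubplanResultRelHashElem
{
	Oid			relid;			/* hash key -- must be first */
	ResultRelInfo *rri;
} SubplanResultRelHashElem;

	HASHCTL		ctl;
	HTAB	   *htab;
	SubplanResultRelHashElem *elem;
	bool		found;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(SubplanResultRelHashElem);
	ctl.hcxt = CurrentMemoryContext;
	htab = hash_create("subplan result rels", nsubplans, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

	/* insert a subplan's ResultRelInfo */
	elem = hash_search(htab, &partoid, HASH_ENTER, &found);
	elem->rri = rri;

	/* lookup during tuple routing */
	elem = hash_search(htab, &partoid, HASH_FIND, NULL);
	if (elem != NULL)
		rri = elem->rri;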

One thing that I thought of, but didn't do, was just having
ExecFindPartition() return the ResultRelInfo. I think it would be much
nicer in both call sites to not have to check the ->partitions array
to get that. The copy.c call site would need a few modifications
around the detection code to see if the partition has changed, but it
all looks quite possible to change. I left this for now as I have
another patch which touches all that code that I feel is closer to
commit than this patch is.
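
(For illustration only, the call sites might then become something
like:

	resultRelInfo = ExecFindPartition(mtstate, rootResultRelInfo,
									  proute, slot, estate);

rather than indexing into proute->partitions afterwards; the variable
names here are approximate.)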

I've attached a delta of the changes I made since your v2 delta and
also a complete updated patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
From 93964b65609ed406333aa073e8b4eac72981b45a Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v2] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as the partdesc.

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent/child translation maps array is now only
performed when we need to store the first translation map.  If the column
order between the parent and its children is the same, then no map ever
needs to be stored, and this (possibly large) array did nothing.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to skip over an array which often contained mostly
NULLs.  Performance of this has now improved as the array we loop over no
longer contains NULL values that have to be skipped.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c            |  19 +-
 src/backend/executor/execPartition.c   | 744 +++++++++++++++++++--------------
 src/backend/executor/nodeModifyTable.c | 102 +----
 src/backend/utils/cache/partcache.c    |  11 +-
 src/include/catalog/partition.h        |   6 +-
 src/include/executor/execPartition.h   | 159 ++++---
 6 files changed, 555 insertions(+), 486 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..44cf3bba12 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2621,10 +2621,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2644,15 +2642,8 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
+			Assert(proute->partitions[leaf_part_index] != NULL);
 			resultRelInfo = proute->partitions[leaf_part_index];
-			if (resultRelInfo == NULL)
-			{
-				resultRelInfo = ExecInitPartitionInfo(mtstate,
-													  saved_resultRelInfo,
-													  proute, estate,
-													  leaf_part_index);
-				Assert(resultRelInfo != NULL);
-			}
 
 			/*
 			 * For ExecInsertIndexTuples() to work on the partition's indexes
@@ -2693,7 +2684,9 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+												proute->parent_child_tupconv_maps[leaf_part_index] :
+												NULL,
 											  tuple,
 											  proute->partition_tuple_slot,
 											  &slot);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7a4665cc4e..d7b18f52ed 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,19 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
+
+/* Entry in the hash table of subplan result rels, keyed by relation OID */
+typedef struct SubplanResultRelHashElem
+{
+	Oid			relid;			/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+						Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,134 +70,107 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a single partition.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective arrays.
+	 * More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * We're certain to need just one PartitionDispatch: the one for the
+	 * partitioned table that is the target of the command.  We'll only set
+	 * up PartitionDispatches for subpartitions if tuples actually get routed
+	 * to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+			palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+			palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->child_parent_map_not_required = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass NULL
+	 * for the parent, as we don't need to care about any parent of the
+	 * target partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If UPDATE needs to do tuple routing, we'll need a slot that will
+	 * transiently store the tuple being routed using the root parent's
+	 * rowtype.  We must set up at least this slot, because it's needed even
+	 * before tuple routing begins.  Other necessary information is set up
+	 * by ExecHashSubPlanResultRelsByOid, called below, which hashes the
+	 * UPDATE subplans' result rels by Oid so they can be reused.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_partition_table = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
+	int			result = -1;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
@@ -211,7 +192,7 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		PartitionDesc partdesc;
 		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = parent->reldesc;
 		partdesc = RelationGetPartitionDesc(rel);
@@ -242,81 +223,226 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(parent, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(rel, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(rel, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
+			/*
+			 * Get the index into the PartitionTupleRouting->partitions array
+			 * for this leaf partition.  This may require building a new
+			 * ResultRelInfo.
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(parent->indexes[partidx] < proute->num_partitions);
+				result = parent->indexes[partidx];
+			}
+			else
+			{
+				if (proute->subplan_partition_table)
+				{
+					SubplanResultRelHashElem *elem;
+					Oid			partoid = partdesc->oids[partidx];
+
+					elem = (SubplanResultRelHashElem *)
+						hash_search(proute->subplan_partition_table,
+									&partoid, HASH_FIND, NULL);
+
+					if (elem != NULL)
+					{
+						result = proute->num_partitions++;
+						parent->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = elem->rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   parent, partidx);
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
-		else if (parent->indexes[cur_index] >= 0)
+		else
 		{
-			result = parent->indexes[cur_index];
-			break;
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(parent->indexes[partidx] < proute->num_dispatch);
+				parent = pd[parent->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subparent;
+
+				subparent = ExecInitPartitionDispatchInfo(proute,
+													partdesc->oids[partidx],
+													parent, partidx);
+				Assert(parent->indexes[partidx] >= 0 &&
+					   parent->indexes[partidx] < proute->num_dispatch);
+				parent = subparent;
+			}
 		}
-		else
-			parent = pd[-parent->indexes[cur_index]];
 	}
+}
 
-	/* A partition was not found. */
-	if (result < 0)
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo  *subplan_result_rels;
+	HASHCTL			ctl;
+	HTAB		   *htab;
+	int				nsubplans;
+	int				i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_partition_table = htab;
+
+	/* Hash all subplans by Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
+
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
+		Assert(!found);
+		elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple
+		 * to be compatible with the root partitioned table's tuple
+		 * descriptor.  When generating the per-subplan result rels,
+		 * this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
-	return result;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int		new_size = proute->partitions_allocsize * 2;
+	int		old_size = proute->partitions_allocsize;
+
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_map_not_required != NULL)
+	{
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(&proute->child_parent_map_not_required[old_size], 0,
+			   sizeof(bool) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot of proute's 'partitions' array,
+ *		returning the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch parent, int partidx)
 {
+	Oid			partoid = parent->partdesc->oids[partidx];
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(partoid, NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -492,15 +618,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	parent->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -513,7 +649,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -526,7 +662,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -540,7 +676,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -550,8 +686,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = proute->parent_child_tupconv_maps ?
+				proute->parent_child_tupconv_maps[part_result_rel_index] :
+				NULL;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -560,7 +702,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -651,12 +793,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -671,6 +810,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -681,10 +821,24 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		int		new_size;
+
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			new_size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * new_size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -699,6 +853,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	partRelInfo->ri_PartitionReadyForRouting = true;
 }
 
+/*
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
+{
+	Relation	rel;
+	TupleDesc	tupdesc;
+	PartitionDesc partdesc;
+	PartitionKey partkey;
+	PartitionDispatch pd;
+	int			dispatchidx;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	tupdesc = RelationGetDescr(rel);
+	partdesc = RelationGetPartitionDesc(rel);
+	partkey = RelationGetPartitionKey(rel);
+
+	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+	pd->reldesc = rel;
+	pd->key = partkey;
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+				convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+									   tupdesc,
+									   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
+
+	pd->indexes = (int *) palloc(sizeof(int) * partdesc->nparts);
+
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
+
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	return pd;
+}
+
 /*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
@@ -711,19 +947,22 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 void
 ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 {
+	int			size;
+
 	Assert(proute != NULL);
 
+	size = proute->partitions_allocsize;
+
 	/*
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
 	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
 
 	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
+															 size);
 }
 
 /*
@@ -734,15 +973,15 @@ TupleConversionMap *
 TupConvMapForLeaf(PartitionTupleRouting *proute,
 				  ResultRelInfo *rootRelInfo, int leaf_index)
 {
-	ResultRelInfo **resultRelInfos = proute->partitions;
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/* If nobody else set up the per-leaf maps array, do so ourselves. */
+	if (proute->child_parent_tupconv_maps == NULL)
+		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
+	else if (proute->child_parent_map_not_required[leaf_index])
 		return NULL;
 
 	/* If we've already got a map, return it. */
@@ -751,13 +990,16 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 		return *map;
 
 	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
+	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
 	*map =
 		convert_tuples_by_name(tupdesc,
 							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
 
-	/* If it turns out no map is needed, remember for next time. */
+	/*
+	 * If it turns out no map is needed, remember that so we don't try making
+	 * one again next time.
+	 */
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
@@ -805,7 +1047,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -826,10 +1067,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -838,21 +1075,20 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans,
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (proute->subplan_partition_table)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(proute->subplan_partition_table, &partoid,
+							   HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -866,144 +1102,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f535762e2d..6e0c7862dc 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1666,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,12 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
+	Assert(proute->partitions[partidx] != NULL);
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1789,7 +1779,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+								proute->parent_child_tupconv_maps[partidx] :
+								NULL,
 							  tuple,
 							  proute->partition_tuple_slot,
 							  &slot);
@@ -1828,17 +1820,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1860,79 +1841,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..aa82aa52eb 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record whether the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..4cc7508067 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of length 'nparts' containing
+								 * partition OIDs in order of their
+								 * bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * a partition is a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..1b421f2ec5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -50,66 +54,106 @@ typedef struct PartitionDispatchData
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ *	partition_root			Root table, that is, the table mentioned in the
+ *							command.
+ *
+ *	partition_dispatch_info	Contains PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the root partitioned table is *always*
+ *							present as the first entry of this array.
+ *
+ *	num_dispatch			The number of existing entries; this also serves
+ *							as the index of the next entry to be allocated
+ *							and placed in 'partition_dispatch_info'.
+ *
+ *	dispatch_allocsize		(>= 'num_dispatch') is the number of entries that
+ *							can be stored in 'partition_dispatch_info' before
+ *							needing to reallocate more space.
+ *
+ *	partitions				Contains pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some
+ *							of these are pointers to "reused" ResultRelInfos,
+ *							that is, those that are created and destroyed
+ *							outside execPartition.c, for example, when tuple
+ *							routing is used for UPDATE queries that modify
+ *							the partition key.  The rest are pointers to
+ *							ResultRelInfos managed by execPartition.c itself.
+ *
+ *	num_partitions			The number of existing entries; this also serves
+ *							as the index of the next entry to be allocated
+ *							and placed in 'partitions'
+ *
+ *	partitions_allocsize	(>= 'num_partitions') is the number of entries
+ *							that can be stored in 'partitions',
+ *							'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'child_parent_map_not_required' arrays before
+ *							needing to reallocate more space
+ *
+ *	parent_child_tupconv_maps	Contains information to convert tuples of the
+ *							root parent's rowtype to those of the leaf
+ *							partitions' rowtype, but only for those partitions
+ *							whose TupleDescs are physically different from the
+ *							root parent's.  If none of the partitions has such
+ *							a differing TupleDesc, then it's NULL.  If
+ *							non-NULL, is of the same size as 'partitions', to
+ *							be able to use the same array index.  Also, there
+ *							need not be more of these maps than there are
+ *							partitions that were touched.
+ *
+ *	partition_tuple_slot	This is a tuple slot used to store a tuple using
+ *							rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtype.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ *	child_parent_tupconv_maps	Information to convert tuples of the leaf
+ *							partitions' rowtype to the root parent's rowtype.
+ *							These are needed by transition table machinery
+ *							when storing tuples of partition's rowtype into
+ *							the transition table that can only store tuples of
+ *							the root parent's rowtype.  Like
+ *							'parent_child_tupconv_maps' it remains NULL if
+ *							none of the partitions selected by tuple routing
+ *							needed a conversion map.  Also, if non-NULL, is of
+ *							the same size as 'partitions'.
+ *
+ *	child_parent_map_not_required	Stores whether we can skip the conversion
+ *							map for a partition so that TupConvMapForLeaf
+ *							can return without having to re-check if it needs
+ *							to build a map.
+ *
+ *	subplan_partition_table	Hash table of subplan ResultRelInfos by Oid.
+ *
+ *	root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, if the
+ *							two rowtypes differ.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
+
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_partition_table;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
 /*-----------------------
@@ -186,14 +230,15 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
-- 
2.16.2.windows.1
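
(To make the 'indexes' encoding documented in execPartition.h above
concrete, here is a standalone toy sketch.  It is not part of the patch,
and all of its names are made up.)

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of a PartitionDispatch's 'indexes' array.  is_leaf comes from
 * the PartitionDesc; indexes[i] stays -1 until something is allocated, and
 * otherwise is an offset into either proute->partitions (leaf partition)
 * or proute->partition_dispatch_info (sub-partitioned table).
 */
typedef struct ToyDispatch
{
	int			nparts;
	bool	   *is_leaf;
	int		   *indexes;
} ToyDispatch;

static void
describe(const ToyDispatch *pd, int i)
{
	if (pd->indexes[i] < 0)
		printf("partition %d: nothing allocated yet\n", i);
	else if (pd->is_leaf[i])
		printf("partition %d: leaf, proute->partitions[%d]\n",
			   i, pd->indexes[i]);
	else
		printf("partition %d: proute->partition_dispatch_info[%d]\n",
			   i, pd->indexes[i]);
}

int
main(void)
{
	bool		is_leaf[] = {true, false, true};
	int			indexes[] = {0, 1, -1};	/* third never routed to */
	ToyDispatch pd = {3, is_leaf, indexes};

	for (int i = 0; i < pd.nparts; i++)
		describe(&pd, i);
	return 0;
}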

v2_insert_speedups_delta.patchapplication/octet-stream; name=v2_insert_speedups_delta.patchDownload
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 23c766b5fc..342cf9b4f1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -32,18 +32,18 @@
 #include "utils/ruleutils.h"
 
 #define PARTITION_ROUTING_INITSIZE	8
-#define PARTITION_ROUTING_MAXSIZE	UINT_MAX
 
-static int ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
-								 PartitionTupleRouting *proute,
-								 PartitionDispatch pd, int partidx);
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
 static int ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
 					  EState *estate,
 					  PartitionDispatch parent, int partidx);
 static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
-						Oid partoid, PartitionDispatch parent_pd, int part_index);
+						Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -70,22 +70,9 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * This is called during the initialization of a COPY FROM command or of a
- * INSERT/UPDATE query.  We provisionally allocate space to hold
- * PARTITION_ROUTING_INITSIZE number of PartitionDispatch and ResultRelInfo
- * pointers in their respective arrays.  The arrays will be doubled in
- * size via repalloc (subject to the limit of PARTITION_ROUTING_MAXSIZE
- * entries  at most) if and when we run out of space, as more partitions need
- * to be added.  Since we already have the root parent open, its
- * PartitionDispatch is created here.
- *
- * PartitionDispatch object of a non-root partitioned table or ResultRelInfo
- * of a leaf partition is allocated and added to the respective array when
- * it is encountered for the first time in ExecFindPartition.  As mentioned
- * above, we might need to expand the respective array before storing it.
- *
- * Tuple conversion maps (either child to parent and/or vice versa) and the
- * array(s) to hold them are allocated only if needed.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
@@ -96,24 +83,47 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
 
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a single partition.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective arrays.
+	 * More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * We're certain to need just one PartitionDispatch: the one for the
+	 * partitioned table that is the target of the command.  We'll only set
+	 * up PartitionDispatch objects for subpartitions if tuples actually get
+	 * routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
 	proute->partition_root = rel;
-	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 	proute->partition_dispatch_info = (PartitionDispatchData **)
 			palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+			palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->child_parent_map_not_required = NULL;
 
 	/*
-	 * Initialize this table's PartitionDispatch object.  Since the root
-	 * parent doesn't itself have any parent, last two parameters are
-	 * not used.
+	 * Initialize this table's PartitionDispatch object.  Here we pass NULL
+	 * for the parent, as we don't need to care about any parent of the
+	 * target partitioned table.
 	 */
 	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
 										 0);
-	proute->num_dispatch = 1;
-	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
-	proute->partitions = (ResultRelInfo **)
-			palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
-	proute->num_partitions = 0;
 
 	/*
 	 * If UPDATE needs to do tuple routing, we'll need a slot that will
@@ -124,18 +134,16 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 * ExecUseUpdateResultRelForRouting.
 	 */
 	if (node && node->operation == CMD_UPDATE)
+	{
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
 		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
 	else
 	{
+		proute->subplan_partition_table = NULL;
 		proute->root_tuple_slot = NULL;
-		proute->subplan_partition_offsets = NULL;
-		proute->num_subplan_partition_offsets = 0;
 	}
 
-	/* We only allocate this when we need to store the first non-NULL map */
-	proute->parent_child_tupconv_maps = NULL;
-	proute->child_parent_tupconv_maps = NULL;
-
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
 	 * given partition's rowtype.
@@ -185,7 +193,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 		PartitionDesc partdesc;
 		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = parent->reldesc;
 		partdesc = RelationGetPartitionDesc(rel);
@@ -216,146 +224,143 @@ ExecFindPartition(ModifyTableState *mtstate,
 		FormPartitionKeyDatum(parent, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (partdesc->nparts == 0)
-			break;
-
-		cur_index = get_partition_for_tuple(rel, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-			break;
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(rel, values, isnull)) < 0)
+		{
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		}
 
-		if (partdesc->is_leaf[cur_index])
+		if (partdesc->is_leaf[partidx])
 		{
-			/* Get the ResultRelInfo of this leaf partition. */
-			if (parent->indexes[cur_index] >= 0)
+			/*
+			 * Get the index into the PartitionTupleRouting->partitions array
+			 * for this leaf partition.  This may require building a new
+			 * ResultRelInfo.
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
 			{
-				/*
-				 * Already assigned (either created fresh or reused from the
-				 * set of UPDATE result rels.)
-				 */
-				Assert(parent->indexes[cur_index] < proute->num_partitions);
-				result = parent->indexes[cur_index];
+				/* ResultRelInfo already built */
+				Assert(parent->indexes[partidx] < proute->num_partitions);
+				result = parent->indexes[partidx];
 			}
-			else if (node && node->operation == CMD_UPDATE)
+			else
 			{
-				/* Try to assign an existing result rel for tuple routing. */
-				result = ExecUseUpdateResultRelForRouting(mtstate, proute,
-														  parent, cur_index);
+				if (proute->subplan_partition_table)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
 
-				/* We may not really have found one. */
-				Assert(result < 0 ||
-					   parent->indexes[cur_index] < proute->num_partitions);
-			}
+					rri = hash_search(proute->subplan_partition_table,
+									  &partoid, HASH_FIND, NULL);
 
-			/* We need to create one afresh. */
-			if (result < 0)
-			{
-				result = ExecInitPartitionInfo(mtstate, resultRelInfo,
-											   proute, estate,
-											   parent, cur_index);
-				Assert(result >= 0 && result < proute->num_partitions);
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						parent->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   parent, partidx);
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
 			}
-			break;
+
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* Get the PartitionDispatch of this parent. */
-			if (parent->indexes[cur_index] >= 0)
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
 			{
-				/* Already allocated. */
-				Assert(parent->indexes[cur_index] < proute->num_dispatch);
-				parent = pd[parent->indexes[cur_index]];
+				/* Already built. */
+				Assert(parent->indexes[partidx] < proute->num_dispatch);
+				parent = pd[parent->indexes[partidx]];
 			}
 			else
 			{
-				/* Not yet, allocate one. */
-				PartitionDispatch new_parent;
-
-				new_parent =
-					ExecInitPartitionDispatchInfo(proute,
-												  partdesc->oids[cur_index],
-												  parent, cur_index);
-				Assert(parent->indexes[cur_index] >= 0 &&
-					   parent->indexes[cur_index] < proute->num_dispatch);
-				parent = new_parent;
+				/* Not yet built. Do that now. */
+				PartitionDispatch subparent;
+
+				subparent = ExecInitPartitionDispatchInfo(proute,
+													partdesc->oids[partidx],
+													parent, partidx);
+				Assert(parent->indexes[partidx] >= 0 &&
+					   parent->indexes[partidx] < proute->num_dispatch);
+				parent = subparent;
 			}
 		}
 	}
-
-	/* A partition was not found. */
-	if (result < 0)
-	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
-	}
-
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
-	return result;
 }
 
 /*
- * ExecUseUpdateResultRelForRouting
- *		Checks if any of the ResultRelInfo's created by ExecInitModifyTable
- *		belongs to the passed in partition, and if so, stores its pointer in
- *		in proute so that it can be used as the target of tuple routing
- *
- * Return value is the index at which the found result rel is stored in proute
- * or -1 if none found.
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
  */
-static int
-ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
-								 PartitionTupleRouting *proute,
-								 PartitionDispatch pd,
-								 int partidx)
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
 {
-	Oid				partoid = pd->partdesc->oids[partidx];
 	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
-	ResultRelInfo  *update_result_rels = NULL;
-	int				num_update_result_rels = 0;
+	ResultRelInfo  *subplan_result_rels;
+	HASHCTL			ctl;
+	HTAB		   *htab;
+	int				nsubplans;
 	int				i;
-	int				part_result_rel_index = -1;
 
-	update_result_rels = mtstate->resultRelInfo;
-	num_update_result_rels = list_length(node->plans);
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* If here for the first time, initialize necessary info in proute. */
-	if (proute->subplan_partition_offsets == NULL)
-	{
-		proute->subplan_partition_offsets =
-				palloc(num_update_result_rels * sizeof(int));
-		memset(proute->subplan_partition_offsets, -1,
-			   num_update_result_rels * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_result_rels;
-	}
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/*
-	 * Go through UPDATE result rels and save the pointers of those that
-	 * belong to this table's partitions in proute.
-	 */
-	for (i = 0; i < num_update_result_rels; i++)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_partition_table = htab;
+
+	/* Hash all subplans by Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		ResultRelInfo *update_result_rel = &update_result_rels[i];
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
 
-		if (partoid != RelationGetRelid(update_result_rel->ri_RelationDesc))
-			continue;
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+												   &found);
 
-		/* Found it. */
+		if (!found)
+			*subplanrri = rri;
 
 		/*
 		 * This is required in order to convert the partition's tuple
@@ -363,59 +368,69 @@ ExecUseUpdateResultRelForRouting(ModifyTableState *mtstate,
 		 * descriptor.  When generating the per-subplan result rels,
 		 * this was not set.
 		 */
-		update_result_rel->ri_PartitionRoot = proute->partition_root;
+		rri->ri_PartitionRoot = proute->partition_root;
+	}
+}
 
-		/*
-		 * Remember the index of this UPDATE result rel in the tuple
-		 * routing partition array.
-		 */
-		proute->subplan_partition_offsets[i] = proute->num_partitions;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int		new_size = proute->partitions_allocsize * 2;
+	int		old_size = proute->partitions_allocsize;
 
-		/*
-		 * Also, record in PartitionDispatch that we have a valid
-		 * ResultRelInfo for this partition.
-		 */
-		Assert(pd->indexes[partidx] == -1);
-		part_result_rel_index = proute->num_partitions++;
-		if (part_result_rel_index >= PARTITION_ROUTING_MAXSIZE)
-			elog(ERROR, "invalid partition index: %u", part_result_rel_index);
-		pd->indexes[partidx] = part_result_rel_index;
-		if (part_result_rel_index >= proute->partitions_allocsize)
-		{
-			/* Expand allocated place. */
-			proute->partitions_allocsize =
-				Min(proute->partitions_allocsize * 2,
-					PARTITION_ROUTING_MAXSIZE);
-			proute->partitions = (ResultRelInfo **)
-				repalloc(proute->partitions,
-						 sizeof(ResultRelInfo *) *
-								proute->partitions_allocsize);
-		}
-		proute->partitions[part_result_rel_index] = update_result_rel;
-		break;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
 	}
 
-	return part_result_rel_index;
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_map_not_required != NULL)
+	{
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(&proute->child_parent_map_not_required[old_size], 0,
+			   sizeof(bool) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * This also stores it in the proute->partitions array at the next
- * available index, possibly expanding the array if there isn't any space
- * left in it, and returns the index where it's stored.
+ *		and store it in the next empty slot of 'proute's partitions array,
+ *		returning the index of that element.
  */
 static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
 					  EState *estate,
 					  PartitionDispatch parent, int partidx)
 {
 	Oid			partoid = parent->partdesc->oids[partidx];
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
@@ -605,18 +620,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	}
 
 	part_result_rel_index = proute->num_partitions++;
-	if (part_result_rel_index >= PARTITION_ROUTING_MAXSIZE)
-		elog(ERROR, "invalid partition index: %u", part_result_rel_index);
 	parent->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
 	if (part_result_rel_index >= proute->partitions_allocsize)
-	{
-		/* Expand allocated place. */
-		proute->partitions_allocsize =
-			Min(proute->partitions_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
-		proute->partitions = (ResultRelInfo **)
-			repalloc(proute->partitions,
-					 sizeof(ResultRelInfo *) * proute->partitions_allocsize);
-	}
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
 
 	/* Set up information needed for routing tuples to the partition. */
 	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
@@ -639,7 +650,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -652,7 +663,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -666,7 +677,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -683,7 +694,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				NULL;
 
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -692,7 +703,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -783,9 +794,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	/* Save here for later use. */
-	proute->partitions[part_result_rel_index] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
 	return part_result_rel_index;
@@ -825,21 +833,10 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		/* Allocate parent child map array only if we need to store a map */
 		if (proute->parent_child_tupconv_maps == NULL)
 		{
-			proute->parent_child_tupconv_maps_allocsize = new_size =
-				PARTITION_ROUTING_INITSIZE;
+			new_size = proute->partitions_allocsize;
 			proute->parent_child_tupconv_maps = (TupleConversionMap **)
 				palloc0(sizeof(TupleConversionMap *) * new_size);
 		}
-		/* We may have ran out of the initially allocated space. */
-		else if (partidx >= proute->parent_child_tupconv_maps_allocsize)
-		{
-			proute->parent_child_tupconv_maps_allocsize = new_size =
-				Min(proute->parent_child_tupconv_maps_allocsize * 2,
-					PARTITION_ROUTING_MAXSIZE);
-			proute->parent_child_tupconv_maps = (TupleConversionMap **)
-				repalloc( proute->parent_child_tupconv_maps,
-						 sizeof(TupleConversionMap *) * new_size);
-		}
 
 		proute->parent_child_tupconv_maps[partidx] = map;
 	}
@@ -867,7 +864,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
  */
 static PartitionDispatch
 ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
-							  PartitionDispatch parent_pd, int part_index)
+							  PartitionDispatch parent_pd, int partidx)
 {
 	Relation	rel;
 	TupleDesc	tupdesc;
@@ -921,15 +918,12 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
 	dispatchidx = proute->num_dispatch++;
-	if (dispatchidx >= PARTITION_ROUTING_MAXSIZE)
-		elog(ERROR, "invalid partition index: %u", dispatchidx);
 	if (parent_pd)
-		parent_pd->indexes[part_index] = dispatchidx;
+		parent_pd->indexes[partidx] = dispatchidx;
 	if (dispatchidx >= proute->dispatch_allocsize)
 	{
 		/* Expand allocated space. */
-		proute->dispatch_allocsize =
-			Min(proute->dispatch_allocsize * 2, PARTITION_ROUTING_MAXSIZE);
+		proute->dispatch_allocsize *= 2;
 		proute->partition_dispatch_info = (PartitionDispatchData **)
 			repalloc(proute->partition_dispatch_info,
 					 sizeof(PartitionDispatchData *) *
@@ -954,20 +948,22 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 void
 ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 {
+	int			size;
+
 	Assert(proute != NULL);
 
+	size = proute->partitions_allocsize;
+
 	/*
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
-	proute->child_parent_tupconv_maps_allocsize = PARTITION_ROUTING_INITSIZE;
 	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										PARTITION_ROUTING_INITSIZE);
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
 
 	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * PARTITION_ROUTING_INITSIZE);
+	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
+															 size);
 }
 
 /*
@@ -978,7 +974,6 @@ TupleConversionMap *
 TupConvMapForLeaf(PartitionTupleRouting *proute,
 				  ResultRelInfo *rootRelInfo, int leaf_index)
 {
-	ResultRelInfo **resultRelInfos = proute->partitions;
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
@@ -987,7 +982,7 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
+	else if (proute->child_parent_map_not_required[leaf_index])
 		return NULL;
 
 	/* If we've already got a map, return it. */
@@ -996,37 +991,16 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 		return *map;
 
 	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
+	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
 	*map =
 		convert_tuples_by_name(tupdesc,
 							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
 
-	/* If it turns out no map is needed, remember for next time. */
-
-	/* We may have run out of the initially allocated space. */
-	if (leaf_index >= proute->child_parent_tupconv_maps_allocsize)
-	{
-		int		new_size,
-				old_size;
-
-		old_size = proute->child_parent_tupconv_maps_allocsize;
-		proute->child_parent_tupconv_maps_allocsize = new_size =
-			Min(proute->parent_child_tupconv_maps_allocsize * 2,
-				PARTITION_ROUTING_MAXSIZE);
-		proute->child_parent_tupconv_maps = (TupleConversionMap **)
-			repalloc(proute->child_parent_tupconv_maps,
-					 sizeof(TupleConversionMap *) * new_size);
-		memset(proute->child_parent_tupconv_maps + old_size, 0,
-			   sizeof(TupleConversionMap *) * (new_size - old_size));
-
-		proute->child_parent_map_not_required = (bool *)
-			repalloc(proute->child_parent_map_not_required,
-					 sizeof(bool) * new_size);
-		memset(proute->child_parent_map_not_required + old_size, false,
-			   sizeof(bool) * (new_size - old_size));
-	}
-
+	/*
+	 * If it turns out no map is needed, remember that so we don't try making
+	 * one again next time.
+	 */
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
@@ -1102,23 +1076,18 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * Check if this result rel is one of UPDATE subplan result rels,
-		 * which if so, let ExecEndPlan() close it.
+		 * Check if this result rel is one belonging to the node's subplans,
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets)
+		if (proute->subplan_partition_table)
 		{
-			int		j;
-			int		found = false;
+			Oid			partoid;
+			bool		found;
 
-			for (j = 0; j < proute->num_subplan_partition_offsets; j++)
-			{
-				if (proute->subplan_partition_offsets[j] == i)
-				{
-					found = true;
-					break;
-				}
-			}
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 
+			(void) hash_search(proute->subplan_partition_table, &partoid,
+							   HASH_FIND, &found);
 			if (found)
 				continue;
 		}
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 8d20469c98..4cc7508067 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -29,8 +29,8 @@ typedef struct PartitionDescData
 	Oid		   *oids;			/* Array of length 'nparts' containing
 								 * partition OIDs in order of their
 								 * bounds */
-	bool	   *is_leaf;		/* Array of length 'nparts' containing whether
-								 * a partition is a leaf partition */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * a partition is a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 91b840e12f..1b421f2ec5 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -55,7 +59,7 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  * partitions
  *
  *	partition_root			Root table, that is, the table mentioned in the
- *							INSERT or UPDATE query or COPY FROM command.
+ *							command.
  *
  *	partition_dispatch_info	Contains PartitionDispatch objects for every
  *							partitioned table touched by tuple routing.  The
@@ -84,8 +88,11 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							placed in 'partitions'
  *
  *	partitions_allocsize	(>= 'num_partitions') is the number of entries
- *							that can be stored in 'partitions' before needing
- *							to reallocate more space
+ *							that can be stored in 'partitions',
+ *							'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'child_parent_map_not_required' arrays before
+ *							needing to reallocate more space
  *
  *	parent_child_tupconv_maps	Contains information to convert tuples of the
  *							root parent's rowtype to those of the leaf
@@ -98,12 +105,8 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							need not be more of these maps than there are
  *							partitions that were touched.
  *
- *	parent_child_tupconv_maps_allocsize		The number of entries that can be
- *							stored in 'parent_child_tupconv_maps' before
- *							needing to reallocate more space
- *
  *	partition_tuple_slot	This is a tuple slot used to store a tuple using
- *							rowtype of the the partition chosen by tuple
+ *							rowtype of the partition chosen by tuple
  *							routing.  Maintained separately because partitions
  *							may have different rowtype.
  *
@@ -111,31 +114,22 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  * do tuple routing.
  *
  *	child_parent_tupconv_maps	Information to convert tuples of the leaf
- *							partitions' rowtype to the the root parent's
- *							rowtype.  These are needed by transition table
- *							machinery when storing tuples of partition's
- *							rowtype into the transition table that can only
- *							store tuples of the root parent's rowtype.
- *							Like 'parent_child_tupconv_maps' it remains NULL
- *							if none of the partitions selected by tuple
- *							routing needed a conversion map.  Also, if non-
- *							NULL, is of the same size as 'partitions'.
+ *							partitions' rowtype to the root parent's rowtype.
+ *							These are needed by transition table machinery
+ *							when storing tuples of partition's rowtype into
+ *							the transition table that can only store tuples of
+ *							the root parent's rowtype.  Like
+ *							'parent_child_tupconv_maps' it remains NULL if
+ *							none of the partitions selected by tuple routing
+ *							needed a conversion map.  Also, if non-NULL, is of
+ *							the same size as 'partitions'.
  *
 *	child_parent_map_not_required	Stores whether we can skip the conversion
  *							map for a partition so that TupConvMapForLeaf
- *							can return quickly if set
+ *							can return without having to re-check if it needs
+ *							to build a map.
  *
- *	child_parent_tupconv_maps_allocsize		The number of entries that can be
- *							stored in 'child_parent_tupconv_maps' before
- *							needing to reallocate more space
- *
- *	subplan_partition_offsets	The following maps indexes of UPDATE result
- *							rels in the per-subplan array to indexes of their
- *							pointers in the 'partitions'
- *
- *	num_subplan_partition_offsets	The number of entries in
- *							'subplan_partition_offsets', which is same as the
- *							number of UPDATE result rels
+ *	subplan_partition_table	Hash table of subplan ResultRelInfos by Oid.
  *
  *	root_tuple_slot			During UPDATE tuple routing, this tuple slot is
  *							used to transiently store a tuple using the root
@@ -151,24 +145,15 @@ typedef struct PartitionTupleRouting
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
 	int			dispatch_allocsize;
-
 	ResultRelInfo **partitions;
 	int			num_partitions;
 	int			partitions_allocsize;
-
 	TupleConversionMap **parent_child_tupconv_maps;
-	int			parent_child_tupconv_maps_allocsize;
-
-	TupleTableSlot *partition_tuple_slot;
-
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
-	int			child_parent_tupconv_maps_allocsize;
-
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-
+	HTAB	   *subplan_partition_table;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
 /*-----------------------
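
(One note on the hash table used above: dynahash expects the lookup key
to be the first field of each entry it stores.  The sketch below shows
the conventional way to build an Oid-to-ResultRelInfo table with an
explicit entry struct; it assumes the backend environment, and the
struct and function names are hypothetical, not the patch's.)

#include "postgres.h"

#include "nodes/execnodes.h"
#include "utils/hsearch.h"
#include "utils/rel.h"

/* Hypothetical entry type; the Oid key must be the first field. */
typedef struct OidToRriEntry
{
	Oid			relid;			/* hash key */
	ResultRelInfo *rri;
} OidToRriEntry;

/* Build a table mapping each rel's Oid to its ResultRelInfo. */
static HTAB *
build_oid_to_rri_table(ResultRelInfo *rels, int nrels)
{
	HASHCTL		ctl;
	HTAB	   *htab;
	int			i;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(OidToRriEntry);
	ctl.hcxt = CurrentMemoryContext;

	htab = hash_create("oid to rri map", nrels, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

	for (i = 0; i < nrels; i++)
	{
		Oid			relid = RelationGetRelid(rels[i].ri_RelationDesc);
		bool		found;
		OidToRriEntry *entry;

		entry = (OidToRriEntry *) hash_search(htab, &relid, HASH_ENTER,
											  &found);
		if (!found)
			entry->rri = &rels[i];
	}

	return htab;
}

Lookups then use HASH_FIND and treat a NULL return as "no subplan for
this partition".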
#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: David Rowley (#15)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

David Rowley <david.rowley@2ndquadrant.com> writes:

1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
useless since the int would have wrapped long before it reached
UINT_MAX. There's no shortage of other code doubling the size of an
array by multiplying it by 2 unconditionally without considering
overflowing an int. Unsure why you considered this more risky.

As long as you're re-palloc'ing the array each time, and not increasing
its size more than 2X, this is perfectly safe because of the 1GB size
limit on palloc requests. You'll fail because of that in the iteration
where the request is between 1GB and 2GB, just before integer overflow
can occur.

(Yes, this is intentional.)

regards, tom lane
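
(To put numbers on that: the standalone sketch below, which assumes
8-byte pointers and approximates palloc's MaxAllocSize limit, shows that
doubling a pointer array from PARTITION_ROUTING_INITSIZE first trips the
allocation limit at 2^27 elements -- a 1GB request -- long before the
int holding the element count could overflow.)

#include <stdint.h>
#include <stdio.h>

#define INITSIZE	8				/* PARTITION_ROUTING_INITSIZE */
#define MAXALLOC	0x3fffffffL		/* approximates palloc's MaxAllocSize */

int
main(void)
{
	int64_t		nelems = INITSIZE;
	int64_t		ptrsize = 8;	/* sizeof a pointer on a 64-bit build */

	/* Double until the repalloc request would exceed the limit. */
	while (nelems * ptrsize <= MAXALLOC)
		nelems *= 2;

	printf("request fails at %lld elements (%lld bytes); INT_MAX is %d\n",
		   (long long) nelems, (long long) (nelems * ptrsize), 2147483647);
	return 0;
}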

#17David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#15)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 27 July 2018 at 04:19, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've attached a delta of the changes I made since your v2 delta and
also a complete updated patch.

I did a very quick performance test of this patch on an AWS m5d.large
instance with fsync=off.

The test setup is the same as is described in my initial email on this thread.

The test compares the performance of INSERTs into a partitioned table
with 10k partitions compared to a non-partitioned table.

Patched with v2 patch on master@39d51fe87

-- partitioned
$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1063764
latency average = 0.056 ms
tps = 17729.375930 (including connections establishing)
tps = 17729.855215 (excluding connections establishing)

-- non-partitioned
$ pgbench -n -T 60 -f partbench__insert.sql postgres
transaction type: partbench__insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 1147273
latency average = 0.052 ms
tps = 19121.194366 (including connections establishing)
tps = 19121.695469 (excluding connections establishing)

Here we're within 92% of the non-partitioned performance (17729 vs
19121 tps).

Looking back at the first email in this thread where I tested the v1
patch, we were within 82% with:

-- partitioned
tps = 11001.602377 (excluding connections establishing)

-- non-partitioned
tps = 13354.656163 (excluding connections establishing)

Again, same as with the v1 test, the v2 test was done with the locking
of all partitions removed with:

diff --git a/src/backend/executor/execPartition.c
b/src/backend/executor/execPartition.c
index d7b18f52ed..6223c62094 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -80,9 +80,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState
*mtstate, Relation rel)
  PartitionTupleRouting *proute;
  ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
- /* Lock all the partitions. */
- (void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
  /*
  * Here we attempt to expend as little effort as possible in setting up
  * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
@@ -442,7 +439,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
  * We locked all the partitions in ExecSetupPartitionTupleRouting
  * including the leaf partitions.
  */
- partrel = heap_open(partoid, NoLock);
+ partrel = heap_open(partoid, RowExclusiveLock);

/*
* Keep ResultRelInfo and other information for this partition in the

Again, the reduced locking is not meant for commit as part of this patch.
Changing the locking will require a discussion on its own thread.

And just for fun, the unpatched performance on the partitioned table:

ubuntu@ip-10-0-0-33:~$ pgbench -n -T 60 -f partbench_insert.sql postgres
transaction type: partbench_insert.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 60 s
number of transactions actually processed: 5751
latency average = 10.434 ms
tps = 95.836052 (including connections establishing)
tps = 95.838490 (excluding connections establishing)

(185x increase)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#18Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#15)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/07/27 1:19, David Rowley wrote:

On 18 July 2018 at 20:29, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Let me know what you think of the code in the updated patch.

Thanks for sending the updated patch.

I looked over it tonight and made a number of changes:

1) Got rid of PARTITION_ROUTING_MAXSIZE. The code using this was
useless since the int would have wrapped long before it reached
UINT_MAX.

Oops, you're right.

There's no shortage of other code doubling the size of an
array by multiplying it by 2 unconditionally without considering
overflowing an int. Unsure why you considered this more risky.

Just ill-informed paranoia on my part. Let's just drop it as you say,
given also Tom's comment that repalloc would fail anyway for requests
over 1GB.
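
For reference, here is a simplified sketch of the doubling pattern in
question (the function name is illustrative; it's modelled on the patch's
ExecExpandRoutingArrays). The palloc request-size limit is what makes an
explicit overflow check unnecessary:

/*
 * palloc/repalloc reject requests larger than MaxAllocSize (just under
 * 1GB), so with 8-byte pointers the repalloc below errors out once
 * new_size reaches 2^27 entries, long before "allocsize * 2" could
 * overflow the int.
 */
static void
ExpandPartitionsArray(PartitionTupleRouting *proute)
{
	int			new_size = proute->partitions_allocsize * 2;

	proute->partitions = (ResultRelInfo **)
		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
	proute->partitions_allocsize = new_size;
}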

2) Fixed a series of bugs regarding the size of the arrays in
PartitionTupleRouting. The map arrays and the partitions array could
differ in size despite your comment that claimed
child_parent_tupconv_maps was the same size as 'partitions' when
non-NULL. The map arrays being a different size than the partitions
array caused the following two cases to segfault. I've included both
cases since two separate bugs caused them.

-- case 1

[ .... ]

-- case 2

Indeed, there were some holes in the logic that led me to come up with
that code.

3) Got rid of ExecUseUpdateResultRelForRouting. I started to change
this to remove references to UPDATE in order to make it more friendly
towards other possible future node types that it would get used for
(aka MERGE). In the end, I found that performance could regress when
in cases like:

drop table listp;
create table listp (a int) partition by list(a);
\o /dev/null
\timing off
select 'create table listp'||x::Text||' partition of listp for values
in('||x::Text||');' from generate_series(1,1000) x;
\gexec
\o
insert into listp select x from generate_series(1,999) x;
\timing on
update listp set a = a+1;

It's true that planner performance for UPDATEs with a large number of
subplans is quite terrible today, but this code made the combined
planning+execution performance a bit worse. If we get around to
fixing the inheritance planner then I think
ExecUseUpdateResultRelForRouting() could easily appear in profiles.

I ended up rewriting it so it's called just once, building a hash table
keyed by partition Oid that stores ResultRelInfo pointers. This also gets rid of the
slow nested loop in the cleanup operation inside
ExecCleanupTupleRouting().

OK, looks neat, although I'd name the hash table subplan_resultrel_hash
(like join_rel_hash in PlannerInfo), instead of subplan_partition_table.
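
For readers following along, the lookup table being described uses
PostgreSQL's dynahash. A minimal sketch of the pattern follows; the entry
struct and function name here are illustrative, not the patch's exact code:

typedef struct SubplanResultRelEntry
{
	Oid			relid;			/* hash key; must be first */
	ResultRelInfo *rri;
} SubplanResultRelEntry;

static HTAB *
BuildSubplanResultRelHash(ResultRelInfo *rels, int nrels)
{
	HASHCTL		ctl;
	HTAB	   *htab;
	int			i;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(SubplanResultRelEntry);
	ctl.hcxt = CurrentMemoryContext;

	/* HASH_BLOBS: hash the Oid key as raw bytes */
	htab = hash_create("subplan result rels", nrels, &ctl,
					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

	for (i = 0; i < nrels; i++)
	{
		Oid			relid = RelationGetRelid(rels[i].ri_RelationDesc);
		bool		found;
		SubplanResultRelEntry *entry;

		entry = (SubplanResultRelEntry *)
			hash_search(htab, &relid, HASH_ENTER, &found);
		if (!found)
			entry->rri = &rels[i];
	}

	return htab;
}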

4) Did some tuning work in ExecFindPartition() getting rid of a
redundant check after the loop completion. Also added some likely()
and unlikely() decorations around some conditions.

All changes seem good.
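
(For reference, likely() and unlikely() are PostgreSQL's wrappers, defined
in c.h, around GCC's __builtin_expect(). A simplified view of how the
patched ExecFindPartition hot path applies them:)

	/* Finding a previously-built ResultRelInfo is the expected case. */
	if (likely(parent->indexes[partidx] >= 0))
	{
		/* ResultRelInfo already built; just fetch its index */
		result = parent->indexes[partidx];
	}
	else
	{
		/* First tuple routed here; build the ResultRelInfo now */
		result = ExecInitPartitionInfo(mtstate, resultRelInfo, proute,
									   estate, parent, partidx);
	}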

5) Updated some newly out-dated comments since your patch in execPartition.h.

6) Replaced the palloc0() in ExecSetupPartitionTupleRouting() with a
palloc(), updating the few fields that were not initialised. This might
save a few TPS (at least once we get rid of locking all the partitions)
in the single-row INSERT case, but I've not tested the performance of
this yet.

7) Also moved and edited some comments above
ExecSetupPartitionTupleRouting() that I felt explained a little too
much about some internal implementation details.

Thanks, changes look good.

One thing that I thought of, but didn't do, was just having
ExecFindPartition() return the ResultRelInfo.
nicer in both call sites to not have to check the ->partitions array
to get that. The copy.c call site would need a few modifications
around the detection code to see if the partition has changed, but it
all looks quite possible to change. I left this for now as I have
another patch which touches all that code that I feel is closer to
commit than this patch is.

I had wondered about that too, but gave up on doing anything about it
because the callers of ExecFindPartition want to access other fields of
PartitionTupleRouting using the returned index. Maybe, we could make it
return a ResultRelInfo * and also return the index itself using a separate
output argument. Seems like a cosmetic improvement that can be made later.
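
Sketched out, that cosmetic improvement might look like this (a
hypothetical signature, not part of any attached patch):

/*
 * Return the ResultRelInfo directly, and hand back the index into
 * proute->partitions through an output argument for callers that
 * still need it.
 */
static ResultRelInfo *
ExecFindPartition(ModifyTableState *mtstate,
				  ResultRelInfo *rootResultRelInfo,
				  PartitionTupleRouting *proute,
				  TupleTableSlot *slot,
				  EState *estate,
				  int *partidx);	/* out */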

I've attached a delta of the changes I made since your v2 delta and
also a complete updated patch.

Thanks. Here are some other minor comments on the complete v2 patch.

-            tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
+            tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
+                                                proute->parent_child_tupconv_maps[leaf_part_index] :
+                                                NULL,

This piece of code that's present in both ExecPrepareTupleRouting and
CopyFrom can be written as:

if (proute->parent_child_tupconv_maps)
    ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
                              tuple,
                              proute->partition_tuple_slot,
                              &slot);

+    /*
+     * If UPDATE needs to do tuple routing, we'll need a slot that will
+     * transiently store the tuple being routed using the root parent's
+     * rowtype.  We must set up at least this slot, because it's needed even
+     * before tuple routing begins.  Other necessary information is
+     * initialized when  tuple routing code calls
+     * ExecUseUpdateResultRelForRouting.
+     */
     if (node && node->operation == CMD_UPDATE)

This comment needs to be updated, because you changed the if block's body as:

+ ExecHashSubPlanResultRelsByOid(mtstate, proute);
proute->root_tuple_slot = MakeTupleTableSlot(NULL);

So, we don't just set up the slot here, we also now set up the hash table
to store sub-plan result rels. Also, ExecUseUpdateResultRelForRouting no
longer exists.

+            /*
+             * Get the index for PartitionTupleRouting->partitions array index
+             * for this leaf partition.  This may require building a new
+             * ResultRelInfo.
+             */

1st sentence reads a bit strange. Did you mean to write the following?

/*
* Get this leaf partition's index in the
* PartitionTupleRouting->partitions array.
* This may require building a new ResultRelInfo.
*/

The following block of code could use a one-line comment describing what's
going on (although, what's going on might be pretty clear to some eyes
just by looking at the code):

else
{
    if (proute->subplan_partition_table)
    {
        ResultRelInfo *rri;
        Oid            partoid = partdesc->oids[partidx];

        rri = hash_search(proute->subplan_partition_table,
                          &partoid, HASH_FIND, NULL);

 /*
+ * ExecInitPartitionDispatchInfo
+ *      Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('dispatchidx'), possibly expanding the array if there
+ * isn't space left in it.
+ */

You renamed 'dispatchidx' to 'partidx' in the function's signature but
forgot to update this comment.

I've attached a delta patch to make the above changes. I'm leaving the
hash table rename up to you though.

Thanks
Amit

Attachments:

v2-delta.patch (text/plain; charset=UTF-8)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 44cf3bba12..d135f858d2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2684,12 +2684,15 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
-												proute->parent_child_tupconv_maps[leaf_part_index] :
-												NULL,
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot);
+			if (proute->parent_child_tupconv_maps)
+			{
+				TupleConversionMap *map =
+						proute->parent_child_tupconv_maps[leaf_part_index];
+
+				tuple = ConvertPartitionTupleSlot(map, tuple,
+												  proute->partition_tuple_slot,
+												  &slot);
+			}
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d7b18f52ed..7661b246e4 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -126,12 +126,15 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 										 0);
 
 	/*
-	 * If UPDATE needs to do tuple routing, we'll need a slot that will
-	 * transiently store the tuple being routed using the root parent's
-	 * rowtype.  We must set up at least this slot, because it's needed even
-	 * before tuple routing begins.  Other necessary information is
-	 * initialized when  tuple routing code calls
-	 * ExecUseUpdateResultRelForRouting.
+	 * If UPDATE needs to do tuple routing, we can reuse partition sub-plan
+	 * result rels after tuple routing, so build a hash table to map the OIDs
+	 * of partitions present in mtstate->resultRelInfo to their
+	 * ResultRelInfos.  Every time a tuple is routed to one of the partitions
+	 * present in mtstate->resultRelInfo, looking its OID up in the hash table
+	 * will give us its ResultRelInfo.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
 	if (node && node->operation == CMD_UPDATE)
 	{
@@ -244,9 +247,9 @@ ExecFindPartition(ModifyTableState *mtstate,
 		if (partdesc->is_leaf[partidx])
 		{
 			/*
-			 * Get the index for PartitionTupleRouting->partitions array index
-			 * for this leaf partition.  This may require building a new
-			 * ResultRelInfo.
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
 			 */
 			if (likely(parent->indexes[partidx] >= 0))
 			{
@@ -256,6 +259,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 			else
 			{
+				/*
+				 * No ResultRelInfo found, so either use one of the
+				 * sub-plan result rels or create a fresh one.
+				 */
 				if (proute->subplan_partition_table)
 				{
 					ResultRelInfo *rri;
@@ -858,8 +865,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
  *		Initialize PartitionDispatch for a partitioned table
  *
  * This also stores it in the proute->partition_dispatch_info array at the
- * specified index ('dispatchidx'), possibly expanding the array if there
- * isn't space left in it.
+ * specified index ('partidx'), possibly expanding the array if there isn't
+ * space left in it.
  */
 static PartitionDispatch
 ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 6e0c7862dc..4f7cea7668 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1779,12 +1779,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
-								proute->parent_child_tupconv_maps[partidx] :
-								NULL,
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot);
+	if (proute->parent_child_tupconv_maps)
+		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+								  tuple, proute->partition_tuple_slot, &slot);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
#19David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#18)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 27 July 2018 at 19:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I've attached a delta patch to make the above changes. I'm leaving the
hash table rename up to you though.

Thanks for the delta patch. I took all of it, just rewrote a comment slightly.

I also renamed the hash table to your suggestion and changed a few more things.

Attached a delta based on v2 and the full v3 patch.

This includes another small change to make
PartitionDispatchData->indexes an array that's allocated in the same
memory as the PartitionDispatchData. This will save a palloc() call
and also should be a bit more cache friendly.
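
Concretely, from the attached patch, 'indexes' becomes a C99 flexible
array member, so a single palloc() covers both the struct and its
per-partition array (other fields elided here):

typedef struct PartitionDispatchData
{
	Relation	reldesc;
	/* ... other fields elided ... */
	int			indexes[FLEXIBLE_ARRAY_MEMBER];
} PartitionDispatchData;

	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
									+ (partdesc->nparts * sizeof(int)));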

This also required a rebase on master again due to 3e32109049.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v2-delta2.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 44cf3bba12..6fc1e2b41c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2684,12 +2684,15 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
-												proute->parent_child_tupconv_maps[leaf_part_index] :
-												NULL,
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot);
+			if (proute->parent_child_tupconv_maps)
+			{
+				TupleConversionMap *map =
+				proute->parent_child_tupconv_maps[leaf_part_index];
+
+				tuple = ConvertPartitionTupleSlot(map, tuple,
+												  proute->partition_tuple_slot,
+												  &slot);
+			}
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 24a9d6b426..2a18a30b3e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -33,8 +33,7 @@
 
 #define PARTITION_ROUTING_INITSIZE	8
 
-static void
-ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 							   PartitionTupleRouting *proute);
 static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
 static int ExecInitPartitionInfo(ModifyTableState *mtstate,
@@ -43,7 +42,7 @@ static int ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  EState *estate,
 					  PartitionDispatch parent, int partidx);
 static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
-						Oid partoid, PartitionDispatch parent_pd, int partidx);
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -91,24 +90,23 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 * single tuple into a single partition.
 	 *
 	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
-	 * PartitionDispatch and ResultRelInfo pointers in their respective arrays.
-	 * More space can be allocated later, if required via
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays. More space can be allocated later, if required via
 	 * ExecExpandRoutingArrays.
 	 *
-	 * We're certain to only need just 1 PartitionDispatch; the one for the
-	 * partitioned table which is the target of the command.  We'll only setup
-	 * PartitionDispatchs for any subpartitions if tuples actually get routed
-	 * to (through) them.
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be setup, but any sub-partitioned tables can be setup lazily as
+	 * and when the tuples get routed to (through) them.
 	 */
 	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
 	proute->partition_root = rel;
 	proute->partition_dispatch_info = (PartitionDispatchData **)
-			palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
 	proute->num_dispatch = 0;
 	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
 	proute->partitions = (ResultRelInfo **)
-			palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 	proute->num_partitions = 0;
 	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
@@ -118,20 +116,23 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	proute->child_parent_map_not_required = NULL;
 
 	/*
-	 * Initialize this table's PartitionDispatch object.  Here we pass in
-	 * the parent is NULL as we don't need to care about any parent of the
-	 * target partitioned table.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent is NULL as we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
 	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
 										 0);
 
 	/*
-	 * If UPDATE needs to do tuple routing, we'll need a slot that will
-	 * transiently store the tuple being routed using the root parent's
-	 * rowtype.  We must set up at least this slot, because it's needed even
-	 * before tuple routing begins.  Other necessary information is
-	 * initialized when  tuple routing code calls
-	 * ExecUseUpdateResultRelForRouting.
+	 * If UPDATE needs to do tuple routing, we can reuse partition sub-plan
+	 * result rels after tuple routing, so build a hash table to map the OIDs
+	 * of partitions present in mtstate->resultRelInfo to their
+	 * ResultRelInfos.  Every time a tuple is routed to one of the partitions
+	 * present in mtstate->resultRelInfo, looking its OID up in the hash table
+	 * will give us its ResultRelInfo.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
 	if (node && node->operation == CMD_UPDATE)
 	{
@@ -140,7 +141,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	}
 	else
 	{
-		proute->subplan_partition_table = NULL;
+		proute->subplan_resultrel_hash = NULL;
 		proute->root_tuple_slot = NULL;
 	}
 
@@ -175,7 +176,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch parent;
-	PartitionDesc	partdesc;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
@@ -244,9 +245,9 @@ ExecFindPartition(ModifyTableState *mtstate,
 		if (partdesc->is_leaf[partidx])
 		{
 			/*
-			 * Get the index for PartitionTupleRouting->partitions array index
-			 * for this leaf partition.  This may require building a new
-			 * ResultRelInfo.
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
 			 */
 			if (likely(parent->indexes[partidx] >= 0))
 			{
@@ -256,12 +257,17 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 			else
 			{
-				if (proute->subplan_partition_table)
+				/*
+				 * A ResultRelInfo has not been setup for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
 				{
 					ResultRelInfo *rri;
 					Oid			partoid = partdesc->oids[partidx];
 
-					rri = hash_search(proute->subplan_partition_table,
+					rri = hash_search(proute->subplan_resultrel_hash,
 									  &partoid, HASH_FIND, NULL);
 
 					if (rri)
@@ -308,8 +314,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 				PartitionDispatch subparent;
 
 				subparent = ExecInitPartitionDispatchInfo(proute,
-													partdesc->oids[partidx],
-													parent, partidx);
+														  partdesc->oids[partidx],
+														  parent, partidx);
 				Assert(parent->indexes[partidx] >= 0 &&
 					   parent->indexes[partidx] < proute->num_dispatch);
 				parent = subparent;
@@ -328,12 +334,12 @@ static void
 ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 							   PartitionTupleRouting *proute)
 {
-	ModifyTable	   *node = (ModifyTable *) mtstate->ps.plan;
-	ResultRelInfo  *subplan_result_rels;
-	HASHCTL			ctl;
-	HTAB		   *htab;
-	int				nsubplans;
-	int				i;
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
 
 	subplan_result_rels = mtstate->resultRelInfo;
 	nsubplans = list_length(node->plans);
@@ -345,9 +351,9 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 
 	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
 					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-	proute->subplan_partition_table = htab;
+	proute->subplan_resultrel_hash = htab;
 
-	/* Hash all subplan by Oid */
+	/* Hash all subplans by their Oid */
 	for (i = 0; i < nsubplans; i++)
 	{
 		ResultRelInfo *rri = &subplan_result_rels[i];
@@ -356,16 +362,15 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 		ResultRelInfo **subplanrri;
 
 		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
-												   &found);
+													&found);
 
 		if (!found)
 			*subplanrri = rri;
 
 		/*
-		 * This is required in order to convert the partition's tuple
-		 * to be compatible with the root partitioned table's tuple
-		 * descriptor.  When generating the per-subplan result rels,
-		 * this was not set.
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
 		 */
 		rri->ri_PartitionRoot = proute->partition_root;
 	}
@@ -378,8 +383,8 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 static void
 ExecExpandRoutingArrays(PartitionTupleRouting *proute)
 {
-	int		new_size = proute->partitions_allocsize * 2;
-	int		old_size = proute->partitions_allocsize;
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
 	proute->partitions_allocsize = new_size;
 
@@ -389,8 +394,8 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
 	if (proute->parent_child_tupconv_maps != NULL)
 	{
 		proute->parent_child_tupconv_maps = (TupleConversionMap **)
-			repalloc( proute->parent_child_tupconv_maps,
-						sizeof(TupleConversionMap *) * new_size);
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
 		memset(&proute->parent_child_tupconv_maps[old_size], 0,
 			   sizeof(TupleConversionMap *) * (new_size - old_size));
 	}
@@ -827,7 +832,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	if (map)
 	{
-		int		new_size;
+		int			new_size;
 
 		/* Allocate parent child map array only if we need to store a map */
 		if (proute->parent_child_tupconv_maps == NULL)
@@ -858,8 +863,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
  *		Initialize PartitionDispatch for a partitioned table
  *
  * This also stores it in the proute->partition_dispatch_info array at the
- * specified index ('dispatchidx'), possibly expanding the array if there
- * isn't space left in it.
+ * specified index ('partidx'), possibly expanding the array if there isn't
+ * space left in it.
  */
 static PartitionDispatch
 ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
@@ -880,7 +885,8 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	partdesc = RelationGetPartitionDesc(rel);
 	partkey = RelationGetPartitionKey(rel);
 
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
 	pd->reldesc = rel;
 	pd->key = partkey;
 	pd->keystate = NIL;
@@ -897,9 +903,9 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 		 */
 		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
 		pd->tupmap =
-				convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
-									   tupdesc,
-									   gettext_noop("could not convert row type"));
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
 	}
 	else
 	{
@@ -908,8 +914,6 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 		pd->tupmap = NULL;
 	}
 
-	pd->indexes = (int *) palloc(sizeof(int) * partdesc->nparts);
-
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1046,6 +1050,7 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
 
 	/*
@@ -1078,15 +1083,14 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		 * Check if this result rel is one belonging to the node's subplans,
 		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_table)
+		if (resultrel_hash)
 		{
 			Oid			partoid;
 			bool		found;
 
 			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 
-			(void) hash_search(proute->subplan_partition_table, &partoid,
-							   HASH_FIND, &found);
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
 			if (found)
 				continue;
 		}
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 6e0c7862dc..4f7cea7668 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1779,12 +1779,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps ?
-								proute->parent_child_tupconv_maps[partidx] :
-								NULL,
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot);
+	if (proute->parent_child_tupconv_maps)
+		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+								  tuple, proute->partition_tuple_slot, &slot);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 4cc7508067..4b3b5ae770 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -27,8 +27,7 @@ typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
 	Oid		   *oids;			/* Array of length 'nparts' containing
-								 * partition OIDs in order of the their
-								 * bounds */
+								 * partition OIDs in order of the their bounds */
 	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
 								 * a partition is a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 1b421f2ec5..d921ab6ca0 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -48,7 +48,7 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int		   indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
@@ -58,23 +58,23 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  * route a tuple inserted into a partitioned table to one of its leaf
  * partitions
  *
- *	partition_root			Root table, that is, the table mentioned in the
+ * partition_root			Root table, that is, the table mentioned in the
  *							command.
  *
- *	partition_dispatch_info	Contains PartitionDispatch objects for every
+ * partition_dispatch_info	Contains PartitionDispatch objects for every
  *							partitioned table touched by tuple routing.  The
  *							entry for the root partitioned table is *always*
  *							present as the first entry of this array.
  *
- *	num_dispatch			The number of existing entries and also serves as
+ * num_dispatch				The number of existing entries and also serves as
  *							the index of the next entry to be allocated and
  *							placed in 'partition_dispatch_info'.
  *
- *	dispatch_allocsize		(>= 'num_dispatch') is the number of entries that
+ * dispatch_allocsize		(>= 'num_dispatch') is the number of entries that
  *							can be stored in 'partition_dispatch_info' before
  *							needing to reallocate more space.
  *
- *	partitions				Contains pointers to a ResultRelInfos of all leaf
+ * partitions				Contains pointers to a ResultRelInfos of all leaf
  *							partitions touched by tuple routing.  Some of
  *							these are pointers to "reused" ResultRelInfos,
  *							that is, those that are created and destroyed
@@ -83,18 +83,18 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							the partition key.  Rest of them are pointers to
  *							ResultRelInfos managed by execPartition.c itself
  *
- *	num_partitions			The number of existing entries and also serves as
+ * num_partitions			The number of existing entries and also serves as
  *							the index of the next entry to be allocated and
  *							placed in 'partitions'
  *
- *	partitions_allocsize	(>= 'num_partitions') is the number of entries
+ * partitions_allocsize		(>= 'num_partitions') is the number of entries
  *							that can be stored in 'partitions',
  *							'parent_child_tupconv_maps',
  *							'child_parent_tupconv_maps' and
  *							'child_parent_map_not_required' arrays before
  *							needing to reallocate more space
  *
- *	parent_child_tupconv_maps	Contains information to convert tuples of the
+ * parent_child_tupconv_maps	Contains information to convert tuples of the
  *							root parent's rowtype to those of the leaf
  *							partitions' rowtype, but only for those partitions
  *							whose TupleDescs are physically different from the
@@ -105,7 +105,7 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							need not be more of these maps than there are
  *							partitions that were touched.
  *
- *	partition_tuple_slot	This is a tuple slot used to store a tuple using
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
  *							rowtype of the partition chosen by tuple
  *							routing.  Maintained separately because partitions
  *							may have different rowtype.
@@ -113,7 +113,7 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  * Note: The following fields are used only when UPDATE ends up needing to
  * do tuple routing.
  *
- *	child_parent_tupconv_maps	Information to convert tuples of the leaf
+ * child_parent_tupconv_maps	Information to convert tuples of the leaf
  *							partitions' rowtype to the root parent's rowtype.
  *							These are needed by transition table machinery
  *							when storing tuples of partition's rowtype into
@@ -124,14 +124,14 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							needed a conversion map.  Also, if non-NULL, is of
  *							the same size as 'partitions'.
  *
- *	child_parent_map_not_required	Stores if we don't need a conversion
+ * child_parent_map_not_required	Stores if we don't need a conversion
  *							map for a partition so that TupConvMapForLeaf
  *							can return without having to re-check if it needs
  *							to build a map.
  *
- *	subplan_partition_table	Hash table to store subplan index by Oid.
+ * subplan_resultrel_hash	Hash table to store subplan index by Oid.
  *
- *	root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
  *							used to transiently store a tuple using the root
  *							table's rowtype after converting it from the
  *							tuple's source leaf partition's rowtype.  That is,
@@ -151,7 +151,7 @@ typedef struct PartitionTupleRouting
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
-	HTAB	   *subplan_partition_table;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
 	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
v3-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From 8cf1c7fd45cf4cc954d7bcce7ec395ad9d01f807 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v3] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent/child translation maps array is now only
performed when we need to store the first translation map.  Previously,
if the column order between the parent and its children was the same,
no map was ever stored, so this (possibly large) array served no purpose.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, executor shutdown was also slow in comparison to the
actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to scan an array which often contained mostly
NULLs that had to be skipped.  Performance of this has now improved as
the array we loop over no longer contains NULL values to skip.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c            |  28 +-
 src/backend/executor/execPartition.c   | 750 +++++++++++++++++++--------------
 src/backend/executor/nodeModifyTable.c | 105 +----
 src/backend/utils/cache/partcache.c    |  11 +-
 src/include/catalog/partition.h        |   5 +-
 src/include/executor/execPartition.h   | 161 ++++---
 6 files changed, 567 insertions(+), 493 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..6fc1e2b41c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2621,10 +2621,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2644,15 +2642,8 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
+			Assert(proute->partitions[leaf_part_index] != NULL);
 			resultRelInfo = proute->partitions[leaf_part_index];
-			if (resultRelInfo == NULL)
-			{
-				resultRelInfo = ExecInitPartitionInfo(mtstate,
-													  saved_resultRelInfo,
-													  proute, estate,
-													  leaf_part_index);
-				Assert(resultRelInfo != NULL);
-			}
 
 			/*
 			 * For ExecInsertIndexTuples() to work on the partition's indexes
@@ -2693,10 +2684,15 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot);
+			if (proute->parent_child_tupconv_maps)
+			{
+				TupleConversionMap *map =
+				proute->parent_child_tupconv_maps[leaf_part_index];
+
+				tuple = ConvertPartitionTupleSlot(map, tuple,
+												  proute->partition_tuple_slot,
+												  &slot);
+			}
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index cd0ec08461..2a18a30b3e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,138 +69,114 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a single partition.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays. More space can be allocated later, if required via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be setup, but any sub-partitioned tables can be setup lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->child_parent_map_not_required = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent is NULL as we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If UPDATE needs to do tuple routing, we can reuse partition sub-plan
+	 * result rels after tuple routing, so build a hash table to map the OIDs
+	 * of partitions present in mtstate->resultRelInfo to their
+	 * ResultRelInfos.  Every time a tuple is routed to one of the partitions
+	 * present in mtstate->resultRelInfo, looking its OID up in the hash table
+	 * will give us its ResultRelInfo.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
+	int			result = -1;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch parent;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
@@ -210,9 +193,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	{
 		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = parent->reldesc;
+		partdesc = parent->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout so that we can do certain
@@ -240,81 +224,230 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(parent, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (parent->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(parent, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(parent, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(parent->indexes[partidx] < proute->num_partitions);
+				result = parent->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been setup for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						parent->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   parent, partidx);
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
-		else if (parent->indexes[cur_index] >= 0)
+		else
 		{
-			result = parent->indexes[cur_index];
-			break;
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(parent->indexes[partidx] < proute->num_dispatch);
+				parent = pd[parent->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subparent;
+
+				subparent = ExecInitPartitionDispatchInfo(proute,
+														  partdesc->oids[partidx],
+														  parent, partidx);
+				Assert(parent->indexes[partidx] >= 0 &&
+					   parent->indexes[partidx] < proute->num_dispatch);
+				parent = subparent;
+			}
 		}
-		else
-			parent = pd[-parent->indexes[cur_index]];
 	}
+}
 
-	/* A partition was not found. */
-	if (result < 0)
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
+
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
-	return result;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
+
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_map_not_required != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_map_not_required != NULL)
+	{
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(&proute->child_parent_map_not_required[old_size], 0,
+			   sizeof(bool) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in 'proute's partitions array and
+ *		return the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch parent, int partidx)
 {
+	Oid			partoid = parent->partdesc->oids[partidx];
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(partoid, NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -490,15 +623,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	parent->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -511,7 +654,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -524,7 +667,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -538,7 +681,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -548,8 +691,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = proute->parent_child_tupconv_maps ?
+				proute->parent_child_tupconv_maps[part_result_rel_index] :
+				NULL;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -558,7 +707,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -649,12 +798,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -669,6 +815,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -679,10 +826,24 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		int			new_size;
+
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			new_size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * new_size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -697,6 +858,87 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	partRelInfo->ri_PartitionReadyForRouting = true;
 }
 
+/*
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('partidx'), possibly expanding the array if there isn't
+ * space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
+{
+	Relation	rel;
+	TupleDesc	tupdesc;
+	PartitionDesc partdesc;
+	PartitionKey partkey;
+	PartitionDispatch pd;
+	int			dispatchidx;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	tupdesc = RelationGetDescr(rel);
+	partdesc = RelationGetPartitionDesc(rel);
+	partkey = RelationGetPartitionKey(rel);
+
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = partkey;
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
+
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
+
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	return pd;
+}
+
 /*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
@@ -709,19 +951,22 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 void
 ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 {
+	int			size;
+
 	Assert(proute != NULL);
 
+	size = proute->partitions_allocsize;
+
 	/*
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
 	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
 
 	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
+															 size);
 }
 
 /*
@@ -732,15 +977,15 @@ TupleConversionMap *
 TupConvMapForLeaf(PartitionTupleRouting *proute,
 				  ResultRelInfo *rootRelInfo, int leaf_index)
 {
-	ResultRelInfo **resultRelInfos = proute->partitions;
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/* If nobody else set up the per-leaf maps array, do so ourselves. */
+	if (proute->child_parent_tupconv_maps == NULL)
+		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
+	else if (proute->child_parent_map_not_required[leaf_index])
 		return NULL;
 
 	/* If we've already got a map, return it. */
@@ -749,13 +994,16 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 		return *map;
 
 	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
+	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
 	*map =
 		convert_tuples_by_name(tupdesc,
 							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
 
-	/* If it turns out no map is needed, remember for next time. */
+	/*
+	 * If it turns out no map is needed, remember that so we don't try making
+	 * one again next time.
+	 */
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
@@ -802,8 +1050,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -824,10 +1072,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -836,21 +1080,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -864,144 +1106,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f535762e2d..4f7cea7668 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1666,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,12 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
+	Assert(proute->partitions[partidx] != NULL);
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1789,10 +1779,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot);
+	if (proute->parent_child_tupconv_maps)
+		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+								  tuple, proute->partition_tuple_slot, &slot);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
@@ -1828,17 +1817,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1860,79 +1838,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..aa82aa52eb 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record whether the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..4b3b5ae770 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,10 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of length 'nparts' containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * a partition is a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..d921ab6ca0 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -44,72 +48,112 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int		   indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			Root table, that is, the table mentioned in the
+ *							command.
+ *
+ * partition_dispatch_info	Contains PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the root partitioned table is *always*
+ *							present as the first entry of this array.
+ *
+ * num_dispatch				The number of existing entries; this also serves
+ *							as the index of the next entry to be allocated
+ *							and placed in 'partition_dispatch_info'.
+ *
+ * dispatch_allocsize		(>= 'num_dispatch') is the number of entries that
+ *							can be stored in 'partition_dispatch_info' before
+ *							needing to reallocate more space.
+ *
+ * partitions				Contains pointers to the ResultRelInfos of all leaf
+ *							partitions touched by tuple routing.  Some of
+ *							these are pointers to "reused" ResultRelInfos,
+ *							that is, those that are created and destroyed
+ *							outside execPartition.c, for example, when tuple
+ *							routing is used for UPDATE queries that modify
+ *							the partition key.  The rest are pointers to
+ *							ResultRelInfos managed by execPartition.c itself.
+ *
+ * num_partitions			The number of existing entries; this also serves
+ *							as the index of the next entry to be allocated
+ *							and placed in 'partitions'
+ *
+ * partitions_allocsize		(>= 'num_partitions') is the number of entries
+ *							that can be stored in 'partitions',
+ *							'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'child_parent_map_not_required' arrays before
+ *							needing to reallocate more space
+ *
+ * parent_child_tupconv_maps	Contains information to convert tuples of the
+ *							root parent's rowtype to those of the leaf
+ *							partitions' rowtype, but only for those partitions
+ *							whose TupleDescs are physically different from the
+ *							root parent's.  If none of the partitions has such
+ *							a differing TupleDesc, then it's NULL.  If
+ *							non-NULL, is of the same size as 'partitions', to
+ *							be able to use the same array index.  Also, there
+ *							need not be more of these maps than there are
+ *							partitions that were touched.
+ *
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
+ *							rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtype.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * child_parent_tupconv_maps	Information to convert tuples of the leaf
+ *							partitions' rowtype to the root parent's rowtype.
+ *							These are needed by transition table machinery
+ *							when storing tuples of partition's rowtype into
+ *							the transition table that can only store tuples of
+ *							the root parent's rowtype.  Like
+ *							'parent_child_tupconv_maps' it remains NULL if
+ *							none of the partitions selected by tuple routing
+ *							needed a conversion map.  Also, if non-NULL, is of
+ *							the same size as 'partitions'.
+ *
+ * child_parent_map_not_required	Stores if we don't need a conversion
+ *							map for a partition so that TupConvMapForLeaf
+ *							can return without having to re-check if it needs
+ *							to build a map.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan index by Oid.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype.  That is,
+ *							if the leaf partition's rowtype is different.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
+
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
 /*-----------------------
@@ -186,14 +230,15 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
-- 
2.16.2.windows.1

#20Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#19)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/07/28 10:54, David Rowley wrote:

On 27 July 2018 at 19:11, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I've attached a delta patch to make the above changes. I'm leaving the
hash table rename up to you though.

Thanks for the delta patch. I took all of it, just rewrote a comment slightly.

I also renamed the hash table to your suggestion and changed a few more things.

Attached a delta based on v2 and the full v3 patch.

This includes another small change to make
PartitionDispatchData->indexes an array that's allocated in the same
memory as the PartitionDispatchData. This will save a palloc() call
and should also be a bit more cache-friendly.
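
For anyone unfamiliar with the trick, this is the standard C flexible
array member pattern: the variable-length array lives in the same
allocation as the struct that owns it. A minimal standalone sketch (the
struct and function below are invented for illustration and use plain
malloc; the real PartitionDispatchData carries more fields):

#include <stddef.h>
#include <stdlib.h>

/* Hypothetical dispatch node; illustration only. */
typedef struct DispatchData
{
	int			nparts;
	int			indexes[];		/* flexible array member; must be last */
} DispatchData;

static DispatchData *
make_dispatch(int nparts)
{
	/* One allocation covers both the struct and its trailing array. */
	DispatchData *pd = malloc(offsetof(DispatchData, indexes) +
							  nparts * sizeof(int));

	if (pd == NULL)
		return NULL;

	pd->nparts = nparts;
	for (int i = 0; i < nparts; i++)
		pd->indexes[i] = -1;	/* -1 means "not initialized yet" */

	return pd;
}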

This also required a rebase on master again due to 3e32109049.

Thanks for the updated patch.

I couldn't find much to complain about in the latest v3, except I noticed
a few instances of the word "setup" where I think what's really meant is
"set up".

+ * must be setup, but any sub-partitioned tables can be setup lazily as

+ * A ResultRelInfo has not been setup for this partition yet,

By the way, when going over the updated code, I noticed that the code
around child_parent_tupconv_maps could use some refactoring too.
In particular, I noticed that ExecSetupChildParentMapForLeaf() allocates
the child-to-parent map array needed for transition tuple capture even if
it is not needed by any of the leaf partitions. I'm attaching here a patch
that applies on top of your v3 to show what I'm thinking we could do.
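
In outline, it is the same deferral that ExecInitRoutingInfo now uses for
the parent-to-child array: don't palloc0 anything up front, and only
create the array the first time there is actually a map to store.
Roughly (a condensed sketch of what the attached patch does, not the
exact code; 'tupdesc' and 'rootdesc' stand in for the partition's and the
root's tuple descriptors):

	map = convert_tuples_by_name(tupdesc, rootdesc,
								 gettext_noop("could not convert row type"));

	if (map != NULL)
	{
		/* First map we've needed; only now create the zeroed array. */
		if (proute->child_parent_tupconv_maps == NULL)
			proute->child_parent_tupconv_maps = (TupleConversionMap **)
				palloc0(sizeof(TupleConversionMap *) *
						proute->partitions_allocsize);

		proute->child_parent_tupconv_maps[leaf_index] = map;
	}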

Thanks,
Amit

Attachments:

0002-Some-refactoring-around-child_parent_tupconv_maps.patch (text/plain)
From 6ce1654aa929c7f8112c430914af7f464474ed31 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Mon, 30 Jul 2018 14:05:17 +0900
Subject: [PATCH 2/2] Some refactoring around child_parent_tupconv_maps

Just like parent_child_tupconv_maps, we should allocate it only if
needed.  Also, if none of the partitions ended up needing a map, we
should not have allocated the child_parent_tupconv_maps array, only
the child_parent_map_not_required one.  So, get rid of
ExecSetupChildParentMapForLeaf(), which currently does an initial,
possibly useless, allocation of both of the above-mentioned arrays.
Instead, have TupConvMapForLeaf() allocate the needed array on
demand, just as ExecInitRoutingInfo() does when it needs to store a
parent-to-child map.

Finally, rename the function TupConvMapForLeaf to
LeafToParentTupConvMapForTC for clarity; TC stands for "Transition
Capture".
---
 src/backend/commands/copy.c            |  19 +-----
 src/backend/executor/execPartition.c   | 102 ++++++++++++++++-----------------
 src/backend/executor/nodeModifyTable.c |   4 +-
 src/include/executor/execPartition.h   |  12 ++--
 4 files changed, 59 insertions(+), 78 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6fc1e2b41c..6d0e9229e0 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2503,22 +2503,9 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
-		PartitionTupleRouting *proute;
-
-		proute = cstate->partition_tuple_routing =
+		cstate->partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2666,8 +2653,8 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, saved_resultRelInfo,
-										  leaf_part_index);
+						LeafToParentTupConvMapForTC(proute, saved_resultRelInfo,
+													leaf_part_index);
 				}
 				else
 				{
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 2a18a30b3e..d183e8b758 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -400,7 +400,7 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
 			   sizeof(TupleConversionMap *) * (new_size - old_size));
 	}
 
-	if (proute->child_parent_map_not_required != NULL)
+	if (proute->child_parent_tupconv_maps != NULL)
 	{
 		proute->child_parent_tupconv_maps = (TupleConversionMap **)
 			repalloc(proute->child_parent_tupconv_maps,
@@ -940,73 +940,67 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
- */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
-{
-	int			size;
-
-	Assert(proute != NULL);
-
-	size = proute->partitions_allocsize;
-
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
-
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
-															 size);
-}
-
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
+ * LeafToParentTupConvMapForTC -- Get the tuple conversion map to convert
+ * tuples of a leaf partition to the root parent's rowtype for storing in the
+ * transition capture tuplestore
  */
 TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
+LeafToParentTupConvMapForTC(PartitionTupleRouting *proute,
+							ResultRelInfo *rootRelInfo,
+							int leaf_index)
 {
-	TupleConversionMap **map;
+	TupleConversionMap *map;
 	TupleDesc	tupdesc;
 
-	/* If nobody else set up the per-leaf maps array, do so ourselves. */
-	if (proute->child_parent_tupconv_maps == NULL)
-		ExecSetupChildParentMapForLeaf(proute);
+	Assert(leaf_index < proute->partitions_allocsize);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	else if (proute->child_parent_map_not_required[leaf_index])
+	/* Did we already find out that we don't need a map for this partition? */
+	if (proute->child_parent_map_not_required &&
+		proute->child_parent_map_not_required[leaf_index])
 		return NULL;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
-
-	/* No map yet; try to create one. */
 	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
-	*map =
+	map =
 		convert_tuples_by_name(tupdesc,
 							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
 
-	/*
-	 * If it turns out no map is needed, remember that so we don't try making
-	 * one again next time.
-	 */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	if (map)
+	{
+		/* If the per-leaf maps array has not been set up, do so ourselves. */
+		if (proute->child_parent_tupconv_maps == NULL)
+		{
+			int		size = proute->partitions_allocsize;
 
-	return *map;
+			/*
+			 * These array elements get filled up with maps on an on-demand
+			 * basis.  Initially just set all of them to NULL.
+			 */
+			proute->child_parent_tupconv_maps =
+				(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
+												size);
+		}
+
+		proute->child_parent_tupconv_maps[leaf_index] = map;
+	}
+	else
+	{
+		if (proute->child_parent_map_not_required == NULL)
+		{
+			int		size = proute->partitions_allocsize;
+
+			/*
+			 * Values for other partitions will be filled whenever they're
+			 * selected by routing.
+			 */
+			proute->child_parent_map_not_required =
+				(bool *) palloc0(sizeof(bool) * size);
+		}
+
+		proute->child_parent_map_not_required[leaf_index] = true;
+	}
+
+	return map;
 }
 
 /*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 4f7cea7668..7d658ddad2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1758,7 +1758,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				LeafToParentTupConvMapForTC(proute, targetRelInfo, partidx);
 		}
 		else
 		{
@@ -1773,7 +1773,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			LeafToParentTupConvMapForTC(proute, targetRelInfo, partidx);
 	}
 
 	/*
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index d921ab6ca0..487a131343 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -125,9 +125,9 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							the same size as 'partitions'.
  *
  * child_parent_map_not_required	Stores if we don't need a conversion
- *							map for a partition so that TupConvMapForLeaf
- *							can return without having to re-check if it needs
- *							to build a map.
+ *							map for a partition so that
+ *							LeafToParentTupConvMapForTC can return without
+ *							having to re-check if it needs to build a map.
  *
  * subplan_resultrel_hash	Hash table to store subplan index by Oid.
  *
@@ -244,9 +244,9 @@ extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
+extern TupleConversionMap *LeafToParentTupConvMapForTC(PartitionTupleRouting *proute,
+							ResultRelInfo *rootRelInfo,
+							int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
-- 
2.11.0

#21David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#20)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 30 July 2018 at 20:26, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I couldn't find much to complain about in the latest v3, except I noticed
a few instances of the word "setup" where I think what's really meant is
"set up".

+ * must be setup, but any sub-partitioned tables can be setup lazily as

+ * A ResultRelInfo has not been setup for this partition yet,

Great. I've fixed those and also fixed a few other comments. I found
the comments on PartitionTupleRouting didn't really explain how the
arrays were indexed. I've made an attempt to make that clear.

I've attached a complete v4 patch.
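
In case it helps reviewers, the array indexing in v4 works roughly as
follows (a simplified sketch of the ExecFindPartition logic, not the
exact code): 'partidx' is an offset into parent->partdesc, and
parent->indexes[partidx] translates that into an offset into one of
proute's arrays, with -1 meaning nothing has been set up yet.

	if (parent->partdesc->is_leaf[partidx])
	{
		/* Lazily build the leaf's ResultRelInfo on first use. */
		if (parent->indexes[partidx] < 0)
			parent->indexes[partidx] =
				ExecInitPartitionInfo(mtstate, resultRelInfo, proute,
									  estate, parent, partidx);

		/* Index into proute->partitions. */
		return parent->indexes[partidx];
	}
	else
	{
		/* Lazily build the sub-partitioned table's PartitionDispatch. */
		if (parent->indexes[partidx] < 0)
			parent = ExecInitPartitionDispatchInfo(proute,
												   parent->partdesc->oids[partidx],
												   parent, partidx);
		else
			parent = proute->partition_dispatch_info[parent->indexes[partidx]];

		/* ... and loop again to route within this sub-partitioned table. */
	}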

By the way, when going over the updated code, I noticed that the code
around child_parent_tupconv_maps could use some refactoring too.
Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
child-to-parent map array needed for transition tuple capture even if not
needed by any of the leaf partitions. I'm attaching here a patch that
applies on top of your v3 to show what I'm thinking we could do.

Maybe we can do that as a follow-on patch. I think what we have so far
has already ended up quite complex to review. What do you think?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v4-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From 9218a2526a653a01fd62bf2a6480f09987a7dc13 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v4] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partition's ResultRelInfo and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent/child translation maps array is now only
performed when we need to store the first translation map.  If the
column order between the parent and its child is the same, then no map
ever needs to be stored, so this (possibly large) array previously
served no purpose.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to scan an array which often contained mostly NULLs,
all of which had to be skipped.  Performance of this has now improved as
the array we loop over no longer contains NULL values that need to be
skipped.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c            |  28 +-
 src/backend/executor/execPartition.c   | 752 +++++++++++++++++++--------------
 src/backend/executor/nodeModifyTable.c | 105 +----
 src/backend/utils/cache/partcache.c    |  11 +-
 src/include/catalog/partition.h        |   5 +-
 src/include/executor/execPartition.h   | 155 ++++---
 6 files changed, 563 insertions(+), 493 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..6fc1e2b41c 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2621,10 +2621,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2644,15 +2642,8 @@ CopyFrom(CopyState cstate)
 			 * to the selected partition.
 			 */
 			saved_resultRelInfo = resultRelInfo;
+			Assert(proute->partitions[leaf_part_index] != NULL);
 			resultRelInfo = proute->partitions[leaf_part_index];
-			if (resultRelInfo == NULL)
-			{
-				resultRelInfo = ExecInitPartitionInfo(mtstate,
-													  saved_resultRelInfo,
-													  proute, estate,
-													  leaf_part_index);
-				Assert(resultRelInfo != NULL);
-			}
 
 			/*
 			 * For ExecInsertIndexTuples() to work on the partition's indexes
@@ -2693,10 +2684,15 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot);
+			if (proute->parent_child_tupconv_maps)
+			{
+				TupleConversionMap *map =
+				proute->parent_child_tupconv_maps[leaf_part_index];
+
+				tuple = ConvertPartitionTupleSlot(map, tuple,
+												  proute->partition_tuple_slot,
+												  &slot);
+			}
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index cd0ec08461..1878af52d5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,138 +69,116 @@ static void find_matching_subplans_recurse(PartitionPruneState *prunestate,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a single partition and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays. More space can be allocated later, if required via
+	 * arrays.  More space can be allocated later, if required, via
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be set up, but any sub-partitioned tables can be set up lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->child_parent_map_not_required = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go making one, we check for a pre-made one
+	 * in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
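
To make the growth scheme concrete, here is a minimal standalone C sketch
of the same idiom used above -- a small fixed initial allocation that
doubles on demand.  All names are invented for illustration; this is not
part of the patch:

#include <stdio.h>
#include <stdlib.h>

#define ROUTING_INITSIZE 8		/* stand-in for PARTITION_ROUTING_INITSIZE */

typedef struct Routing
{
	void	  **partitions;		/* lazily-filled pointer array */
	int			num_partitions; /* slots used so far */
	int			allocsize;		/* slots allocated */
} Routing;

static Routing *
routing_create(void)
{
	Routing    *r = malloc(sizeof(Routing));

	r->partitions = malloc(sizeof(void *) * ROUTING_INITSIZE);
	r->num_partitions = 0;
	r->allocsize = ROUTING_INITSIZE;
	return r;
}

/* Claim the next slot, doubling the array if it is full. */
static int
routing_add(Routing *r, void *rel)
{
	int			idx = r->num_partitions++;

	if (idx >= r->allocsize)
	{
		r->allocsize *= 2;
		r->partitions = realloc(r->partitions,
								sizeof(void *) * r->allocsize);
	}
	r->partitions[idx] = rel;
	return idx;
}

int
main(void)
{
	Routing    *r = routing_create();

	for (int i = 0; i < 20; i++)
		routing_add(r, NULL);
	printf("%d slots used, %d allocated\n",
		   r->num_partitions, r->allocsize);	/* 20 used, 32 allocated */
	return 0;
}

The point of the small initial size is that a single-row INSERT never pays
for more than eight slots, while a bulk load that touches many partitions
pays only O(log n) reallocations.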
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
+	int			result = -1;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch parent;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 
@@ -210,9 +195,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	{
 		TupleTableSlot *myslot = parent->tupslot;
 		TupleConversionMap *map = parent->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = parent->reldesc;
+		partdesc = parent->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout so that we can do certain
@@ -240,81 +226,230 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(parent, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (parent->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(parent, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(parent, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may need to
+			 * build a new ResultRelInfo first.
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(parent->indexes[partidx] < proute->num_partitions);
+				result = parent->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						parent->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* No pre-made ResultRelInfo was found; create one afresh. */
+				if (result < 0)
+				{
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   parent, partidx);
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
-		else if (parent->indexes[cur_index] >= 0)
+		else
 		{
-			result = parent->indexes[cur_index];
-			break;
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(parent->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(parent->indexes[partidx] < proute->num_dispatch);
+				parent = pd[parent->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subparent;
+
+				subparent = ExecInitPartitionDispatchInfo(proute,
+														  partdesc->oids[partidx],
+														  parent, partidx);
+				Assert(parent->indexes[partidx] >= 0 &&
+					   parent->indexes[partidx] < proute->num_dispatch);
+				parent = subparent;
+			}
 		}
-		else
-			parent = pd[-parent->indexes[cur_index]];
 	}
+}
 
-	/* A partition was not found. */
-	if (result < 0)
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
+
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
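
The hash built here is simply an OID-to-pointer map (the patch uses
PostgreSQL's dynahash for it).  dynahash specifics aside, the lookup
pattern is equivalent to the following standalone sketch, with invented
names and linear probing standing in for dynahash:

#include <stdio.h>

#define TABSIZE 64				/* power of two, larger than nsubplans */

typedef struct Entry
{
	unsigned	oid;			/* key; 0 means the slot is empty */
	void	   *rri;			/* the subplan's ResultRelInfo */
} Entry;

static Entry tab[TABSIZE];

static void
put(unsigned oid, void *rri)
{
	unsigned	i = oid & (TABSIZE - 1);

	while (tab[i].oid != 0 && tab[i].oid != oid)
		i = (i + 1) & (TABSIZE - 1);
	tab[i].oid = oid;
	tab[i].rri = rri;
}

static void *
get(unsigned oid)
{
	unsigned	i = oid & (TABSIZE - 1);

	while (tab[i].oid != 0)
	{
		if (tab[i].oid == oid)
			return tab[i].rri;
		i = (i + 1) & (TABSIZE - 1);
	}
	return NULL;				/* no subplan result rel for this OID */
}

int
main(void)
{
	int			dummy;

	put(16384, &dummy);
	printf("found: %d, missing: %d\n",
		   get(16384) == (void *) &dummy, get(16385) == NULL);
	return 0;
}

The design point is that the old code relied on the subplan rels and leaf
OIDs sharing a canonical order; hashing by OID removes that ordering
dependency entirely.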
 
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
-	return result;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
+
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_map_not_required != NULL)
+	{
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(&proute->child_parent_map_not_required[old_size], 0,
+			   sizeof(bool) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot of proute's 'partitions' array,
+ *		returning the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch parent, int partidx)
 {
+	Oid			partoid = parent->partdesc->oids[partidx];
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(partoid, NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -490,15 +625,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	parent->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -511,7 +656,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -524,7 +669,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -538,7 +683,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -548,8 +693,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = proute->parent_child_tupconv_maps ?
+				proute->parent_child_tupconv_maps[part_result_rel_index] :
+				NULL;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -558,7 +709,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -649,12 +800,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -669,6 +817,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -679,10 +828,24 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			new_size;
+
+			new_size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * new_size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -697,6 +860,87 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	partRelInfo->ri_PartitionReadyForRouting = true;
 }
 
+/*
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
+ *
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('partidx'), possibly expanding the array if there isn't
+ * space left in it.
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
+{
+	Relation	rel;
+	TupleDesc	tupdesc;
+	PartitionDesc partdesc;
+	PartitionKey partkey;
+	PartitionDispatch pd;
+	int			dispatchidx;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	tupdesc = RelationGetDescr(rel);
+	partdesc = RelationGetPartitionDesc(rel);
+	partkey = RelationGetPartitionKey(rel);
+
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = partkey;
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
+
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
+
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	return pd;
+}
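
Two details above are worth making concrete: 'indexes' is a flexible array
member allocated in the same chunk as the struct, and memset() with -1 is
a valid way to initialize an int array only because setting every byte to
0xFF yields -1 in two's complement.  A minimal standalone sketch, with
invented names and not part of the patch:

#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical miniature of PartitionDispatchData's tail array. */
typedef struct Dispatch
{
	int			nparts;
	int			indexes[];		/* FLEXIBLE_ARRAY_MEMBER in PostgreSQL */
} Dispatch;

int
main(void)
{
	int			nparts = 4;
	Dispatch   *pd = malloc(offsetof(Dispatch, indexes) +
							nparts * sizeof(int));

	pd->nparts = nparts;
	/* Every byte 0xFF makes each int -1: "nothing allocated yet". */
	memset(pd->indexes, -1, sizeof(int) * nparts);
	printf("indexes[2] = %d\n", pd->indexes[2]);	/* prints -1 */
	return 0;
}

Allocating the array in the same chunk as the struct saves a palloc per
PartitionDispatch and keeps the indexes adjacent to the rest of the
dispatch data.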
+
 /*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
@@ -709,19 +953,22 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 void
 ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 {
+	int			size;
+
 	Assert(proute != NULL);
 
+	size = proute->partitions_allocsize;
+
 	/*
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
 	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
 
 	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
+															 size);
 }
 
 /*
@@ -732,15 +979,15 @@ TupleConversionMap *
 TupConvMapForLeaf(PartitionTupleRouting *proute,
 				  ResultRelInfo *rootRelInfo, int leaf_index)
 {
-	ResultRelInfo **resultRelInfos = proute->partitions;
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/* If nobody else set up the per-leaf maps array, do so ourselves. */
+	if (proute->child_parent_tupconv_maps == NULL)
+		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
+	else if (proute->child_parent_map_not_required[leaf_index])
 		return NULL;
 
 	/* If we've already got a map, return it. */
@@ -749,13 +996,16 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 		return *map;
 
 	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
+	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
 	*map =
 		convert_tuples_by_name(tupdesc,
 							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
 
-	/* If it turns out no map is needed, remember for next time. */
+	/*
+	 * If it turns out no map is needed, remember that so we don't try making
+	 * one again next time.
+	 */
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
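
TupConvMapForLeaf is a memoization: a NULL map pointer alone cannot
distinguish "not computed yet" from "computed, and no conversion is
needed", hence the parallel boolean array.  A minimal standalone sketch of
the same pattern, with hypothetical names:

#include <stdbool.h>
#include <stdio.h>

#define NLEAF 4

static int *maps[NLEAF];				/* NULL until computed */
static bool map_not_required[NLEAF];	/* true: computed, none needed */

/* Stand-in for convert_tuples_by_name(); returns NULL when unneeded. */
static int *
expensive_build_map(int i)
{
	return NULL;
}

static int *
map_for_leaf(int i)
{
	if (map_not_required[i])
		return NULL;			/* known: no conversion needed */
	if (maps[i])
		return maps[i];			/* cached from an earlier call */
	maps[i] = expensive_build_map(i);
	map_not_required[i] = (maps[i] == NULL);
	return maps[i];
}

int
main(void)
{
	map_for_leaf(0);
	printf("leaf 0 needs no map: %d\n", map_not_required[0]);
	return 0;
}
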
@@ -802,8 +1052,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -824,10 +1074,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -836,21 +1082,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans,
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -864,144 +1108,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f535762e2d..71fa3ea904 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1666,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,12 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
+	Assert(proute->partitions[partidx] != NULL);
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1789,10 +1779,9 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot);
+	if (proute->parent_child_tupconv_maps)
+		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+								  tuple, proute->partition_tuple_slot, &slot);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
@@ -1828,17 +1817,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1860,79 +1838,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..82acfeb460 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..4b3b5ae770 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,10 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of length 'nparts' containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding partition is a leaf */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
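
The point of the new is_leaf array (item 5 in the original list) is that
tuple routing can test a plain boolean per partition instead of calling
get_rel_relkind(), which performs a catalog cache lookup, on every hop.  A
hypothetical miniature of the layout, with invented names and not part of
the patch:

#include <stdbool.h>
#include <stdio.h>

typedef struct MiniPartDesc
{
	int			nparts;
	unsigned   *oids;			/* partition OIDs, ordered by bound */
	bool	   *is_leaf;		/* filled once, when the desc is built */
} MiniPartDesc;

int
main(void)
{
	unsigned	oids[] = {16384, 16385, 16386};
	bool		leaf[] = {true, false, true};
	MiniPartDesc pd = {3, oids, leaf};

	/* Routing's test is now a plain array load, no cache lookup. */
	for (int i = 0; i < pd.nparts; i++)
		printf("oid %u is %s\n", pd.oids[i],
			   pd.is_leaf[i] ? "a leaf" : "sub-partitioned");
	return 0;
}
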
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..9b445104dc 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -44,72 +48,106 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int		   indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Contains PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present as the first entry of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The number of valid entries in
+ *							'partition_dispatch_info'; this also serves as
+ *							the index at which the next entry will be placed.
+ *
+ * dispatch_allocsize		The number of entries (>= 'num_dispatch') that
+ *							can be stored in 'partition_dispatch_info' before
+ *							more space must be reallocated.
+ *
+ * partitions				Contains pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to "reused" ResultRelInfos,
+ *							that is, ones that are created and destroyed
+ *							outside execPartition.c, for example, when tuple
+ *							routing is used for UPDATE queries that modify
+ *							the partition key.  The rest are pointers to
+ *							ResultRelInfos managed by execPartition.c itself.
+ *							See comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_partitions			The number of valid entries in 'partitions'; this
+ *							also serves as the index at which the next entry
+ *							will be placed.
+ *
+ * partitions_allocsize		The number of entries (>= 'num_partitions') that
+ *							can be stored in the 'partitions',
+ *							'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'child_parent_map_not_required' arrays before
+ *							more space must be reallocated.
+ *
+ * parent_child_tupconv_maps	Array of 'partitions_allocsize' elements
+ *							containing maps to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The array
+ *							is only allocated when the first conversion map
+ *							needs to be stored; until then it is NULL.
+ *
+ * partition_tuple_slot		A tuple slot used to store a tuple using the
+ *							rowtype of the partition chosen by tuple routing.
+ *							Maintained separately because partitions may have
+ *							different rowtypes.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype.
+ *
+ * child_parent_map_not_required	True if the corresponding
+ *							child_parent_tupconv_maps element has been
+ *							determined to require no translation.  This array
+ *							is NULL whenever child_parent_tupconv_maps is
+ *							NULL.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple in the root
+ *							table's rowtype after converting it from the
+ *							source leaf partition's rowtype, when the two
+ *							rowtypes differ.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
 /*-----------------------
@@ -186,14 +224,15 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
-- 
2.16.2.windows.1

#22David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#21)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 31 July 2018 at 19:03, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've attached a complete v4 patch.

I've attached v5 of the patch which is based on top of today's master
(@ 579b985b22)

A couple of recent patches conflict with v4. I've also made another
tidy-up pass, which was mostly just rewording comments in
execPartition.h.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v5-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
From 674dde374fc398a91f4adc939f2fbe7c9c63902c Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v5] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.

The find_all_inheritors call remains by far the slowest part of
ExecSetupPartitionTupleRouting; this patch just removes the other slow
parts.

Initialization of the parent/child translation maps array is now only
performed when we need to store the first translation map.  If the column
order between the parent and its children is the same, then no map ever
needs to be stored, so this (possibly large) array previously served no
purpose.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, executor shutdown was also slow in comparison to the
actual execution.  This was down to the cleanup loop over each
ResultRelInfo having to scan an array which often contained mostly NULLs
that had to be skipped.  Performance here has improved as the array we
loop over no longer contains NULL values.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c            |  31 +-
 src/backend/executor/execPartition.c   | 764 +++++++++++++++++++--------------
 src/backend/executor/nodeModifyTable.c | 108 +----
 src/backend/utils/cache/partcache.c    |  11 +-
 src/include/catalog/partition.h        |   5 +-
 src/include/executor/execPartition.h   | 163 ++++---
 6 files changed, 579 insertions(+), 503 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..752ba3d767 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2699,10 +2699,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2800,15 +2798,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2864,11 +2854,16 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot,
-											  false);
+			if (proute->parent_child_tupconv_maps)
+			{
+				TupleConversionMap *map =
+				proute->parent_child_tupconv_maps[leaf_part_index];
+
+				tuple = ConvertPartitionTupleSlot(map, tuple,
+												  proute->partition_tuple_slot,
+												  &slot,
+												  false);
+			}
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index d13be4145f..7849e04bdb 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,138 +69,115 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays.  More space can be allocated later, if required: the
+	 * partitions arrays grow via ExecExpandRoutingArrays, while the dispatch
+	 * array is enlarged directly in ExecInitPartitionDispatchInfo.
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be set up, but any sub-partitioned tables can be set up lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->child_parent_map_not_required = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Whenever a tuple is routed to a partition for which we have not yet
+	 * set up a ResultRelInfo, we check the hash table for a pre-made one
+	 * before building a new one.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
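
For clarity, here is a condensed sketch of how a routed INSERT is expected to
drive this API (illustrative only, not code from the patch; mtstate, rootRel,
rootResultRelInfo, slot and estate are assumed to be in scope, and error
handling is omitted):

	PartitionTupleRouting *proute;
	ResultRelInfo *partrel;
	int			partidx;

	/* Cheap one-time setup; all per-partition work is deferred. */
	proute = ExecSetupPartitionTupleRouting(mtstate, rootRel);

	/* Routing a tuple builds that partition's ResultRelInfo on demand. */
	partidx = ExecFindPartition(mtstate, rootResultRelInfo, proute,
								slot, estate);
	partrel = proute->partitions[partidx];

	/* ... insert the tuple using partrel ... */

	ExecCleanupTupleRouting(mtstate, proute);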
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -216,9 +200,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		TupleConversionMap *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -244,37 +229,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
-		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may need to build
+			 * a new ResultRelInfo.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					SubplanResultRelHashElem *elem;
+					Oid			partoid = partdesc->oids[partidx];
+
+					elem = hash_search(proute->subplan_resultrel_hash,
+									   &partoid, HASH_FIND, NULL);
+
+					if (elem)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = elem->rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+														  partdesc->oids[partidx],
+														  dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 
 			/*
 			 * Release the dedicated slot, if it was used.  Create a copy of
@@ -287,58 +349,131 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 			}
 		}
 	}
+}
+
+/*
+ * SubplanResultRelHashElem -- entry in the subplan result rel hash table
+ * built below.  dynahash requires the key to be the first field of the
+ * entry, so the ResultRelInfo pointer must follow it.
+ */
+typedef struct SubplanResultRelHashElem
+{
+	Oid			relid;			/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* A partition was not found. */
-	if (result < 0)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
+
+		elem = (SubplanResultRelHashElem *) hash_search(htab, &partoid,
+														HASH_ENTER, &found);
+
+		if (!found)
+			elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
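
As an aside on the dynahash idiom used here: with HASH_BLOBS, hash_search()
copies the key into the start of each entry, so an entry must be a struct
whose first field is the key, with the payload stored after it.  A minimal,
self-contained sketch of the pattern, using hypothetical names:

	typedef struct OidLookupEntry
	{
		Oid			relid;		/* hash key; dynahash requires it first */
		void	   *payload;	/* per-entry data follows the key */
	} OidLookupEntry;

	static HTAB *
	make_oid_lookup_table(long nelems)
	{
		HASHCTL		ctl;

		memset(&ctl, 0, sizeof(ctl));
		ctl.keysize = sizeof(Oid);
		ctl.entrysize = sizeof(OidLookupEntry);
		ctl.hcxt = CurrentMemoryContext;

		return hash_create("Oid lookup table", nelems, &ctl,
						   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
	}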
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_map_not_required != NULL)
+	{
+		proute->child_parent_map_not_required = (bool *)
+			repalloc(proute->child_parent_map_not_required,
+					 sizeof(bool) * new_size);
+		memset(&proute->child_parent_map_not_required[old_size], 0,
+			   sizeof(bool) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot of proute's 'partitions' array,
+ *		returning the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -514,15 +649,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -535,7 +680,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -548,7 +693,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -562,7 +707,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -572,8 +717,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = proute->parent_child_tupconv_maps ?
+				proute->parent_child_tupconv_maps[part_result_rel_index] :
+				NULL;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -582,7 +733,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -673,12 +824,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -693,6 +841,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -703,10 +852,24 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size;
+
+			size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -721,6 +884,85 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	partRelInfo->ri_PartitionReadyForRouting = true;
 }
 
+/*
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
+ *
+ * The new PartitionDispatch is stored in the next free slot of the
+ * proute->partition_dispatch_info array (expanding that array if required),
+ * and, if 'parent_pd' is given, that slot's index is recorded in
+ * parent_pd->indexes['partidx'].
+ */
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
+{
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
+
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
+
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
+
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
+
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
+
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	return pd;
+}
+
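
The single-palloc pattern above relies on a C99 flexible array member (the
indexes[] field of PartitionDispatchData; see the execPartition.h changes
below).  A standalone sketch of the idiom, with hypothetical names:

	typedef struct IntList
	{
		int			nitems;
		int			items[FLEXIBLE_ARRAY_MEMBER];	/* allocated in-line */
	} IntList;

	static IntList *
	make_int_list(int nitems)
	{
		/* One allocation covers the header and the trailing array. */
		IntList    *list = (IntList *) palloc(offsetof(IntList, items) +
											  nitems * sizeof(int));

		list->nitems = nitems;
		return list;
	}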
 /*
  * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
  * child-to-root tuple conversion map array.
@@ -733,19 +975,22 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 void
 ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
 {
+	int			size;
+
 	Assert(proute != NULL);
 
+	size = proute->partitions_allocsize;
+
 	/*
 	 * These array elements get filled up with maps on an on-demand basis.
 	 * Initially just set all of them to NULL.
 	 */
 	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
 
 	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
+															 size);
 }
 
 /*
@@ -756,15 +1001,15 @@ TupleConversionMap *
 TupConvMapForLeaf(PartitionTupleRouting *proute,
 				  ResultRelInfo *rootRelInfo, int leaf_index)
 {
-	ResultRelInfo **resultRelInfos = proute->partitions;
 	TupleConversionMap **map;
 	TupleDesc	tupdesc;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/* If nobody else set up the per-leaf maps array, do so ourselves. */
+	if (proute->child_parent_tupconv_maps == NULL)
+		ExecSetupChildParentMapForLeaf(proute);
 
 	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
+	else if (proute->child_parent_map_not_required[leaf_index])
 		return NULL;
 
 	/* If we've already got a map, return it. */
@@ -773,13 +1018,16 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
 		return *map;
 
 	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
+	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
 	*map =
 		convert_tuples_by_name(tupdesc,
 							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
 
-	/* If it turns out no map is needed, remember for next time. */
+	/*
+	 * If it turns out no map is needed, remember that so we don't try making
+	 * one again next time.
+	 */
 	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
 
 	return *map;
@@ -827,8 +1075,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -849,10 +1097,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -861,21 +1105,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans,
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -889,144 +1131,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..bbffbd722e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1667,7 +1666,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1710,21 +1709,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1790,11 +1781,10 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot,
-							  true);
+	if (proute->parent_child_tupconv_maps)
+		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
+								  tuple, proute->partition_tuple_slot, &slot,
+								  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
@@ -1830,17 +1820,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1862,79 +1841,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set up the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..82acfeb460 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..4b3b5ae770 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,10 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of length 'nparts' containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * a partition is a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index f6cd842cc9..0b03b9dd76 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -44,72 +48,114 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int		   indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
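
Restated in code form, the lookup ExecFindPartition performs against this
array looks roughly like the following fragment (illustrative, simplified
from the execPartition.c changes above):

	int			idx = dispatch->indexes[partidx];

	if (idx < 0)
	{
		/* Nothing set up for this partition yet; initialize it now. */
	}
	else if (dispatch->partdesc->is_leaf[partidx])
	{
		/* Leaf: idx points into PartitionTupleRouting->partitions. */
		rri = proute->partitions[idx];
	}
	else
	{
		/* Sub-partitioned table: descend via partition_dispatch_info. */
		dispatch = proute->partition_dispatch_info[idx];
	}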
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to a PartitionDispatch object for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present as the first entry of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for any
+ *							new PartitionDispatch that needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'child_parent_map_not_required' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated it's set
+ *							to NULL.
+ *
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
+ *							the rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtypes.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype.
+ *
+ * child_parent_map_not_required	True if the corresponding
+ *							child_parent_tupconv_maps element has been
+ *							determined to require no translation.  The array
+ *							is NULL whenever child_parent_tupconv_maps is
+ *							NULL.  This is needed to distinguish translations
+ *							already determined to be unnecessary (because the
+ *							TupleDescs are compatible) from translations which
+ *							have yet to be determined.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, when that
+ *							rowtype differs from the root's.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
 	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
 /*
@@ -200,14 +246,15 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
-- 
2.17.1

#23Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#21)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

(looking at the v5 patch but replying to an older email)

On 2018/07/31 16:03, David Rowley wrote:

I've attached a complete v4 patch.

By the way, when going over the updated code, I noticed that the code
around child_parent_tupconv_maps could use some refactoring too.
Especially, I noticed that ExecSetupChildParentMapForLeaf() allocates
child-to-parent map array needed for transition tuple capture even if not
needed by any of the leaf partitions. I'm attaching here a patch that
applies on top of your v3 to show what I'm thinking we could do.

Maybe we can do that as a follow-on patch.

We probably could, but I think it would be a good idea to get rid of *all*
redundant allocations due to tuple routing in one patch, if that's the
mission of this thread and the patch anyway.

I think what we have so far
has already ended up quite complex to review. What do you think?

Yeah, it's kind of complex, but at least it seems that we're clear on the
point that what we're trying to do here is to try to get rid of redundant
allocations.

Parts of the patch that appear complex seems to be around the allocation
of various maps. Especially the child-to-parent maps, which as things
stand today, come from two arrays -- a per-update-subplan array that's
needed by update tuple routing proper and per-leaf partition array (one in
PartitionTupleRouting) that's needed by transition capture machinery. The
original coding was such the update tuple routing handling code would try
to avoid allocating the per-update-subplan array if it saw that per-leaf
partition array was already set up in PartitionTupleRouting, because
transition capture is active in the query. For update-tuple-routing code
to be able to use maps from the per-leaf array, it would have to know
which update-subplans mapped to which tuple-routing-initialized
partitions. That was maintained in the subplan_partition_offset array
that's now gone with this patch, because we no longer want to fix the
tuple-routing-initialized-partition offsets in advance. So, it's better
to dissociate per-subplan maps which are initialized during
ExecInitModifyTable from per-leaf maps which are initialized lazily when
tuple routing initializes a partition, which is what my portion of the
patch did.

As mentioned in my last email, I still think it would be a good idea to
simplify the handling of child-to-parent maps in PartitionTupleRouting
even further, while we're at improving the code in this area. I revised
the patch such that it makes the handling of maps in PartitionTupleRouting
even more uniform. With that patch, we no longer have two completely
unrelated places in the code managing parent-to-child and child-to-parent
maps, even though both arrays are in the same PartitionTupleRouting.
Please find the updated patch attached with this email.

Thanks,
Amit

Attachments:

v2-0002-Refactor-handling-of-child_parent_tupconv_maps.patch (text/plain)
From 1a814f5a40774a51bf702757ec91e02f418a5aba Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Fri, 3 Aug 2018 14:09:51 +0900
Subject: [PATCH v2 2/2] Refactor handling of child_parent_tupconv_maps

They're currently created and handed out by TupConvMapForLeaf, which
makes them look somewhat different from parent_to_child_tupconv_maps.
In fact, both contain conversion maps possibly needed between a
partition initialized by tuple routing and the root parent in one or
the other direction, so it seems odd that parent-to-child ones are
created in ExecInitRoutingInfo, whereas child-to-parent ones in
TupConvMapForLeaf.

The child-to-parent ones are only needed if transition capture is
active, but we can already check that in ExecInitRoutingInfo via
the incoming ModifyTableState (sure, we need to teach CopyFrom to
add the necessary information into its dummy ModifyTableState, but
that doesn't seem too bad).

This way, we can manage both parent-to-child and child-to-parent maps
in similar ways, and more importantly, use the same criterion of
checking whether a partition's slot in the respective array is NULL
or not to conclude if tuple conversion is necessary or not.
---
 src/backend/commands/copy.c            |  37 +++++-------
 src/backend/executor/execPartition.c   | 102 +++++++++------------------------
 src/backend/executor/nodeModifyTable.c |  11 ++--
 src/include/executor/execPartition.h   |  33 ++++++-----
 4 files changed, 64 insertions(+), 119 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 752ba3d767..6f4069d321 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know about whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2521,19 +2525,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2835,8 +2828,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PROUTE_CHILD_TO_PARENT_MAP(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2854,16 +2846,13 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			if (proute->parent_child_tupconv_maps)
-			{
-				TupleConversionMap *map =
-				proute->parent_child_tupconv_maps[leaf_part_index];
-
-				tuple = ConvertPartitionTupleSlot(map, tuple,
-												  proute->partition_tuple_slot,
-												  &slot,
-												  false);
-			}
+			tuple =
+				ConvertPartitionTupleSlot(PROUTE_PARENT_TO_CHILD_MAP(proute,
+										  leaf_part_index),
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot,
+										  false);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
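
The PROUTE_* accessors used above are defined in the execPartition.h hunk of
this patch, which is truncated at the end of this excerpt; presumably they
are NULL-safe array lookups along these lines, shown here only as a reading
aid:

	/* Hypothetical reconstruction; the real definitions are in the
	 * (truncated) execPartition.h hunk. */
	#define PROUTE_PARENT_TO_CHILD_MAP(proute, partidx) \
		((proute)->parent_child_tupconv_maps != NULL ? \
		 (proute)->parent_child_tupconv_maps[(partidx)] : NULL)

	#define PROUTE_CHILD_TO_PARENT_MAP(proute, partidx) \
		((proute)->child_parent_tupconv_maps != NULL ? \
		 (proute)->child_parent_tupconv_maps[(partidx)] : NULL)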
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7849e04bdb..4242f81548 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -113,7 +113,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	/* We only allocate these arrays when we need to store the first map */
 	proute->parent_child_tupconv_maps = NULL;
 	proute->child_parent_tupconv_maps = NULL;
-	proute->child_parent_map_not_required = NULL;
 
 	/*
 	 * Initialize this table's PartitionDispatch object.  Here we pass in the
@@ -719,9 +718,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		{
 			TupleConversionMap *map;
 
-			map = proute->parent_child_tupconv_maps ?
-				proute->parent_child_tupconv_maps[part_result_rel_index] :
-				NULL;
+			map = PROUTE_PARENT_TO_CHILD_MAP(proute, part_result_rel_index);
 
 			Assert(node->onConflictSet != NIL);
 			Assert(rootResultRelInfo->ri_onConflict != NULL);
@@ -872,6 +869,33 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	}
 
 	/*
+	 * Also, if transition capture is active, store a map to convert tuples
+	 * from the partition's rowtype to the parent's.
+	 */
+	if (mtstate && mtstate->mt_transition_capture)
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate the child-parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
+	}
+
+	/*
 	 * If the partition is a foreign table, let the FDW init itself for
 	 * routing tuples to the partition.
 	 */
@@ -964,76 +988,6 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
- */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
-{
-	int			size;
-
-	Assert(proute != NULL);
-
-	size = proute->partitions_allocsize;
-
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
-
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
-															 size);
-}
-
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
-
-	/* If nobody else set up the per-leaf maps array, do so ourselves. */
-	if (proute->child_parent_tupconv_maps == NULL)
-		ExecSetupChildParentMapForLeaf(proute);
-
-	/* If it's already known that we don't need a map, return NULL. */
-	else if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
-
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
-
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
-
-	/*
-	 * If it turns out no map is needed, remember that so we don't try making
-	 * one again next time.
-	 */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
-
-	return *map;
-}
-
-/*
  * ConvertPartitionTupleSlot -- convenience function for tuple conversion.
  * The tuple, if converted, is stored in new_slot, and *p_my_slot is
  * updated to point to it.  new_slot typically should be one of the
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bbffbd722e..f592e4c51a 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1760,7 +1760,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PROUTE_CHILD_TO_PARENT_MAP(proute, partidx);
 		}
 		else
 		{
@@ -1775,16 +1775,15 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PROUTE_CHILD_TO_PARENT_MAP(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	if (proute->parent_child_tupconv_maps)
-		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-								  tuple, proute->partition_tuple_slot, &slot,
-								  true);
+	ConvertPartitionTupleSlot(PROUTE_PARENT_TO_CHILD_MAP(proute, partidx),
+							  tuple, proute->partition_tuple_slot, &slot,
+							  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 0b03b9dd76..0bb84a27aa 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -112,21 +112,13 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							routing.  Maintained separately because partitions
  *							may have different rowtype.
  *
- * Note: The following fields are used only when UPDATE ends up needing to
- * do tuple routing.
- *
  * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
  *							conversion maps to translate partition tuples into
- *							partition_root's rowtype.
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
  *
- * child_parent_map_not_required	True if the corresponding
- *							child_parent_tupconv_maps element has been
- *							determined to require no translation or set to
- *							NULL when child_parent_tupconv_maps is NULL.  This
- *							is required in order to distinguish tuple
- *							translations which have been seen to not be
- *							required due to the TupleDescs being compatible
- *							with transactions which have yet to be determined.
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
  *
  * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
  *							This is used to cache ResultRelInfos from subplans
@@ -159,6 +151,20 @@ typedef struct PartitionTupleRouting
 } PartitionTupleRouting;
 
 /*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting
+ */
+#define PROUTE_CHILD_TO_PARENT_MAP(proute, partidx) \
+			((proute)->child_parent_tupconv_maps != NULL ? \
+				proute->child_parent_tupconv_maps[(partidx)] : \
+							NULL)
+
+#define PROUTE_PARENT_TO_CHILD_MAP(proute, partidx) \
+			((proute)->parent_child_tupconv_maps != NULL ? \
+				proute->parent_child_tupconv_maps[(partidx)] : \
+							NULL)
+
+/*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
  * for the topmost partition plus one for each non-leaf child partition.
@@ -260,9 +266,6 @@ extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
-- 
2.11.0

#24David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#23)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 3 August 2018 at 17:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/07/31 16:03, David Rowley wrote:

Maybe we can do that as a follow-on patch.

We probably could, but I think it would be a good idea to get rid of *all*
redundant allocations due to tuple routing in one patch, if that's the
mission of this thread and the patch anyway.

I started looking at this patch today and I now agree that it should
be included in the main patch.

I changed a few things in the patch. For example, the map access
macros you'd defined were not in CamelCase, so I renamed them. I also
fixed a bug where the child-to-parent map was not being initialised
when ON CONFLICT transition capture was required; I added a test that
crashed the backend before the fix and corrected the code so it now
works. I also got rid of the child_parent_map_not_required array since
we no longer need it: the code now always initialises the maps in
cases where they're going to be required.
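
To make the new convention concrete: a NULL maps array now
unambiguously means "no translation required anywhere", which is why
the old boolean array is redundant. Here's a minimal standalone sketch
of that convention (not the patch code itself; TupleConversionMap is
stubbed out and calloc stands in for palloc0):

#include <stdio.h>
#include <stdlib.h>

typedef struct TupleConversionMap { int dummy; } TupleConversionMap;

typedef struct PartitionTupleRouting
{
	int			partitions_allocsize;
	TupleConversionMap **child_parent_tupconv_maps; /* NULL until needed */
} PartitionTupleRouting;

/* Mirrors PartitionTupRoutingGetToParentMap: a NULL array means no map */
#define GetToParentMap(p, i) \
	((p)->child_parent_tupconv_maps != NULL ? \
	 (p)->child_parent_tupconv_maps[(i)] : NULL)

static void
store_map(PartitionTupleRouting *proute, int partidx, TupleConversionMap *map)
{
	if (map == NULL)
		return;					/* nothing to store; array can stay NULL */

	/* Allocate the (zeroed) array only when storing the first map */
	if (proute->child_parent_tupconv_maps == NULL)
		proute->child_parent_tupconv_maps = (TupleConversionMap **)
			calloc(proute->partitions_allocsize,
				   sizeof(TupleConversionMap *));

	proute->child_parent_tupconv_maps[partidx] = map;
}

int
main(void)
{
	PartitionTupleRouting proute = {8, NULL};
	TupleConversionMap map = {0};

	printf("before: %p\n", (void *) GetToParentMap(&proute, 3));
	store_map(&proute, 3, &map);
	printf("after:  %p\n", (void *) GetToParentMap(&proute, 3));
	return 0;
}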

I've attached a v3 version of your patch and also v6 of the main patch
which includes the v3 patch.
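
As a second illustration, the routing arrays in the v6 patch below now
start at PARTITION_ROUTING_INITSIZE entries and double whenever an
index runs past the end (see ExecExpandRoutingArrays in the patch).
Here's a minimal standalone sketch of that growth pattern, assuming a
stubbed ResultRelInfo and plain malloc/realloc in place of
palloc/repalloc:

#include <stdio.h>
#include <stdlib.h>

typedef struct ResultRelInfo { int relid; } ResultRelInfo;

#define PARTITION_ROUTING_INITSIZE 8

typedef struct PartitionTupleRouting
{
	ResultRelInfo **partitions;
	int			num_partitions;
	int			partitions_allocsize;
} PartitionTupleRouting;

/* Double the partitions array, in the style of ExecExpandRoutingArrays */
static void
expand_routing_arrays(PartitionTupleRouting *proute)
{
	proute->partitions_allocsize *= 2;
	proute->partitions = (ResultRelInfo **)
		realloc(proute->partitions,
				sizeof(ResultRelInfo *) * proute->partitions_allocsize);
}

int
main(void)
{
	PartitionTupleRouting proute;
	int			i;

	proute.partitions = (ResultRelInfo **)
		malloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
	proute.num_partitions = 0;
	proute.partitions_allocsize = PARTITION_ROUTING_INITSIZE;

	/* Route tuples to 20 distinct partitions; array grows 8 -> 16 -> 32 */
	for (i = 0; i < 20; i++)
	{
		int			idx = proute.num_partitions++;

		if (idx >= proute.partitions_allocsize)
			expand_routing_arrays(&proute);
		proute.partitions[idx] = NULL;	/* a fresh ResultRelInfo in reality */
	}

	printf("allocsize is now %d\n", proute.partitions_allocsize);
	return 0;
}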

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v3_Refactor-handling-of-child_parent_tupconv_maps.patch
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 752ba3d767..0dfb9e2e95 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know about whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2521,19 +2525,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2835,8 +2828,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionTupRoutingGetToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2854,16 +2846,13 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			if (proute->parent_child_tupconv_maps)
-			{
-				TupleConversionMap *map =
-				proute->parent_child_tupconv_maps[leaf_part_index];
-
-				tuple = ConvertPartitionTupleSlot(map, tuple,
-												  proute->partition_tuple_slot,
-												  &slot,
-												  false);
-			}
+			tuple =
+				ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute,
+																		   leaf_part_index),
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot,
+										  false);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index ad5fb32203..49f52b9a10 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -113,7 +113,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	/* We only allocate these arrays when we need to store the first map */
 	proute->parent_child_tupconv_maps = NULL;
 	proute->child_parent_tupconv_maps = NULL;
-	proute->child_parent_map_not_required = NULL;
 
 	/*
 	 * Initialize this table's PartitionDispatch object.  Here we pass in the
@@ -427,7 +426,7 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
 			   sizeof(TupleConversionMap *) * (new_size - old_size));
 	}
 
-	if (proute->child_parent_map_not_required != NULL)
+	if (proute->child_parent_tupconv_maps != NULL)
 	{
 		proute->child_parent_tupconv_maps = (TupleConversionMap **)
 			repalloc(proute->child_parent_tupconv_maps,
@@ -435,15 +434,6 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
 		memset(&proute->child_parent_tupconv_maps[old_size], 0,
 			   sizeof(TupleConversionMap *) * (new_size - old_size));
 	}
-
-	if (proute->child_parent_map_not_required != NULL)
-	{
-		proute->child_parent_map_not_required = (bool *)
-			repalloc(proute->child_parent_map_not_required,
-					 sizeof(bool) * new_size);
-		memset(&proute->child_parent_map_not_required[old_size], 0,
-			   sizeof(bool) * (new_size - old_size));
-	}
 }
 
 /*
@@ -719,9 +709,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		{
 			TupleConversionMap *map;
 
-			map = proute->parent_child_tupconv_maps ?
-				proute->parent_child_tupconv_maps[part_result_rel_index] :
-				NULL;
+			map = PartitionTupRoutingGetToChildMap(proute, part_result_rel_index);
 
 			Assert(node->onConflictSet != NIL);
 			Assert(rootResultRelInfo->ri_onConflict != NULL);
@@ -871,6 +859,34 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		proute->parent_child_tupconv_maps[partidx] = map;
 	}
 
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
+	}
+
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
 	 * routing tuples to the partition.
@@ -963,76 +979,6 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	return pd;
 }
 
-/*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
- */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
-{
-	int			size;
-
-	Assert(proute != NULL);
-
-	size = proute->partitions_allocsize;
-
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) * size);
-
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required = (bool *) palloc0(sizeof(bool) *
-															 size);
-}
-
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
-
-	/* If nobody else set up the per-leaf maps array, do so ourselves. */
-	if (proute->child_parent_tupconv_maps == NULL)
-		ExecSetupChildParentMapForLeaf(proute);
-
-	/* If it's already known that we don't need a map, return NULL. */
-	else if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
-
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
-
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(proute->partitions[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
-
-	/*
-	 * If it turns out no map is needed, remember that so we don't try making
-	 * one again next time.
-	 */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
-
-	return *map;
-}
-
 /*
  * ConvertPartitionTupleSlot -- convenience function for tuple conversion.
  * The tuple, if converted, is stored in new_slot, and *p_my_slot is
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bbffbd722e..cd89263f21 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1760,7 +1760,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PartitionTupRoutingGetToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1775,16 +1775,15 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PartitionTupRoutingGetToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	if (proute->parent_child_tupconv_maps)
-		ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-								  tuple, proute->partition_tuple_slot, &slot,
-								  true);
+	ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute, partidx),
+							  tuple, proute->partition_tuple_slot, &slot,
+							  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
 	Assert(mtstate != NULL);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 0b03b9dd76..4bf7a4033a 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -112,21 +112,13 @@ typedef struct PartitionDispatchData *PartitionDispatch;
  *							routing.  Maintained separately because partitions
  *							may have different rowtype.
  *
- * Note: The following fields are used only when UPDATE ends up needing to
- * do tuple routing.
- *
  * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
  *							conversion maps to translate partition tuples into
- *							partition_root's rowtype.
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
  *
- * child_parent_map_not_required	True if the corresponding
- *							child_parent_tupconv_maps element has been
- *							determined to require no translation or set to
- *							NULL when child_parent_tupconv_maps is NULL.  This
- *							is required in order to distinguish tuple
- *							translations which have been seen to not be
- *							required due to the TupleDescs being compatible
- *							with transactions which have yet to be determined.
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
  *
  * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
  *							This is used to cache ResultRelInfos from subplans
@@ -152,12 +144,25 @@ typedef struct PartitionTupleRouting
 	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
 	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
 	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionTupRoutingGetToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionTupRoutingGetToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -260,9 +265,6 @@ extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
v6-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
From 602521912e3c02513ce213ec0f14d973468a0dc5 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v6] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent-to-child and child-to-parent translation
map arrays is now performed only when we need to store the first
translation map.  If the column order between the parent and its
children is the same, then no map ever needs to be stored, and these
(possibly large) arrays previously did nothing.  Since we now always
initialize the child-to-parent map whenever transition capture is
required, we no longer need the child_parent_map_not_required array.
Previously that array was needed to distinguish a map having been
determined to be unnecessary from a map not yet having been
initialized.

For a simple INSERT hitting a single partition of a partitioned table
with many partitions, executor shutdown was also slow in comparison to
the actual execution.  This was down to the loop that cleans up each
ResultRelInfo having to scan an array that contained mostly NULLs, all
of which had to be skipped.  This is now faster, as the array we loop
over no longer contains NULL values that must be skipped.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  48 +-
 src/backend/executor/execPartition.c          | 796 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 109 +---
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 173 ++++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 8 files changed, 626 insertions(+), 565 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..0dfb9e2e95 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know about whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2521,19 +2525,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2699,10 +2692,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2800,15 +2791,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2845,8 +2828,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionTupRoutingGetToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2864,11 +2846,13 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot,
-											  false);
+			tuple =
+				ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute,
+																		   leaf_part_index),
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot,
+										  false);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a9943c3aa..9eee7f8f15 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,143 +69,119 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays. More space can be allocated later, if required via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be set up, but any sub-partitioned tables can be set up lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go making one, we check for a pre-made one
+	 * in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
-	MemoryContext	oldcxt;
-	HeapTuple		tuple;
+	MemoryContext oldcxt;
+	HeapTuple	tuple;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -216,9 +199,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		TupleConversionMap *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -244,37 +228,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
-		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 
 			/*
 			 * Release the dedicated slot, if it was used.  Create a copy of
@@ -287,58 +348,122 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
 
-	/* A partition was not found. */
-	if (result < 0)
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
+
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in 'proute's partitions array and
+ *		return the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -514,15 +639,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -535,7 +670,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -548,7 +683,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -562,7 +697,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -572,8 +707,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionTupRoutingGetToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -582,7 +721,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -673,12 +812,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -693,6 +829,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -703,10 +840,52 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size;
+
+			size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -722,67 +901,82 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
  *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * This also stores it in the proute->partition_dispatch_info array at the
+ * specified index ('partidx'), possibly expanding the array if there isn't
+ * enough space left in it.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
-
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
-
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -827,8 +1021,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -849,10 +1043,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -861,21 +1051,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -889,144 +1077,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..cd89263f21 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1667,7 +1666,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1710,21 +1709,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	Assert(proute->partitions[partidx] != NULL);
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1769,7 +1760,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PartitionTupRoutingGetToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1784,16 +1775,14 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PartitionTupRoutingGetToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot,
+	ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute, partidx),
+							  tuple, proute->partition_tuple_slot, &slot,
 							  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
@@ -1830,17 +1819,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1862,79 +1840,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ouselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..82acfeb460 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..8a639b8b7d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index f6cd842cc9..7370e24b1c 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -44,74 +48,121 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							pointers to the PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present as the first entry of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for any
+ *							new PartitionDispatch which needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps' and
+ *							'child_parent_tupconv_maps' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to stored.  When not allocated it's set
+ *							map needs to be stored.  When not allocated it's set
+ *
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
+ *							the rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtypes.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, when the
+ *							two rowtypes differ.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionTupRoutingGetToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionTupRoutingGetToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -200,22 +251,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#25Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#24)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/08/21 14:44, David Rowley wrote:

On 3 August 2018 at 17:58, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/07/31 16:03, David Rowley wrote:

Maybe we can do that as a follow-on patch.

We probably could, but I think it would be a good idea to get rid of *all*
redundant allocations due to tuple routing in one patch, if that's the
mission of this thread and the patch anyway.

I started looking at this patch today and I now agree that it should
be included in the main patch.

Great, thanks.

I changed a few things with the patch. For example, the map access
macros you'd defined were not in CamelCase.

In the updated patch:

+#define PartitionTupRoutingGetToParentMap(p, i) \
+#define PartitionTupRoutingGetToChildMap(p, i) \

If the "Get" could be replaced by "Child" and "Parent", respectively,
they'd sound more meaningful, imho.
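
For concreteness, that rename would presumably give something like the
following (hypothetical names; bodies copied from the existing macros):

#define PartitionTupRoutingChildToParentMap(p, i) \
			((p)->child_parent_tupconv_maps != NULL ? \
				(p)->child_parent_tupconv_maps[(i)] : \
							NULL)

#define PartitionTupRoutingParentToChildMap(p, i) \
			((p)->parent_child_tupconv_maps != NULL ? \
				(p)->parent_child_tupconv_maps[(i)] : \
							NULL)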

I also fixed a bug where
the child to parent map was not being initialised when on conflict
transition capture was required. I added a test which had been crashing
the backend, and fixed the code so it now works correctly.

Oops, I guess you mean my omission of checking if
mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo.

Thanks for fixing it and adding the test case.

I also got rid of
the child_parent_map_not_required array since we no longer need
it. The code now always initialises the maps in cases where they're
going to be required.

Yes, I thought I had removed the field in my patch, but it looks like I
had just removed the comment about it.

I've attached a v3 version of your patch and also v6 of the main patch
which includes the v3 patch.

I've looked at v6 and spotted some minor typos.

+ * ResultRelInfo for, before we go making one, we check for a
pre-made one

s/making/make/g

+ /* If nobody else set the per-subplan array of maps, do so ouselves. */

I guess I'm the one to blame here for misspelling "ourselves".

Since the above two are minor issues, I fixed them myself in the attached
updated version; I didn't touch the macros though.

Do you agree to setting this patch to "Ready for Committer" in the
September CF?

Thanks,
Amit

Attachments:

v7-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (text/plain)
From 79c906997d80dc426530dea0b75363ef20286001 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v7] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.
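
For illustration (a hypothetical scenario, using the patch's data
structures): given a table with partitions p1..p4 in bound order, routing
tuples first to p3 and then to p1 leaves the routing state as:

    dispatch->indexes      = {1, -1, 0, -1}   /* per bound-order slot: an
                                               * index into 'partitions',
                                               * or -1 when no
                                               * ResultRelInfo exists yet */
    proute->partitions     = {<ResultRelInfo for p3>, <ResultRelInfo for p1>}
    proute->num_partitions = 2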

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent-to-child and child-to-parent translation map
arrays is now performed only when we need to store the first translation
map.  If the column order between the parent and its child is the same,
then no map ever needs to be stored and these (possibly large) arrays did
nothing useful.  Since we now always initialize the child-to-parent map
whenever transition capture is required, we no longer need the
child_parent_map_not_required array.  Previously that array was required
so we could tell "no map required" apart from "map not yet initialized".
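
The scheme can be sketched as follows (a simplified illustration only;
StoreChildParentMap is a hypothetical helper, not a function added by
this patch):

    static void
    StoreChildParentMap(PartitionTupleRouting *proute, int partidx,
                        TupleConversionMap *map)
    {
        /* While every map seen so far is NULL, no array is needed. */
        if (map == NULL && proute->child_parent_tupconv_maps == NULL)
            return;

        /* First non-NULL map: allocate the zero-filled array on demand. */
        if (proute->child_parent_tupconv_maps == NULL)
            proute->child_parent_tupconv_maps = (TupleConversionMap **)
                palloc0(sizeof(TupleConversionMap *) *
                        proute->partitions_allocsize);

        proute->child_parent_tupconv_maps[partidx] = map;
    }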

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, executor shutdown was also slow in comparison to the
actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to walk an array which contained mostly NULLs, all of
which had to be skipped.  Performance has now improved since the array we
loop over contains only initialized entries, with no NULLs to skip.
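
Concretely, the cleanup previously had to do something like this (a sketch
of the loop this patch removes from ExecCleanupTupleRouting):

    for (i = 0; i < proute->num_partitions; i++)
    {
        ResultRelInfo *resultRelInfo = proute->partitions[i];

        /* With 10k partitions this skips almost every element. */
        if (resultRelInfo == NULL)
            continue;
        ...
    }

whereas the array it walks now only ever contains initialized entries.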

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  48 +-
 src/backend/executor/execPartition.c          | 798 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 109 +---
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 171 ++++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 8 files changed, 626 insertions(+), 565 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..0dfb9e2e95 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2521,19 +2525,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2699,10 +2692,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2800,15 +2791,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2845,8 +2828,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionTupRoutingGetToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2864,11 +2846,13 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot,
-											  false);
+			tuple =
+				ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute,
+																		   leaf_part_index),
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot,
+										  false);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a9943c3aa..9ba3664441 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,143 +69,119 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
 
-	/* Set up details specific to the type of tuple routing we are doing. */
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays.  More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be set up, but any sub-partitioned tables can be set up lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
+
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go make one, we check for a pre-made one
+	 * in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
 	if (node && node->operation == CMD_UPDATE)
 	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
 		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
 	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
+	}
 
 	/*
 	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * given partition's rowtype.
 	 */
 	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
-
-	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
-	 */
-	Assert(update_rri_index == num_update_rri);
-
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
-	MemoryContext	oldcxt;
-	HeapTuple		tuple;
+	MemoryContext oldcxt;
+	HeapTuple	tuple;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -216,9 +199,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		TupleConversionMap *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -244,37 +228,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
+		if (partdesc->is_leaf[partidx])
+		{
+			int			result = -1;
 
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
-		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 
 			/*
 			 * Release the dedicated slot, if it was used.  Create a copy of
@@ -287,58 +348,122 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 			}
 		}
 	}
+}
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
 
-	/* A partition was not found. */
-	if (result < 0)
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
 
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
+	}
+}
+
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
+
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
 	}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
-
-	return result;
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in 'proute's partitions array and
+ *		return the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -514,15 +639,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -535,7 +670,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -548,7 +683,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -562,7 +697,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -572,8 +707,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionTupRoutingGetToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -582,7 +721,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -673,12 +812,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -693,6 +829,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -703,10 +840,52 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size;
+
+			size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -722,67 +901,82 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
  *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * This also stores it in the proute->partition_dispatch_info array, possibly
+ * expanding that array if there isn't enough space left in it, and records
+ * the new entry's index in the parent's 'indexes' array (if there's a parent).
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
+
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
+
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
+
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
 	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
 	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
-
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
-
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
-
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
-
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
-
-	return *map;
+	return pd;
 }
 
 /*
@@ -827,8 +1021,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -849,10 +1043,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -861,21 +1051,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -889,144 +1077,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..365b4fd6f9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1667,7 +1666,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1710,21 +1709,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	Assert(proute->partitions[partidx] != NULL);
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1769,7 +1760,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PartitionTupRoutingGetToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1784,16 +1775,14 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PartitionTupRoutingGetToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot,
+	ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute, partidx),
+							  tuple, proute->partition_tuple_slot, &slot,
 							  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
@@ -1831,17 +1820,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			i;
 
 	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
-	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
 	 * conversion is necessary, which is hopefully a common case.
@@ -1863,78 +1841,17 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 }
 
 /*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
-/*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
-
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..82acfeb460 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..8a639b8b7d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index f6cd842cc9..7370e24b1c 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -44,75 +48,122 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
  *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to a PartitionDispatch object for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present as the first entry of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for any
+ *							new PartitionDispatch that needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the
+ *							allocated size of the
+ *							'parent_child_tupconv_maps' and
+ *							'child_parent_tupconv_maps' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated it's set
+ *							to NULL.
+ *
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
+ *							the rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtypes.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active.
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype.  That is,
+ *							if the leaf partition's rowtype is different.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
 /*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionTupRoutingGetToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionTupRoutingGetToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
+/*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
  * for the topmost partition plus one for each non-leaf child partition.
@@ -200,22 +251,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.11.0

#26David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#25)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+#define PartitionTupRoutingGetToParentMap(p, i) \
+#define PartitionTupRoutingGetToChildMap(p, i) \

If the "Get" could be replaced by "Child" and "Parent", respectively,
they'd sound more meaningful, imho.

I did that to save 3 chars. I think putting the additional
Child/Parent in the name is not really required. It's not as if we're
going to have a ParentToParent or a ChildToChild map, so I thought it
might be okay to assume that "ToParent" means the conversion is from
the child, and "ToChild" means it's from the parent. I can change it
though if you feel very strongly that what I've got is no good.
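
For illustration, the intended call pattern looks roughly like this (a
sketch only, not taken verbatim from the patch; do_convert_tuple() is the
existing tupconvert.c helper, and 'proute' and 'partidx' are assumed to
have come from ExecFindPartition()):

	/* Convert a routed tuple from the root's rowtype to the partition's. */
	TupleConversionMap *map = PartitionTupRoutingGetToChildMap(proute, partidx);

	if (map != NULL)
		tuple = do_convert_tuple(tuple, map);

	/* ...and the reverse direction, e.g. for transition tuple capture. */
	map = PartitionTupRoutingGetToParentMap(proute, partidx);
	if (map != NULL)
		tuple = do_convert_tuple(tuple, map);

Either macro returns NULL when its map array was never allocated, or when
no conversion is needed because the rowtypes already match.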

I also fixed a bug where
the child-to-parent map was not being initialised when ON CONFLICT
transition capture was required. I added a test which crashed the
backend before the fix, and fixed the code so it now works correctly.

Oops, I guess you mean my omission of checking if
mtstate->mt_oc_transition_capture is non-NULL in ExecInitRoutingInfo.

Yeah.
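
To spell the fix out for the archives: ExecInitRoutingInfo now builds the
child-to-parent map when either flavour of transition capture is active,
roughly like this (an outline only; the map array itself is still only
allocated on demand, as in the patch):

	if (mtstate &&
		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
	{
		map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
									 RelationGetDescr(partRelInfo->ri_PartitionRoot),
									 gettext_noop("could not convert row type"));

		/* if map is non-NULL, store it in child_parent_tupconv_maps[partidx] */
	}

Previously only mt_transition_capture was checked, which left the map
uninitialised for the ON CONFLICT transition capture case.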

I've looked at v6 and spotted some minor typos.

+ * ResultRelInfo for, before we go making one, we check for a pre-made one

s/making/make/g

I disagree, but perhaps we can just reword it a bit. I've now got:

+ * Every time a tuple is routed to a partition that we've yet to set the
+ * ResultRelInfo for, before we go to the trouble of making one, we check
+ * for a pre-made one in the hash table.

+ /* If nobody else set the per-subplan array of maps, do so ouselves. */

I guess I'm the one to blame here for misspelling "ourselves".

Thanks for noticing.

Since the above two are minor issues, fixed them myself in the attached
updated version; didn't touch the macro though.

I've attached a v8. The only change from your v7 is in the "go making" comment.

Do you agree to setting this patch to "Ready for Committer" in the
September CF?

I read through the entire patch a couple of times yesterday and saw
nothing else, so yeah, I think now is a good time for someone with
more authority to have a look at it.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v8-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From fdf14a8d6549e565a8e1735dddec1ffac946d89e Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v8] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partition's ResultRelInfo and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partition's ResultRelInfos are initialized
rather than in same order as partdesc.

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent-to-child and child-to-parent translation map
arrays is now only performed when we need to store the first translation
map.  If the column order between the parent and its child is the same,
then no map ever needs to be stored, so previously these (possibly large)
arrays did nothing.  Since we now always initialize the child-to-parent
map whenever transition capture is required, we no longer need the
child_parent_map_not_required array.  Previously that array was needed to
distinguish between a map having been determined to be unnecessary and a
map that had not yet been initialized.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution. This was down to the loop which cleans up each
ResultRelInfo having to walk an array which contained mostly NULLs, all of
which had to be skipped.  Performance has now improved as the array we
loop over no longer contains possibly many NULL values to skip.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  48 +-
 src/backend/executor/execPartition.c          | 796 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 109 +---
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 173 ++++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 8 files changed, 626 insertions(+), 565 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..0dfb9e2e95 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2521,19 +2525,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2699,10 +2692,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2800,15 +2791,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2845,8 +2828,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionTupRoutingGetToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2864,11 +2846,13 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot,
-											  false);
+			tuple =
+				ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute,
+																		   leaf_part_index),
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot,
+										  false);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a9943c3aa..f0a4067d93 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,11 +31,18 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -62,143 +69,119 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays.  More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be set up, but any sub-partitioned tables can be set up lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL, since we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
-	MemoryContext	oldcxt;
-	HeapTuple		tuple;
+	MemoryContext oldcxt;
+	HeapTuple	tuple;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -216,9 +199,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		TupleConversionMap *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -244,37 +228,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
-		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 
 			/*
 			 * Release the dedicated slot, if it was used.  Create a copy of
@@ -287,58 +348,122 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set the ri_PartitionRoot of each subplan
+ *		ResultRelInfo.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
 
-	/* A partition was not found. */
-	if (result < 0)
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
+
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in 'proute's partitions array and
+ *		and store it in the next empty slot of 'proute's partitions array,
+ *		returning the index of that element.
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -514,15 +639,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -535,7 +670,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -548,7 +683,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -562,7 +697,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -572,8 +707,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionTupRoutingGetToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -582,7 +721,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -673,12 +812,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -693,6 +829,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -703,10 +840,52 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size;
+
+			size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -722,67 +901,82 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
  *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * This also stores it in the next free slot of the
+ * proute->partition_dispatch_info array, expanding that array if there
+ * isn't enough space left in it, and records the chosen index in
+ * parent_pd->indexes[partidx].
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
-
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
-
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -827,8 +1021,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -849,10 +1043,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -861,21 +1051,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one of the node's subplan result rels;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -889,144 +1077,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..365b4fd6f9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1667,7 +1666,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1710,21 +1709,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1769,7 +1760,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PartitionTupRoutingGetToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1784,16 +1775,14 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PartitionTupRoutingGetToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot,
+	ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute, partidx),
+							  tuple, proute->partition_tuple_slot, &slot,
 							  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
@@ -1830,17 +1819,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1862,79 +1840,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 115a9fe78f..82acfeb460 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -594,6 +594,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -782,7 +783,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..8a639b8b7d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index f6cd842cc9..7370e24b1c 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -31,9 +31,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For leaf partitions the
+ *				index into the PartitionTupleRouting->partitions array is
+ *				stored.  When the partition is itself a partitioned table then
+ *				we store the index into
+ *				PartitionTupleRouting->partition_dispatch_info.  -1 means
+ *				we've not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -44,74 +48,121 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to the PartitionDispatch object for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present as the first entry of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for new
+ *							PartitionDispatch which need to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to a ResultRelInfos of all
+ *							containing pointers to the ResultRelInfos of all
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'child_parent_map_not_required' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  Until then it is set
+ *							to NULL.
+ *
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
+ *							the rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtypes.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							reused by tuple routing, saving us from building
+ *							duplicate ResultRelInfos.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, if the
+ *							leaf partition's rowtype is different.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionTupRoutingGetToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionTupRoutingGetToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -200,22 +251,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1
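
To make the indexes encoding described in the patch's comments easier to
follow: each element starts out as -1 ("nothing allocated yet") and is
lazily assigned the next free slot the first time a tuple is routed to
that partition, which is why the 'partitions' array ends up ordered by
routing order rather than by bound order.  The following toy program
(standalone C with invented names, not PostgreSQL code; sub-partitioned
tables omitted) models just that encoding:

#include <stdio.h>
#include <string.h>

#define NPARTS 4

static int indexes[NPARTS];		/* per-partition slot; -1 = not yet set up */
static int num_partitions = 0;	/* next free slot in the partitions array */

/* Return this partition's slot, assigning one on first use. */
static int
route(int partidx)
{
	if (indexes[partidx] < 0)
		indexes[partidx] = num_partitions++;
	return indexes[partidx];
}

int
main(void)
{
	/* all -1, as ExecInitPartitionDispatchInfo does with memset() */
	memset(indexes, -1, sizeof(indexes));

	int			a = route(2);
	int			b = route(0);
	int			c = route(2);

	printf("%d %d %d\n", a, b, c);	/* prints "0 1 0" */
	return 0;
}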

#27Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#26)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/08/22 21:30, David Rowley wrote:

On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+#define PartitionTupRoutingGetToParentMap(p, i) \
+#define PartitionTupRoutingGetToChildMap(p, i) \

If the "Get" could be replaced by "Child" and "Parent", respectively,
they'd sound more meaningful, imho.

I did that to save 3 chars. I think putting the additional
Child/Parent in the name is not really required. It's not as if we're
going to have a ParentToParent or a ChildToChild map, so I thought it
might be okay to assume that if it's "ToParent", that it's being
converted from the child and "ToChild" seems safe to assume it's being
converted from the parent. I can change it though if you feel very
strongly that what I've got is no good.

No strong preference as such. Maybe, let's defer to committer.
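
For readers following the naming discussion, the behaviour these macros
encapsulate is worth spelling out: because the map arrays are now
allocated only once the first map must be stored, a NULL array and a
NULL array element both mean "no conversion required", which is what
made the old child_parent_map_not_required array redundant.  A minimal
standalone sketch of that contract (the macro body is copied from the
patch; the stand-in types around it are invented):

#include <stdio.h>

/* Stand-in; the real TupleConversionMap lives in access/tupconvert.h */
typedef struct TupleConversionMap TupleConversionMap;

typedef struct PartitionTupleRouting
{
	TupleConversionMap **child_parent_tupconv_maps;	/* NULL until first map */
} PartitionTupleRouting;

#define PartitionTupRoutingGetToParentMap(p, i) \
			((p)->child_parent_tupconv_maps != NULL ? \
				(p)->child_parent_tupconv_maps[(i)] : \
							NULL)

int
main(void)
{
	PartitionTupleRouting proute = {NULL};

	/* No array was ever allocated, so every partition reports "no map". */
	if (PartitionTupRoutingGetToParentMap(&proute, 3) == NULL)
		printf("partition 3: no conversion required\n");
	return 0;
}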

I've looked at v6 and spotted some minor typos.

+ * ResultRelInfo for, before we go making one, we check for a
pre-made one

s/making/make/g

I disagree, but perhaps we can just reword it a bit. I've now got:

+ * Every time a tuple is routed to a partition that we've yet to set the
+ * ResultRelInfo for, before we go to the trouble of making one, we check
+ * for a pre-made one in the hash table.

Sure. I guess "to the trouble of" was missing then. :)

+ /* If nobody else set the per-subplan array of maps, do so ouselves. */

I guess I'm the one to blame here for misspelling "ourselves".

Thanks for noticing.

Since the above two are minor issues, fixed them myself in the attached
updated version; didn't touch the macro though.

I've attached a v8. The only change from your v7 is in the "go making" comment.

Thanks.

Do you agree to setting this patch to "Ready for Committer" in the
September CF?

I read through the entire patch a couple of times yesterday and saw
nothing else, so yeah, I think now is a good time for someone with
more authority to have a look at it.

Okay, doing it now.

Thanks,
Amit
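
As background for ExecExpandRoutingArrays in the patch above: the arrays
start at PARTITION_ROUTING_INITSIZE (8) entries and are doubled on
demand, and only the optional map arrays need their new tail zero-filled,
since a NULL element there means "no map".  A rough standalone sketch of
the same pattern (invented names; the real code uses repalloc on the
proute fields; error checking is omitted here for brevity):

#include <stdlib.h>
#include <string.h>

#define PARTITION_ROUTING_INITSIZE 8	/* same starting size as the patch */

typedef struct ResultRelInfo ResultRelInfo;	/* stand-in, used by pointer only */

/* Double the capacity; zero only the newly added tail of the map array. */
static void
expand_routing_arrays(ResultRelInfo ***partitions, void ***maps, int *allocsize)
{
	int			old_size = *allocsize;
	int			new_size = old_size * 2;

	*partitions = realloc(*partitions, sizeof(ResultRelInfo *) * new_size);

	if (*maps != NULL)
	{
		*maps = realloc(*maps, sizeof(void *) * new_size);
		memset(&(*maps)[old_size], 0, sizeof(void *) * (new_size - old_size));
	}
	*allocsize = new_size;
}

int
main(void)
{
	int			allocsize = PARTITION_ROUTING_INITSIZE;
	ResultRelInfo **partitions = calloc(allocsize, sizeof(ResultRelInfo *));
	void	  **maps = NULL;	/* allocated only if a map is ever stored */

	expand_routing_arrays(&partitions, &maps, &allocsize);	/* now 16 slots */
	free(partitions);
	return 0;
}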

#28David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#26)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 23 August 2018 at 00:30, David Rowley <david.rowley@2ndquadrant.com>
wrote:

I've attached a v8. The only change from your v7 is in the "go making"
comment.

v9 patch attached. Fixes conflict with 6b78231d.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v9-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
From 6e997c119a713ce953fb21f0b7908c9e1b59e540 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v9] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partition's ResultRelInfo and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partition's ResultRelInfos are initialized
rather than in the same order as partdesc.

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the most expensive part of the
function.  This patch just removes the other slow parts.

Initialization of the parent-to-child and child-to-parent translation
map arrays is now performed only when we need to store the first
translation map.  If the column order between the parent and its child
is the same, then no map ever needs to be stored, and these (possibly
large) arrays served no purpose.  Because we now always initialize the
child-to-parent map whenever transition capture is required, we no
longer need the child_parent_map_not_required array.  Previously that
array was needed only so we could distinguish a map being unnecessary
from a map not yet having been initialized.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, executor shutdown was also slow in comparison to the
actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to iterate over an array which contained mostly NULLs
that had to be skipped.  Performance has now improved because the array we
loop over no longer contains NULL values to skip.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  48 +-
 src/backend/executor/execPartition.c          | 804 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 109 +---
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 161 ++++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 8 files changed, 626 insertions(+), 561 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..0dfb9e2e95 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2510,8 +2510,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2521,19 +2525,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2699,10 +2692,8 @@ CopyFrom(CopyState cstate)
 			 * will get us the ResultRelInfo and TupleConversionMap for the
 			 * partition, respectively.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2800,15 +2791,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2845,8 +2828,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionTupRoutingGetToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2864,11 +2846,13 @@ CopyFrom(CopyState cstate)
 			 * partition rowtype.  Don't free the already stored tuple as it
 			 * may still be required for a multi-insert batch.
 			 */
-			tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
-											  tuple,
-											  proute->partition_tuple_slot,
-											  &slot,
-											  false);
+			tuple =
+				ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute,
+																		   leaf_part_index),
+										  tuple,
+										  proute->partition_tuple_slot,
+										  &slot,
+										  false);
 
 			tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 38ecc4192e..a4e7e70525 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,6 +31,7 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
@@ -45,9 +46,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array with partdesc->nparts elements.  For a leaf
+ *				partition, we store its index into the
+ *				PartitionTupleRouting->partitions array.  When the
+ *				partition is itself a partitioned table, we store its
+ *				index into PartitionTupleRouting->partition_dispatch_info.
+ *				-1 means we've not yet allocated anything in
+ *				PartitionTupleRouting for the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,7 +63,7 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	TupleConversionMap *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
@@ -66,6 +71,16 @@ static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
 								 int *num_parted, List **leaf_part_oids);
 static void get_partition_dispatch_recurse(Relation rel, Relation parent,
 							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch parent, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,143 +107,119 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
-	}
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * lazily, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays.  More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * The PartitionDispatch for the target partitioned table of the command
+	 * must be set up, but any sub-partitioned tables can be set up lazily as
+	 * and when the tuples get routed to (through) them.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
 
 	/*
-	 * Initialize an empty slot that will be used to manipulate tuples of any
-	 * given partition's rowtype.  It is attached to the caller-specified node
-	 * (such as ModifyTableState) and released when the node finishes
-	 * processing.
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
 	 */
-	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
-	i = 0;
-	foreach(cell, leaf_parts)
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
+	 */
+	if (node && node->operation == CMD_UPDATE)
 	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
 	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Initialize an empty slot that will be used to manipulate tuples of any
+	 * given partition's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
-	MemoryContext	oldcxt;
-	HeapTuple		tuple;
+	MemoryContext oldcxt;
+	HeapTuple	tuple;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -246,9 +237,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		TupleConversionMap *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -274,37 +266,114 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
-		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may require
+			 * building a new ResultRelInfo.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					SubplanResultRelHashElem *elem;
+					Oid			partoid = partdesc->oids[partidx];
+
+					elem = (SubplanResultRelHashElem *)
+						hash_search(proute->subplan_resultrel_hash,
+									&partoid, HASH_FIND, NULL);
+
+					if (elem)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = elem->rri;
+					}
+				}
+
+				/* We need to create one afresh. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 
 			/*
 			 * Release the dedicated slot, if it was used.  Create a copy of
@@ -317,58 +386,122 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* A partition was not found. */
-	if (result < 0)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
+
+		elem = (SubplanResultRelHashElem *) hash_search(htab, &partoid,
+														HASH_ENTER, &found);
+
+		if (!found)
+			elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
+
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+	proute->partitions_allocsize = new_size;
 
-	return result;
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot of 'proute's partitions array,
+ *		returning the index of that element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -544,15 +677,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -565,7 +708,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -578,7 +721,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -592,7 +735,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -602,8 +745,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionTupRoutingGetToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -612,7 +759,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -703,12 +850,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -723,6 +867,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -733,10 +878,52 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
+
+	if (map)
+	{
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size;
+
+			size = proute->partitions_allocsize;
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
+	}
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -752,67 +939,82 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table
  *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * This also stores it in the next free slot of the
+ * proute->partition_dispatch_info array, expanding that array if there isn't
+ * enough space left, and, if there is a parent, records that slot's index in
+ * the parent's 'indexes' array at 'partidx'.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
-
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
-
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap =
+			convert_tuples_by_name(RelationGetDescr(parent_pd->reldesc),
+								   tupdesc,
+								   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	dispatchidx = proute->num_dispatch++;
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -857,8 +1059,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -879,10 +1081,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -891,21 +1089,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -919,144 +1115,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name(RelationGetDescr(parent),
-											tupdesc,
-											gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
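
/*
 * A self-contained sketch (illustrative names, not from the patch) of the
 * grow-on-demand pattern ExecExpandRoutingArrays() above applies to the
 * 'partitions' and tuple-conversion-map arrays:
 */
#include <stdlib.h>

typedef struct PtrArray
{
	void	  **items;
	int			nitems;
	int			allocsize;
} PtrArray;

static void
ptrarray_init(PtrArray *arr, int initsize)
{
	arr->items = malloc(sizeof(void *) * initsize);
	arr->nitems = 0;
	arr->allocsize = initsize;
}

static int
ptrarray_append(PtrArray *arr, void *item)
{
	if (arr->nitems >= arr->allocsize)
	{
		/* Double the allocation; this keeps appends amortized O(1). */
		arr->allocsize *= 2;
		arr->items = realloc(arr->items, sizeof(void *) * arr->allocsize);
		if (arr->items == NULL)
			abort();			/* repalloc() would ereport() instead */
	}
	arr->items[arr->nitems] = item;
	/* Return the slot's index; the patch records it in dispatch->indexes[]. */
	return arr->nitems++;
}
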
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..365b4fd6f9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1667,7 +1666,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1710,21 +1709,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	Assert(proute->partitions[partidx] != NULL);
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1769,7 +1760,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PartitionTupRoutingGetToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1784,16 +1775,14 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PartitionTupRoutingGetToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
-							  tuple,
-							  proute->partition_tuple_slot,
-							  &slot,
+	ConvertPartitionTupleSlot(PartitionTupRoutingGetToChildMap(proute, partidx),
+							  tuple, proute->partition_tuple_slot, &slot,
 							  true);
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
@@ -1830,17 +1819,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1862,79 +1840,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index e35a43405e..16eb728370 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
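
/*
 * Why cache is_leaf[] in the PartitionDesc: the routing hot path can now
 * test a flag in an array instead of performing a get_rel_relkind()
 * syscache lookup for each partitioning level of each routed tuple.
 * Condensed sketch of the consuming code in ExecFindPartition() (see the
 * execPartition.c hunks above; illustrative, not a new API):
 *
 *		partidx = get_partition_for_tuple(dispatch, values, isnull);
 *		if (partdesc->is_leaf[partidx])
 *			... find or build the leaf's ResultRelInfo ...
 *		else
 *			... descend into the sub-partitioned table ...
 */
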
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index 1f49e5d3a9..8a639b8b7d 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -26,7 +26,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 89ce53815c..de403e0f5a 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -22,68 +22,115 @@
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slot			TupleTableSlot to be used to manipulate any
- *								given leaf partition's rowtype after that
- *								partition is chosen for insertion by
- *								tuple-routing.
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							pointers to PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present as the first entry of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for any
+ *							new PartitionDispatch which needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps' and
+ *							'child_parent_tupconv_maps' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated, it's
+ *							set to NULL.
+ *
+ * partition_tuple_slot		This is a tuple slot used to store a tuple using
+ *							the rowtype of the partition chosen by tuple
+ *							routing.  Maintained separately because partitions
+ *							may have different rowtypes.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype.  That is,
+ *							if the leaf partition's rowtype is different.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot *partition_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 	TupleTableSlot *root_tuple_slot;
+	TupleTableSlot *partition_tuple_slot;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionTupRoutingGetToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionTupRoutingGetToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -172,22 +219,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
 						  HeapTuple tuple,
 						  TupleTableSlot *new_slot,
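
/*
 * On the "Beware of multiple evaluations of p!" warning on the accessor
 * macros added above: each macro expands its first argument twice, so the
 * argument must be free of side effects.  A hypothetical example:
 *
 *		map = PartitionTupRoutingGetToChildMap(proute, partidx);	(OK)
 *		map = PartitionTupRoutingGetToChildMap(next_route(itr), i);	(BAD:
 *			next_route() would be evaluated twice)
 */
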
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#29David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#28)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 17 September 2018 at 21:15, David Rowley
<david.rowley@2ndquadrant.com> wrote:

v9 patch attached. Fixes conflict with 6b78231d.

v10 patch attached. Fixes conflict with cc2905e9.

I'm not so sure we need to zero the partition_tuple_slots[] array at
all, since we always set a value there if there's a corresponding map
stored. I considered pulling that out, but in the end I didn't, as I
saw some Asserts in nodeModifyTable.c and copy.c that verify it's been
properly set by checking that the element != NULL. Perhaps I should
have just gotten rid of those Asserts along with the palloc0 and the
subsequent memset after the expansion of the array. I'm undecided.
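
For what it's worth, the reason the memset matters for as long as those
Asserts stay is easy to see in a minimal sketch (hypothetical helper,
not part of the patch): after a repalloc the tail of the enlarged array
is uninitialized, so the element != NULL checks would otherwise read
garbage.

static TupleTableSlot **
grow_slot_array(TupleTableSlot **slots, int old_size, int new_size)
{
	/* Hypothetical helper for illustration only. */
	slots = (TupleTableSlot **)
		repalloc(slots, sizeof(TupleTableSlot *) * new_size);

	/* Zero only the new tail so "element != NULL" tests stay reliable. */
	memset(&slots[old_size], 0,
		   sizeof(TupleTableSlot *) * (new_size - old_size));

	return slots;
}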

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v10-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From a248e365811696a473123ec211b0310963d88e08 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v10] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.
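
As a condensed sketch, the append-and-grow pattern looks roughly like
this (hypothetical helper; ExecInitPartitionInfo and
ExecExpandRoutingArrays do the real work):

static int
append_partition_rri(PartitionTupleRouting *proute, ResultRelInfo *rri)
{
	/* Hypothetical helper for illustration only. */
	int			idx = proute->num_partitions++;

	/* Double the array when we run out of space. */
	if (idx >= proute->partitions_allocsize)
	{
		proute->partitions_allocsize *= 2;
		proute->partitions = (ResultRelInfo **)
			repalloc(proute->partitions,
					 sizeof(ResultRelInfo *) *
					 proute->partitions_allocsize);
	}

	proute->partitions[idx] = rri;
	return idx;
}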

The slowest part of ExecSetupPartitionTupleRouting still remains: the
find_all_inheritors call is by far the slowest part of the function.
This patch just removes the other slow parts.

Initialization of the parent-to-child and child-to-parent translation map
arrays is now only performed when we need to store the first translation
map.  If the column order between the parent and its child is the same,
then no map ever needs to be stored, so these (possibly large) arrays did
nothing useful.  Because we now always initialize the child-to-parent map
whenever transition capture is required, we no longer need the
child_parent_map_not_required array.  Previously that array was required
so we could distinguish between no map being required and the map not yet
having been initialized.
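
As a rough sketch, the allocate-on-first-map behaviour looks like this
(simplified from the patch's ExecInitRoutingInfo; 'parentdesc' and
'partdesc' stand in for the real tuple descriptors):

	map = convert_tuples_by_name(parentdesc, partdesc,
								 gettext_noop("could not convert row type"));
	if (map != NULL)
	{
		/*
		 * Allocate the (possibly large) array only now that we know the
		 * first map actually needs to be stored.
		 */
		if (proute->parent_child_tupconv_maps == NULL)
			proute->parent_child_tupconv_maps = (TupleConversionMap **)
				palloc0(sizeof(TupleConversionMap *) *
						proute->partitions_allocsize);

		proute->parent_child_tupconv_maps[partidx] = map;
	}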

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to walk an array which contained mostly NULLs that
had to be skipped.  Performance has now improved, as the array we loop
over no longer contains NULL values to skip.
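
In code terms, the cleanup loop now reduces to something like this
(simplified; the real loop also lets any exercised FDWs shut down and
skips result rels owned by the subplans):

	for (i = 0; i < proute->num_partitions; i++)
	{
		ResultRelInfo *resultRelInfo = proute->partitions[i];

		/* Every element is initialized, so no NULL check is needed. */
		ExecCloseIndices(resultRelInfo);
		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
	}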

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  46 +-
 src/backend/executor/execPartition.c          | 843 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 105 +---
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 166 +++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 9 files changed, 657 insertions(+), 571 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 32706fad90..b7d7cffc81 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2512,8 +2512,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2523,19 +2527,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2695,17 +2688,11 @@ CopyFrom(CopyState cstate)
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2803,15 +2790,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2848,8 +2827,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionTupRoutingGetToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2866,7 +2844,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = PartitionTupRoutingGetToChildMap(proute, leaf_part_index);
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 832c79b41e..94f0230540 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,6 +31,7 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
@@ -45,9 +46,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For leaf partitions the
+ *				index into the parenting PartitionTupleRouting's 'partitions'
+ *				array is stored.  When the partition is itself a partitioned
+ *				table then we store the index into the parenting
+ *				PartitionTupleRouting's 'partition_dispatch_info' array.  An
+ *				index of -1 means we've not yet allocated anything in
+ *				PartitionTupleRouting for the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +63,20 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +103,112 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * on demand, only when we actually need to route a tuple to that
+	 * partition.  The reason for this is that a common case is an INSERT
+	 * of a single tuple into a partitioned table, and that must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays.  More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * Initially we only set up one PartitionDispatch object: the one for
+	 * the partitioned table that's the target of the command.  If we must
+	 * route a tuple via some sub-partitioned table, the PartitionDispatch
+	 * for that table is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			update_rri_index++;
-		}
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->partition_tuple_slots = NULL;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+	{
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
+	}
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -236,9 +229,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +254,251 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  We may need to
+			 * build a new ResultRelInfo first.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create a new one. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* A partition was not found. */
-	if (result < 0)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
+
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->partition_tuple_slots != NULL)
+	{
+		proute->partition_tuple_slots = (TupleTableSlot **)
+			repalloc(proute->partition_tuple_slots,
+					 sizeof(TupleTableSlot *) * new_size);
+		memset(&proute->partition_tuple_slots[old_size], 0,
+			   sizeof(TupleTableSlot *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in proute's partitions array.
+ *		Return the index of the array element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +674,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +705,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +718,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +732,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +742,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionTupRoutingGetToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +756,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,12 +847,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -699,6 +864,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -709,29 +875,42 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
 
 	/*
-	 * If a partition has a different rowtype than the root parent, initialize
-	 * a slot dedicated to storing this partition's tuples.  The slot is used
-	 * for various operations that are applied to tuples after routing, such
-	 * as checking constraints.
+	 * If a partition has a different rowtype than the root parent, store the
+	 * translation map and initialize a slot dedicated to storing this
+	 * partition's tuples.  The slot is used for various operations that are
+	 * applied to tuples after routing, such as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (map)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+
+		/*
+		 * Initialize the array in proute where these slots are stored, if
+		 * not already done.
+		 */
 		if (proute->partition_tuple_slots == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
 			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
+				palloc0(sizeof(TupleTableSlot *) * size);
+		}
 
 		/*
 		 * Initialize the slot itself setting its descriptor to this
@@ -741,6 +920,35 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		proute->partition_tuple_slots[partidx] =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
 	}
 
 	/*
@@ -757,67 +965,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record that index in the 'partidx' element of the
+ *		'parent_pd' indexes[] array so that we can properly retrieve the
+ *		newly created PartitionDispatch later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1059,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -852,10 +1081,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -864,21 +1089,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -890,144 +1113,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 24beb40435..40fe9c6b0f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1665,7 +1664,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	Assert(proute->partitions[partidx] != NULL);
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1768,7 +1759,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				PartitionTupRoutingGetToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1783,13 +1774,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			PartitionTupRoutingGetToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = PartitionTupRoutingGetToChildMap(proute, partidx);
 	if (map != NULL)
 	{
 		TupleTableSlot *new_slot;
@@ -1834,17 +1825,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1846,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 8b4a9ca044..10ac801f54 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -22,71 +22,119 @@
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							pointers to the PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for any
+ *							new PartitionDispatch that needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for any new
+ *							ResultRelInfo that needs to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps',
+ *							'child_parent_map_not_required' and
+ *							'partition_tuple_slots' arrays.
+ *
+ * parent_child_tupconv_maps	Array of 'partitions_allocsize' elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  Until then it's set to
+ *							NULL.
+ *
+ * partition_tuple_slots	Array of TupleTableSlot objects; if non-NULL,
+ *							contains one entry for every leaf partition,
+ *							of which only those of the leaf partitions
+ *							whose attribute numbers differ from the root
+ *							parent have a non-NULL value.  NULL if all of
+ *							the partitions encountered by a given command
+ *							happen to have same rowtype as the root parent
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, when the
+ *							two rowtypes differ.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
 	TupleTableSlot **partition_tuple_slots;
 	TupleTableSlot *root_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionTupRoutingGetToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionTupRoutingGetToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -175,22 +223,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#30Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#29)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi David,

On 2018/10/05 21:55, David Rowley wrote:

On 17 September 2018 at 21:15, David Rowley
<david.rowley@2ndquadrant.com> wrote:

v9 patch attached. Fixes conflict with 6b78231d.

v10 patch attached. Fixes conflict with cc2905e9.

Thanks for rebasing.

I'm not so sure we need to zero the partition_tuple_slots[] array at
all since we always set a value there if there's a corresponding map
stored. I considered pulling that out, but in the end, I didn't as I
saw some Asserts checking it's been properly set by checking the
element != NULL in nodeModifyTable.c and copy.c. Perhaps I should
have just gotten rid of those Asserts along with the palloc0 and
subsequent memset after the expansion of the array. I'm undecided.

Maybe it's a good thing that it's doing the same thing as with the
child_to_parent_maps array, which is to zero-init it when allocated.

Thanks,
Amit

#31David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#30)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 9 October 2018 at 15:49, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/10/05 21:55, David Rowley wrote:

I'm not so sure we need to zero the partition_tuple_slots[] array at
all since we always set a value there if there's a corresponding map
stored. I considered pulling that out, but in the end, I didn't as I
saw some Asserts checking it's been properly set by checking the
element != NULL in nodeModifyTable.c and copy.c. Perhaps I should
have just gotten rid of those Asserts along with the palloc0 and
subsequent memset after the expansion of the array. I'm undecided.

Maybe it's a good thing that it's doing the same thing as with the
child_to_parent_maps array, which is to zero-init it when allocated.

Perhaps, but the maps do need to be zeroed; the partition_tuple_slots
array does not, since we only access it when the parent-to-child map is
set.

In any case, since PARTITION_ROUTING_INITSIZE is just 8, skipping the
zeroing is unlikely to save much: for single-row INSERTs on a 64-bit
machine it's really just saving a memset(..., 0, 64). So it likely
won't save more than a bunch of nanoseconds.
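
For concreteness, a throwaway sketch of what's at stake (a toy program,
assuming 8-byte pointers; the real array lives in PartitionTupleRouting):

#include <stdlib.h>
#include <string.h>

int
main(void)
{
	/* PARTITION_ROUTING_INITSIZE == 8, so zero-initializing the initial
	 * pointer array amounts to one 64-byte memset. */
	void	  **partition_tuple_slots = malloc(8 * sizeof(void *));

	memset(partition_tuple_slots, 0, 8 * sizeof(void *));
	free(partition_tuple_slots);
	return 0;
}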

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#32Krzysztof Nienartowicz
krzysztof.nienartowicz@gmail.com
In reply to: David Rowley (#31)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

We see a quite prohibitive 5-6x slowdown with native partitioning in
comparison to trigger-based partitioning in PG 9.5.
This is clearly visible with highly parallel inserts (we can share
flame graphs comparing the two).

This basically rules out native partitioning for us. Do you think your
changes could be backported to PG 10? We checked, and this would need
quite a number of changes, but given the weight of this change, maybe
it could be considered?

Thanks
Krzysztof


#33Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Krzysztof Nienartowicz (#32)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi,

On 2018/10/15 19:04, Krzysztof Nienartowicz wrote:

We see a quite prohibitive 5-6x slowdown with native partitioning in
comparison to trigger-based partitioning in PG 9.5.
This is clearly visible with highly parallel inserts (we can share
flame graphs comparing the two).

This basically rules out native partitioning for us. Do you think your
changes could be backported to PG 10? We checked, and this would need
quite a number of changes, but given the weight of this change, maybe
it could be considered?

It's unfortunate that PG 10's partitioning cannot be used for your use
case, but I don't think such a major refactoring will be back-ported to 10
or 11. :-(

Thanks,
Amit

#34David Rowley
david.rowley@2ndquadrant.com
In reply to: Krzysztof Nienartowicz (#32)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 15 October 2018 at 23:04, Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:

We see a quite prohibitive 5-6x slowdown with native partitioning in
comparison to trigger-based partitioning in PG 9.5.
This is clearly visible with highly parallel inserts (we can share
flame graphs comparing the two).

Does the 0001 patch here fix the problem? I imagined that it would be
the locking of all partitions that would have killed the performance.

This basically rules out native partitioning for us. Do you think your
changes could be backported to PG 10? We checked, and this would need
quite a number of changes, but given the weight of this change, maybe
it could be considered?

It's very unlikely to happen, especially so with the 0002 patch, which
I've so far just attached as a demonstration of where the performance
could end up.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#35Krzysztof Nienartowicz
krzysztof.nienartowicz@gmail.com
In reply to: David Rowley (#34)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

In the end we hacked the code to re-enable triggers on partitioned
tables and switch off the native insert code on partitioned tables.
Quite hackish, and it would be nice to have it fixed in a more natural
manner. Yes, it looked like locking, but not only that: in
ExecSetupPartitionTupleRouting, ExecOpenIndices/find_all_inheritors
looked dominant, and also convert_tuples_by_name, though I'm not sure
the last one wasn't just an artifact of perf sampling.
Will check the patch 0001, thanks.


#36David Rowley
david.rowley@2ndquadrant.com
In reply to: Krzysztof Nienartowicz (#35)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 23 October 2018 at 11:55, Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:

In the end we hacked the code to re-enable triggers on partitioned
tables and switch off the native insert code on partitioned tables.
Quite hackish, and it would be nice to have it fixed in a more natural
manner. Yes, it looked like locking, but not only that: in
ExecSetupPartitionTupleRouting, ExecOpenIndices/find_all_inheritors
looked dominant, and also convert_tuples_by_name, though I'm not sure
the last one wasn't just an artifact of perf sampling.

The ExecOpenIndices part was likely fixed in edd44738bc8 (PG11).
find_all_inheritors does obtain the partition locks during the call,
so the slowness there is most likely down to the locking rather than
the scanning of pg_inherits.
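
For reference, the call in question, as it appears in
ExecSetupPartitionTupleRouting once the 0001 patch is applied, takes one
lock per partition in a single pass:

	/* Lock all the partitions. */
	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);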

42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12).

Will check the patch 0001, thanks.

I more meant that it might be 0002 that fixes the issue for you. I
just wanted to check if you'd tried 0001 and found that the problem
was fixed with that alone.

Do you mind sharing how many partitions you have and how many columns
the partitioned table has?

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#37Krzysztof Nienartowicz
krzysztof.nienartowicz@gmail.com
In reply to: David Rowley (#36)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Tue, Oct 23, 2018 at 4:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 23 October 2018 at 11:55, Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:

In the end we hacked the code to re-enable triggers on partitioned
tables and switch off the native insert code on partitioned tables.
Quite hackish, and it would be nice to have it fixed in a more natural
manner. Yes, it looked like locking, but not only that: in
ExecSetupPartitionTupleRouting, ExecOpenIndices/find_all_inheritors
looked dominant, and also convert_tuples_by_name, though I'm not sure
the last one wasn't just an artifact of perf sampling.

The ExecOpenIndices part was likely fixed in edd44738bc8 (PG11).
find_all_inheritors does obtain the partition locks during the call,
so the slowness there is most likely down to the locking rather than
the scanning of pg_inherits.

42f70cd9c3dbf improved the situation for convert_tuples_by_name (PG12).

Will check the patch 0001, thanks.

I more meant that it might be 0002 that fixes the issue for you. I
just wanted to check if you'd tried 0001 and found that the problem
was fixed with that alone.

Will it apply on PG10? (In fact the code base is PG XL10 but
src/backend/executor/nodeModifyTable.c is pure PG)

Do you mind sharing how many partitions you have and how many columns
the partitioned table has?

We have 2-level partitioning: 10 (possibly changing, up to say 20-30)
range partitions at the first level and 20 range partitions at the
second level. We potentially have hundreds of processes inserting at
the same time.


#38Krzysztof Nienartowicz
krzysztof.nienartowicz@gmail.com
In reply to: Krzysztof Nienartowicz (#37)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

To complement the info: the number of columns varies from 20 to 100,
but some of the columns are composite types or arrays of composite
types.

The flame graph after applying the changes from patch 0002 can be seen
here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP
It shows most of the time is spent in convert_tuples_by_name (PG 10
version).

Thanks

#39Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Krzysztof Nienartowicz (#38)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/10/30 8:41, Krzysztof Nienartowicz wrote:

On Thu, Oct 25, 2018 at 5:58 PM Krzysztof Nienartowicz
<krzysztof.nienartowicz@gmail.com> wrote:

On Tue, Oct 23, 2018 at 4:02 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

I more meant that it might be 0002 that fixes the issue for you. I
just wanted to check if you'd tried 0001 and found that the problem
was fixed with that alone.

Will it apply on PG10? (In fact the code base is PG XL10 but
src/backend/executor/nodeModifyTable.c is pure PG)

To complement the info: the number of columns varies from 20 to 100,
but some of the columns are composite types or arrays of composite
types.

The flame graph after applying the changes from patch 0002 can be seen
here: https://gaiaowncloud.isdc.unige.ch/index.php/s/W3DLecAWAfkesiP
It shows most of the time is spent in convert_tuples_by_name (PG 10
version).

As David mentioned, the patches on this thread are meant to be applied
against the latest PG 12 HEAD.  The insert tuple routing code has
undergone quite a bit of refactoring in PG 11, which itself should have
gotten rid of at least some of the hot-spots seen in the flame graph you
shared.

What happens in PG 10 (as seen in the flame graph) is that
ExecSetupPartitionTupleRouting initializes information for *all*
partitions, and does so even before the 1st tuple is processed.  So if
there are many partitions with many columns, a lot of processing
happens in ExecSetupPartitionTupleRouting.  PG 11 changes this such that
the partition info is only initialized after the 1st tuple is processed,
and only for the partition that's targeted, but some overheads still
remain in that code.  The patches on this thread are meant to address
those overheads.
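
To make the contrast concrete, here's a tiny self-contained C sketch
(toy types and names, not the actual executor code) of the eager PG 10
behaviour versus the on-demand behaviour that PG 11 and these patches
move towards:

#include <stdlib.h>

typedef struct PartInfo
{
	int			setup_done;		/* stand-in for ResultRelInfo etc. */
} PartInfo;

/* PG 10 style: pay the setup cost for all nparts before the 1st tuple */
static PartInfo **
setup_all_eagerly(int nparts)
{
	PartInfo  **parts = calloc(nparts, sizeof(PartInfo *));

	for (int i = 0; i < nparts; i++)
	{
		parts[i] = calloc(1, sizeof(PartInfo));
		parts[i]->setup_done = 1;
	}
	return parts;
}

/* PG 11 style: build a partition's info only when a tuple is routed to it */
static PartInfo *
get_partition_lazily(PartInfo **parts, int idx)
{
	if (parts[idx] == NULL)
	{
		parts[idx] = calloc(1, sizeof(PartInfo));
		parts[idx]->setup_done = 1;
	}
	return parts[idx];
}

int
main(void)
{
	PartInfo  **parts = calloc(10000, sizeof(PartInfo *));

	get_partition_lazily(parts, 42);	/* single-row INSERT: O(1) setup */
	return 0;
}

With the lazy scheme, a single-row INSERT into a 10000-partition table
does setup work for exactly one partition rather than for all of them.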

Unfortunately, I don't think the community will agree to back-porting the
changes in PG 11 and the patches being discussed here to PG 10.

Thanks,
Amit

#40Krzysztof Nienartowicz
krzysztof.nienartowicz@gmail.com
In reply to: Amit Langote (#39)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Thanks for both clarifications!
I skimmed through the commits related to INSERT with partitioning
since PG 10 and indeed, while not impossible, it seems like quite some
work to merge them into the PG 10 codebase.
We might consider preparing the patch in-house, as otherwise PG 10's
native partitioning is a major regression for us and we'd have to go
back to the inheritance-based approach, which seems the best option
for now.
Regards,
Krzysztof


#41Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#26)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Wed, Aug 22, 2018 at 8:30 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+#define PartitionTupRoutingGetToParentMap(p, i) \
+#define PartitionTupRoutingGetToChildMap(p, i) \

If the "Get" could be replaced by "Child" and "Parent", respectively,
they'd sound more meaningful, imho.

I did that to save 3 chars. I think putting the additional
Child/Parent in the name is not really required. It's not as if we're
going to have a ParentToParent or a ChildToChild map, so I thought it
might be okay to assume that if it's "ToParent", that it's being
converted from the child and "ToChild" seems safe to assume it's being
converted from the parent. I can change it though if you feel very
strongly that what I've got is no good.

I'm not sure exactly what is best here, but it seems unlikely to me
that somebody is going to read that macro name and think, oh, that
means "get the to-parent map". They are more likely to be confused by
the juxtaposition of "get" and "to".

I think a better way to shorten the name would be to truncate the
PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#42David Rowley
david.rowley@2ndquadrant.com
In reply to: Robert Haas (#41)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 22, 2018 at 8:30 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

+#define PartitionTupRoutingGetToParentMap(p, i) \
+#define PartitionTupRoutingGetToChildMap(p, i) \

If the "Get" could be replaced by "Child" and "Parent", respectively,
they'd sound more meaningful, imho.

I did that to save 3 chars. I think putting the additional
Child/Parent in the name is not really required. It's not as if we're
going to have a ParentToParent or a ChildToChild map, so I thought it
might be okay to assume that if it's "ToParent", that it's being
converted from the child and "ToChild" seems safe to assume it's being
converted from the parent. I can change it though if you feel very
strongly that what I've got is no good.

I'm not sure exactly what is best here, but it seems unlikely to me
that somebody is going to read that macro name and think, oh, that
means "get the to-parent map". They are more likely to be confused by
the juxtaposition of "get" and "to".

I think a better way to shorten the name would be to truncate the
PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.

Thanks for chipping in on this.

I agree. I don't think "TupRouting" really needs to be in the name.
Probably "To" can also just become "2" and we can put back the
Parent/Child before that.

I've attached v11, which does this.
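
For reference, here's how the renamed accessors presumably read in v11
(the macro definitions themselves aren't in the hunks quoted below, so
the bodies here are assumed to match the earlier
PartitionTupRoutingGet*Map versions; only the new names are confirmed by
the patch):

/* Beware of multiple evaluations of p! */
#define PartitionChild2ParentMap(p, i) \
			((p)->child_parent_tupconv_maps != NULL ? \
				(p)->child_parent_tupconv_maps[(i)] : \
							NULL)

#define PartitionParent2ChildMap(p, i) \
			((p)->parent_child_tupconv_maps != NULL ? \
				(p)->parent_child_tupconv_maps[(i)] : \
							NULL)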

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v11-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream) Download
From 6dff6773b0684ebf5d390372ee240462eb4bf138 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v11] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partition's ResultRelInfo and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partition's ResultRelInfos are initialized
rather than in same order as partdesc.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This patch just removes the other slow
parts.

Initialization of the parent-to-child and child-to-parent translation map
arrays is now only performed when we need to store the first translation
map.  If the column order between the parent and its child is the same,
then no map ever needs to be stored, so these (possibly large) arrays
served no purpose.  Since we now always initialize the child-to-parent
map whenever transition capture is required, we no longer need the
child_parent_map_not_required array.  Previously that array was only
required so we could determine whether no map was required or the map had
not yet been initialized.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution.  This was down to the cleanup loop over the
ResultRelInfos having to walk an array which contained mostly NULLs that
had to be skipped.  Performance of this has now improved as the array we
loop over no longer contains NULL values to skip.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  46 +-
 src/backend/executor/execPartition.c          | 843 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 105 +---
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 166 +++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 9 files changed, 657 insertions(+), 571 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..b157238526 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2696,17 +2689,11 @@ CopyFrom(CopyState cstate)
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2790,15 +2777,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2848,8 +2827,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+							PartitionChild2ParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2866,7 +2844,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = PartitionParent2ChildMap(proute, leaf_part_index);
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 0bcb2377c3..0274bf0a7e 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,6 +31,7 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
+
+/* An entry of PartitionTupleRouting.subplan_resultrel_hash */
+typedef struct SubplanResultRelHashElem
+{
+	Oid			relid;			/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
@@ -45,9 +46,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For a leaf partition,
+ *				the index into the parent PartitionTupleRouting's
+ *				'partitions' array is stored.  When the partition is itself
+ *				a partitioned table, we store the index into the parent
+ *				PartitionTupleRouting's 'partition_dispatch_info' array.  An
+ *				index of -1 means we've not yet allocated anything in
+ *				PartitionTupleRouting for the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +63,20 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +103,112 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * on demand, only when we actually need to route a tuple to that
+	 * partition.  The reason for this is that a common case is for INSERT to
+	 * insert a single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays. More space can be allocated later, if required via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * Initially we need only set up one PartitionDispatch object: the one
+	 * for the partitioned table that's the target of the command.  If we
+	 * must route a tuple via some sub-partitioned table, then its
+	 * PartitionDispatch is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatch) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			update_rri_index++;
-		}
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->partition_tuple_slots = NULL;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+	{
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
+	}
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -236,9 +229,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +254,251 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  If it isn't set
+			 * yet, we must build a new ResultRelInfo for it below.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					SubplanResultRelHashElem *elem;
+					Oid			partoid = partdesc->oids[partidx];
+
+					elem = (SubplanResultRelHashElem *)
+						hash_search(proute->subplan_resultrel_hash,
+									&partoid, HASH_FIND, NULL);
+
+					if (elem != NULL)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = elem->rri;
+					}
+				}
+
+				/* We need to create a new one. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* A partition was not found. */
-	if (result < 0)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
+
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
+
+		if (!found)
+			elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->partition_tuple_slots != NULL)
+	{
+		proute->partition_tuple_slots = (TupleTableSlot **)
+			repalloc(proute->partition_tuple_slots,
+					 sizeof(TupleTableSlot *) * new_size);
+		memset(&proute->partition_tuple_slots[old_size], 0,
+			   sizeof(TupleTableSlot *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in proute's partitions array.
+ *		Return the index of the array element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +674,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +705,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +718,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +732,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +742,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionParent2ChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +756,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,12 +847,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -699,6 +864,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -709,29 +875,42 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
 
 	/*
-	 * If a partition has a different rowtype than the root parent, initialize
-	 * a slot dedicated to storing this partition's tuples.  The slot is used
-	 * for various operations that are applied to tuples after routing, such
-	 * as checking constraints.
+	 * If a partition has a different rowtype than the root parent, store the
+	 * translation map and initialize a slot dedicated to storing this
+	 * partition's tuples.  The slot is used for various operations that are
+	 * applied to tuples after routing, such as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (map)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+
+		/*
+		 * Initialize the array in proute where these slots are stored, if not
+		 * already done.
+		 */
 		if (proute->partition_tuple_slots == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
 			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
+				palloc0(sizeof(TupleTableSlot *) * size);
+		}
 
 		/*
 		 * Initialize the slot itself setting its descriptor to this
@@ -741,6 +920,35 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		proute->partition_tuple_slots[partidx] =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
 	}
 
 	/*
@@ -757,67 +965,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record the index of that slot in the partidx element
+ *		of the 'parent_pd' indexes[] array so that we can properly
+ *		retrieve the newly created PartitionDispatch later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1059,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -852,10 +1081,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -864,21 +1089,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -890,144 +1113,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 528f58717e..4fc965110c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1665,7 +1664,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1768,7 +1759,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+									PartitionChild2ParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1783,13 +1774,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+									PartitionChild2ParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = PartitionParent2ChildMap(proute, partidx);
 	if (map != NULL)
 	{
 		TupleTableSlot *new_slot;
@@ -1834,17 +1825,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1846,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..e09f94ebf3 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -22,71 +22,119 @@
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							pointers to PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for new
+ *							PartitionDispatch objects that need to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							that need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'partition_tuple_slots' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated it's set
+ *							to NULL.
+ *
+ * partition_tuple_slots	Array of TupleTableSlot objects; if non-NULL,
+ *							contains one entry for every leaf partition,
+ *							of which only those of the leaf partitions
+ *							whose attribute numbers differ from the root
+ *							parent have a non-NULL value.  NULL if all of
+ *							the partitions encountered by a given command
+ *							happen to have same rowtype as the root parent
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, when the
+ *							two rowtypes differ.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
 	TupleTableSlot **partition_tuple_slots;
 	TupleTableSlot *root_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionChild2ParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionParent2ChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -175,22 +223,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1
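
In outline, the tuple-routing control flow this patch establishes looks
like the following sketch (condensed from the hunks above; locking, error
handling and ON CONFLICT details are elided, and mtstate, rootRel,
targetRelInfo, slot and estate stand in for the caller's variables):

    PartitionTupleRouting *proute;
    int			partidx;
    ResultRelInfo *partrel;
    TupleConversionMap *map;

    /* Locks the partition tree; builds only the root's PartitionDispatch. */
    proute = ExecSetupPartitionTupleRouting(mtstate, rootRel);

    /*
     * Descends the partition hierarchy for the tuple in 'slot'.  Any
     * PartitionDispatch or ResultRelInfo not yet built is created on
     * demand, so the returned index always points at an initialized
     * element of proute->partitions.
     */
    partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
    partrel = proute->partitions[partidx];

    /* NULL when the partition's rowtype matches the root's. */
    map = PartitionParent2ChildMap(proute, partidx);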

#43Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#42)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/01 8:58, David Rowley wrote:
> On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Aug 22, 2018 at 8:30 AM David Rowley
>> <david.rowley@2ndquadrant.com> wrote:
>>> On 22 August 2018 at 19:08, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>>> +#define PartitionTupRoutingGetToParentMap(p, i) \
>>>> +#define PartitionTupRoutingGetToChildMap(p, i) \
>>>>
>>>> If the "Get" could be replaced by "Child" and "Parent", respectively,
>>>> they'd sound more meaningful, imho.
>>>
>>> I did that to save 3 chars. I think putting the additional
>>> Child/Parent in the name is not really required. It's not as if we're
>>> going to have a ParentToParent or a ChildToChild map, so I thought it
>>> might be okay to assume that if it's "ToParent", that it's being
>>> converted from the child and "ToChild" seems safe to assume it's being
>>> converted from the parent. I can change it though if you feel very
>>> strongly that what I've got is no good.
>>
>> I'm not sure exactly what is best here, but it seems unlikely to me
>> that somebody is going to read that macro name and think, oh, that
>> means "get the to-parent map". They are more likely to be confused by
>> the juxtaposition of "get" and "to".
>>
>> I think a better way to shorten the name would be to truncate the
>> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.
>
> Thanks for chipping in on this.
>
> I agree. I don't think "TupRouting" really needs to be in the name.
> Probably "To" can also just become "2" and we can put back the
> Parent/Child before that.

Agree that "TupRouting" can go, but "To" is not too long for using "2"
instead of it.

Thanks,
Amit

#44David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#43)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 1 November 2018 at 13:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2018/11/01 8:58, David Rowley wrote:
>> On 1 November 2018 at 06:45, Robert Haas <robertmhaas@gmail.com> wrote:
>>> I think a better way to shorten the name would be to truncate the
>>> PartitionTupRouting() prefix in some way, e.g. dropping TupRouting.
>>
>> Thanks for chipping in on this.
>>
>> I agree. I don't think "TupRouting" really needs to be in the name.
>> Probably "To" can also just become "2" and we can put back the
>> Parent/Child before that.
>
> Agree that "TupRouting" can go, but "To" is not too long for using "2"
> instead of it.

Okay. Here's a version with "2" put back to "To"...

It's great to know the patch is now so perfect that we've only the
macro naming left to debate ;-)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
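
For reference, after this rename the two NULL-safe accessors presumably
end up as below in v12 (reconstructed from the v11 definitions quoted
earlier and the call sites visible in the attached patch; the exact
header hunk is not shown here):

    /* Beware of multiple evaluations of p! */
    #define PartitionChildToParentMap(p, i) \
        ((p)->child_parent_tupconv_maps != NULL ? \
         (p)->child_parent_tupconv_maps[(i)] : NULL)

    #define PartitionParentToChildMap(p, i) \
        ((p)->parent_child_tupconv_maps != NULL ? \
         (p)->parent_child_tupconv_maps[(i)] : NULL)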

Attachments:

v12-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From 8e24aadc0b83782220e2dd7d680e2b8385de3bd1 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v12] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partition's ResultRelInfo and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partition's ResultRelInfos are initialized
rather than in the same order as partdesc.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This patch just removes the other slow
parts.

Initialization of the parent-to-child and child-to-parent translation map
arrays is now only performed when we need to store the first translation
map.  If the column order between the parent and each of its children is
the same, then no map ever needs to be stored, so previously these
(possibly large) arrays did nothing useful.  Since we now always
initialize the child-to-parent map whenever transition capture is
required, we no longer need the child_parent_map_not_required array.
Previously that array was only needed so we could tell whether no map was
required or the map simply had not yet been initialized.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, executor shutdown was also slow in comparison to the
actual execution.  This was down to the ResultRelInfo cleanup loop having
to walk an array that contained mostly NULLs which had to be skipped.
Performance of this has now improved since the array we loop over no
longer contains NULL values that must be skipped.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  46 +-
 src/backend/executor/execPartition.c          | 843 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 105 +---
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 166 +++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 9 files changed, 657 insertions(+), 571 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..0b0696e61e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2696,17 +2689,11 @@ CopyFrom(CopyState cstate)
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2790,15 +2777,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2848,8 +2827,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionChildToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2866,7 +2844,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = PartitionParentToChildMap(proute, leaf_part_index);
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 0bcb2377c3..542578102f 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,6 +31,7 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
@@ -45,9 +46,13 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For a leaf partition we
+ *				store the index into the parent PartitionTupleRouting's
+ *				'partitions' array.  When the partition is itself a
+ *				partitioned table, we store the index into the parent
+ *				PartitionTupleRouting's 'partition_dispatch_info' array.  An
+ *				index of -1 means we've not yet allocated anything in
+ *				PartitionTupleRouting for the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +63,20 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +103,112 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * on demand, only when we actually need to route a tuple to that
+	 * partition.  The reason for this is that a common case is for INSERT to
+	 * insert a single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays.  More space can be allocated later, if required, via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * Initially we need only set up one PartitionDispatch object: the one
+	 * for the partitioned table that's the target of the command.  If we
+	 * must route a tuple via some sub-partitioned table, then the
+	 * PartitionDispatch for that table is only built the first time it's
+	 * required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			update_rri_index++;
-		}
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->partition_tuple_slots = NULL;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass the
+	 * parent as NULL since we don't need to care about any parent of the
+	 * target partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+	{
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
+	}
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -236,9 +229,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +254,251 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array.  If a ResultRelInfo
+			 * has not been built for it yet, we do that below.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					SubplanResultRelHashElem *elem;
+					Oid			partoid = partdesc->oids[partidx];
+
+					elem = (SubplanResultRelHashElem *)
+						hash_search(proute->subplan_resultrel_hash,
+									&partoid, HASH_FIND, NULL);
+
+					if (elem)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = elem->rri;
+					}
+					}
+				}
+
+				/* We need to create a new one. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* A partition was not found. */
-	if (result < 0)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
+
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
+
+		if (!found)
+			elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->partition_tuple_slots != NULL)
+	{
+		proute->partition_tuple_slots = (TupleTableSlot **)
+			repalloc(proute->partition_tuple_slots,
+					 sizeof(TupleTableSlot *) * new_size);
+		memset(&proute->partition_tuple_slots[old_size], 0,
+			   sizeof(TupleTableSlot *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in proute's partitions array.
+ *		Return the index of the array element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +674,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +705,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +718,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +732,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +742,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionParentToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +756,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,12 +847,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -699,6 +864,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -709,29 +875,42 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
 
 	/*
-	 * If a partition has a different rowtype than the root parent, initialize
-	 * a slot dedicated to storing this partition's tuples.  The slot is used
-	 * for various operations that are applied to tuples after routing, such
-	 * as checking constraints.
+	 * If a partition has a different rowtype than the root parent, store the
+	 * translation map and initialize a slot dedicated to storing this
+	 * partition's tuples.  The slot is used for various operations that are
+	 * applied to tuples after routing, such as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (map)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
+		/* Allocate parent child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+
+		/*
+		 * Initialize the array in proute where these slots are stored, if not
+		 * already done.
+		 */
 		if (proute->partition_tuple_slots == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
 			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
+				palloc0(sizeof(TupleTableSlot *) * size);
+		}
 
 		/*
 		 * Initialize the slot itself setting its descriptor to this
@@ -741,6 +920,35 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		proute->partition_tuple_slots[partidx] =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate child parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
 	}
 
 	/*
@@ -757,67 +965,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record that index in the partidx element of the
+ *		'parent_pd' indexes[] array so that the newly created
+ *		PartitionDispatch can be found again later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1059,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -852,10 +1081,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processsing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
 		/* Allow any FDWs to shut down if they've been exercised */
 		if (resultRelInfo->ri_PartitionReadyForRouting &&
 			resultRelInfo->ri_FdwRoutine != NULL &&
@@ -864,21 +1089,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -890,144 +1113,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 528f58717e..840b98811f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1665,7 +1664,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1768,7 +1759,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+									PartitionChildToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1783,13 +1774,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+								PartitionChildToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = PartitionParentToChildMap(proute, partidx);
 	if (map != NULL)
 	{
 		TupleTableSlot *new_slot;
@@ -1834,17 +1825,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1846,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..45d5f6a8d0 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -22,71 +22,119 @@
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*
+ * SubplanResultRelHashElem
+ *		Hash table entry used by tuple routing to map a partition Oid to the
+ *		ResultRelInfo of the ModifyTable subplan for that partition.  The Oid
+ *		key must be the first field.
+ */
+typedef struct SubplanResultRelHashElem
+{
+	Oid			relid;			/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
+
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to a PartitionDispatch object for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for a
+ *							new PartitionDispatch that needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps' and
+ *							'partition_tuple_slots' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated it's
+ *							set to NULL.
+ *
+ * partition_tuple_slots	Array of TupleTableSlot objects; if non-NULL,
+ *							contains one entry for every leaf partition,
+ *							of which only those of the leaf partitions
+ *							whose attribute numbers differ from the root
+ *							parent have a non-NULL value.  NULL if all of
+ *							the partitions encountered by a given command
+ *							happen to have same rowtype as the root parent
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							source leaf partition's rowtype, when the two
+ *							rowtypes differ.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
 	TupleTableSlot **partition_tuple_slots;
 	TupleTableSlot *root_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionChildToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionParentToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -175,22 +223,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#45Gavin Flower
GavinFlower@archidevsys.co.nz
In reply to: David Rowley (#44)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 01/11/2018 14:30, David Rowley wrote:

On 1 November 2018 at 13:35, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/11/01 8:58, David Rowley wrote:

[...]

I agree. I don't think "TupRouting" really needs to be in the name.
Probably "To" can also just become "2" and we can put back the
Parent/Child before that.

Agree that "TupRouting" can go, but "To" is not too long for using "2"
instead of it.

I think that while '2' may only be one character less than 'to', the
character '2' stands out more.  However, can't say I could claim this
was of the utmost importance!

Okay. Here's a version with "2" put back to "To"...

[...]

#46Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#44)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/01 10:30, David Rowley wrote:

It's great to know the patch is now so perfect that we've only the
macro naming left to debate ;-)

I looked over v12 again and noticed a couple minor issues.

+ *              table then we store the index into parenting
+ *              PartitionTupleRouting 'partition_dispatch_info' array.  An

s/PartitionTupleRouting/PartitionTupleRouting's/g

Also, I got a bit concerned about "parenting". Does it mean something
like "enclosing", because the PartitionDispatch is a member of
PartitionTupleRouting? I got concerned because using "parent" like this
may be confusing as this is the partitioning code we're talking about,
where "parent" is generally used to mean "parent" table.

+     * the partitioned table that's the target of the command.  If we must
+     * route tuple via some sub-partitioned table, then the PartitionDispatch
+     * for those is only built the first time it's required.

... via some sub-partitioned table"s"

Or perhaps rewrite a bit as:

If we must route the tuple via some sub-partitioned table, then its
PartitionDispatch is built the first time it's required.

The macro naming discussion got me thinking today about the macro itself.
It encapsulates access to the various PartitionTupleRouting arrays
containing the maps, but maybe we've got the interface of tuple routing a
bit (maybe a lot given this thread!) wrong to begin with. Instead of
ExecFindPartition returning indexes into various PartitionTupleRouting
arrays and its callers then using those indexes to fetch various objects
from those arrays, why doesn't it return those objects itself? Although
we made sure that the callers don't need to worry about the meaning of
these indexes changing with this patch, it still seems a bit odd for them
to have to go back to those arrays to get various objects.

How about we change ExecFindPartition's interface so that it returns the
ResultRelInfo, the two maps, and the partition slot? So, the arrays
simply become a cache for ExecFindPartition et al and are no longer
accessed outside execPartition.c. Although that makes the interface of
ExecFindPartition longer, I think it reduces overall complexity.
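
To make that concrete, the declaration would become something like this
(modulo naming; the attached patch is authoritative):

    extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
                          ResultRelInfo *rootResultRelInfo,
                          PartitionTupleRouting *proute,
                          TupleTableSlot *slot, EState *estate,
                          TupleConversionMap **parent_to_child_map,
                          TupleTableSlot **partition_slot,
                          TupleConversionMap **child_to_parent_map);

Callers get back the partition's ResultRelInfo directly, along with the two
conversion maps and the dedicated partition slot (each may be NULL when not
needed), instead of an index to look those objects up with.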

I've implemented that in the attached patch
1-revise-ExecFindPartition-interface.patch.

Also, since all members of PartitionTupleRouting except root_tuple_slot are
accessed only within execPartition.c, we can move the struct definition into
execPartition.c to make its internals private, after doing something about
root_tuple_slot.  Looking at the code related to root_tuple_slot, it seems
the field really belongs in ModifyTableState, because it has nothing to do
with routing.
Attached 2-make-PartitionTupleRouting-private.patch does that.
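
In outline, execPartition.h would then expose only an opaque declaration,
with the struct body moved into execPartition.c, and ModifyTableState would
grow a slot field along these lines (the field name here is just a sketch;
see the attached patch for the real thing):

    /* execPartition.h: the struct becomes opaque to callers */
    typedef struct PartitionTupleRouting PartitionTupleRouting;

    /* execnodes.h: ModifyTableState holds the transient slot instead */
    TupleTableSlot *mt_root_tuple_slot; /* for tuples being moved across
                                         * partitions, in root rowtype */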

These patches 1 and 2 apply on top of v12-0001.. patch.

Thanks,
Amit

Attachments:

1-revise-ExecFindPartition-interface.patchtext/plain; charset=UTF-8; name=1-revise-ExecFindPartition-interface.patchDownload
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 0b0696e61e..b45972682f 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate)
 	bool	   *nulls;
 	ResultRelInfo *resultRelInfo;
 	ResultRelInfo *target_resultRelInfo;
+	ResultRelInfo *prev_part_rel = NULL;
 	EState	   *estate = CreateExecutorState(); /* for ExecConstraints() */
 	ModifyTableState *mtstate;
 	ExprContext *econtext;
@@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate)
 	CopyInsertMethod insertMethod;
 	uint64		processed = 0;
 	int			nBufferedTuples = 0;
-	int			prev_leaf_part_index = -1;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -2685,19 +2685,24 @@ CopyFrom(CopyState cstate)
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
-			int			leaf_part_index;
-			TupleConversionMap *map;
+			TupleTableSlot *partition_slot = NULL;
+			TupleConversionMap *child_to_parent_map,
+							   *parent_to_child_map;
 
 			/*
 			 * Attempt to find a partition suitable for this tuple.
 			 * ExecFindPartition() will raise an error if none can be found.
+			 * This replaces the original target ResultRelInfo with the
+			 * partition's.
 			 */
-			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
-												proute, slot, estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < proute->num_partitions);
+			resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo,
+											  proute, slot, estate,
+											  &parent_to_child_map,
+											  &partition_slot,
+											  &child_to_parent_map);
+			Assert(resultRelInfo != NULL);
 
-			if (prev_leaf_part_index != leaf_part_index)
+			if (prev_part_rel != resultRelInfo)
 			{
 				/* Check if we can multi-insert into this partition */
 				if (insertMethod == CIM_MULTI_CONDITIONAL)
@@ -2710,12 +2715,9 @@ CopyFrom(CopyState cstate)
 					if (nBufferedTuples > 0)
 					{
 						ExprContext *swapcontext;
-						ResultRelInfo *presultRelInfo;
-
-						presultRelInfo = proute->partitions[prev_leaf_part_index];
 
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-											presultRelInfo, myslot, bistate,
+											prev_part_rel, myslot, bistate,
 											nBufferedTuples, bufferedTuples,
 											firstBufferedLineNo);
 						nBufferedTuples = 0;
@@ -2772,13 +2774,6 @@ CopyFrom(CopyState cstate)
 					}
 				}
 
-				/*
-				 * Overwrite resultRelInfo with the corresponding partition's
-				 * one.
-				 */
-				resultRelInfo = proute->partitions[leaf_part_index];
-				Assert(resultRelInfo != NULL);
-
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 											  resultRelInfo->ri_TrigDesc->trig_insert_before_row);
@@ -2804,7 +2799,7 @@ CopyFrom(CopyState cstate)
 				 * buffer when the partition being inserted into changes.
 				 */
 				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
+				prev_part_rel = resultRelInfo;
 			}
 
 			/*
@@ -2826,8 +2821,7 @@ CopyFrom(CopyState cstate)
 					 * tuplestore format.
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
-					cstate->transition_capture->tcs_map =
-						PartitionChildToParentMap(proute, leaf_part_index);
+					cstate->transition_capture->tcs_map = child_to_parent_map;
 				}
 				else
 				{
@@ -2844,16 +2838,13 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = PartitionParentToChildMap(proute, leaf_part_index);
-			if (map != NULL)
+			if (parent_to_child_map != NULL)
 			{
-				TupleTableSlot *new_slot;
 				MemoryContext oldcontext;
 
-				Assert(proute->partition_tuple_slots != NULL &&
-					   proute->partition_tuple_slots[leaf_part_index] != NULL);
-				new_slot = proute->partition_tuple_slots[leaf_part_index];
-				slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
+				Assert(partition_slot != NULL);
+				slot = execute_attr_map_slot(parent_to_child_map->attrMap,
+											 slot, partition_slot);
 
 				/*
 				 * Get the tuple in the per-tuple context, so that it will be
@@ -2997,12 +2988,8 @@ CopyFrom(CopyState cstate)
 	{
 		if (insertMethod == CIM_MULTI_CONDITIONAL)
 		{
-			ResultRelInfo *presultRelInfo;
-
-			presultRelInfo = proute->partitions[prev_leaf_part_index];
-
 			CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-								presultRelInfo, myslot, bistate,
+								prev_part_rel, myslot, bistate,
 								nBufferedTuples, bufferedTuples,
 								firstBufferedLineNo);
 		}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 542578102f..4b27874f01 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -70,11 +70,16 @@ typedef struct PartitionDispatchData
 static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 							   PartitionTupleRouting *proute);
 static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
-static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
 					  EState *estate,
 					  PartitionDispatch dispatch, int partidx);
+static void ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					PartitionTupleRouting *proute,
+					ResultRelInfo *partRelInfo,
+					int partidx);
 static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
 							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
@@ -194,14 +199,25 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the index of the leaf partition's
- * ResultRelInfo in the proute->partitions array.
+ * error message, else it returns the leaf partition's ResultRelInfo.
+ *
+ * *parent_to_child_map is set if the parent tuples would need to be converted
+ * before inserting into the chosen partition.  In that case,
+ * *partition_tuple_slot is also set.
+ *
+ * *child_to_parent_map is set if the tuples inserted into the partition after
+ * routing would need to be converted back to parent's rowtype for storing
+ * into the transition tuple store, that is, only if transition capture is
+ * active for the command.
  */
-int
+ResultRelInfo *
 ExecFindPartition(ModifyTableState *mtstate,
 				  ResultRelInfo *resultRelInfo,
 				  PartitionTupleRouting *proute,
-				  TupleTableSlot *slot, EState *estate)
+				  TupleTableSlot *slot, EState *estate,
+				  TupleConversionMap **parent_to_child_map,
+				  TupleTableSlot **partition_slot,
+				  TupleConversionMap **child_to_parent_map)
 {
 	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
@@ -214,6 +230,9 @@ ExecFindPartition(ModifyTableState *mtstate,
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
 
+	*parent_to_child_map = *child_to_parent_map = NULL;
+	*partition_slot = NULL;
+
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
 
@@ -274,7 +293,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		if (partdesc->is_leaf[partidx])
 		{
-			int			result = -1;
+			int			index = -1;
+			ResultRelInfo *result = NULL;
 
 			/*
 			 * Get this leaf partition's index in the
@@ -285,7 +305,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 			{
 				/* ResultRelInfo already built */
 				Assert(dispatch->indexes[partidx] < proute->num_partitions);
-				result = dispatch->indexes[partidx];
+				index = dispatch->indexes[partidx];
+				result = proute->partitions[index];
 			}
 			else
 			{
@@ -296,36 +317,58 @@ ExecFindPartition(ModifyTableState *mtstate,
 				 */
 				if (proute->subplan_resultrel_hash)
 				{
-					ResultRelInfo *rri;
 					Oid			partoid = partdesc->oids[partidx];
 
-					rri = hash_search(proute->subplan_resultrel_hash,
-									  &partoid, HASH_FIND, NULL);
+					result = hash_search(proute->subplan_resultrel_hash,
+										 &partoid, HASH_FIND, NULL);
 
-					if (rri)
+					if (result)
 					{
-						result = proute->num_partitions++;
-						dispatch->indexes[partidx] = result;
+						index = proute->num_partitions++;
+						dispatch->indexes[partidx] = index;
 
 
 						/* Allocate more space in the arrays, if required */
-						if (result >= proute->partitions_allocsize)
+						if (index >= proute->partitions_allocsize)
 							ExecExpandRoutingArrays(proute);
 
 						/* Save here for later use. */
-						proute->partitions[result] = rri;
+						proute->partitions[index] = result;
+
+						/*
+						 * We need to make this result rel "routable" if it's
+						 * the first time it is being used for routing.  Also,
+						 * we would've only checked if the relation is a valid
+						 * target for UPDATE when creating this ResultRelInfo
+						 * and now we're about to insert the routed tuple into
+						 * it, so we need to check if it's a valid target for
+						 * INSERT as well.
+						 */
+						if (!result->ri_PartitionReadyForRouting)
+						{
+							CheckValidResultRel(result, CMD_INSERT);
+
+							/*
+							 * Also set up information needed for routing
+							 * tuples to the partition.
+							 */
+							ExecInitRoutingInfo(mtstate, estate, proute,
+												result, index);
+						}
 					}
 				}
 
 				/* We need to create a new one. */
-				if (result < 0)
+				if (result == NULL)
 				{
 					MemoryContextSwitchTo(oldcxt);
 					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
 												   proute, estate,
 												   dispatch, partidx);
 					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
-					Assert(result >= 0 && result < proute->num_partitions);
+					Assert(result != NULL);
+					index = dispatch->indexes[partidx];
+					Assert(index >= 0 && index < proute->num_partitions);
 				}
 			}
 
@@ -335,6 +378,26 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 			MemoryContextSwitchTo(oldcxt);
 			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+
+			/* Set the values of result maps and the slot if needed. */
+			if (proute->parent_child_tupconv_maps)
+			{
+				*parent_to_child_map = proute->parent_child_tupconv_maps[index];
+
+				/*
+				 * If non-NULL, the corresponding tuple slot must have been
+				 * initialized for the partition.
+				 */
+				if (*parent_to_child_map != NULL)
+				{
+					*partition_slot = proute->partition_tuple_slots[index];
+					Assert(*partition_slot != NULL);
+				}
+			}
+
+			if (proute->child_parent_tupconv_maps)
+				*child_to_parent_map = proute->child_parent_tupconv_maps[index];
+
 			return result;
 		}
 		else
@@ -374,6 +437,9 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 		}
 	}
+
+	Assert(false);
+	return NULL;	/* keep compiler quiet */
 }
 
 /*
@@ -475,9 +541,10 @@ ExecExpandRoutingArrays(PartitionTupleRouting *proute)
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
  *		and store it in the next empty slot in proute's partitions array.
- *		Return the index of the array element.
+ *
+ * Returns the ResultRelInfo
  */
-static int
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
@@ -742,9 +809,10 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
-			TupleConversionMap *map;
+			TupleConversionMap *map = NULL;
 
-			map = PartitionParentToChildMap(proute, part_result_rel_index);
+			if (proute->parent_child_tupconv_maps)
+				map = proute->parent_child_tupconv_maps[part_result_rel_index];
 
 			Assert(node->onConflictSet != NIL);
 			Assert(rootResultRelInfo->ri_onConflict != NULL);
@@ -849,14 +917,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 
 	MemoryContextSwitchTo(oldContext);
 
-	return part_result_rel_index;
+	return leaf_part_rri;
 }
 
 /*
  * ExecInitRoutingInfo
  *		Set up information needed for routing tuples to a leaf partition
  */
-void
+static void
 ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 840b98811f..18c55e2b19 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1697,46 +1697,22 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						TupleTableSlot *slot)
 {
 	ModifyTable *node;
-	int			partidx;
 	ResultRelInfo *partrel;
 	HeapTuple	tuple;
-	TupleConversionMap *map;
+	TupleTableSlot *partition_slot = NULL;
+	TupleConversionMap *child_to_parent_map,
+					   *parent_to_child_map;
 
 	/*
 	 * Determine the target partition.  If ExecFindPartition does not find a
-	 * partition after all, it doesn't return here; otherwise, the returned
-	 * value is to be used as an index into the arrays for the ResultRelInfo
-	 * and TupleConversionMap for the partition.
+	 * partition after all, it doesn't return here.
 	 */
-	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
-	Assert(partidx >= 0 && partidx < proute->num_partitions);
-
-	Assert(proute->partitions[partidx] != NULL);
-	/* Get the ResultRelInfo corresponding to the selected partition. */
-	partrel = proute->partitions[partidx];
+	partrel = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate,
+								&parent_to_child_map, &partition_slot,
+								&child_to_parent_map);
 	Assert(partrel != NULL);
 
 	/*
-	 * Check whether the partition is routable if we didn't yet
-	 *
-	 * Note: an UPDATE of a partition key invokes an INSERT that moves the
-	 * tuple to a new partition.  This check would be applied to a subplan
-	 * partition of such an UPDATE that is chosen as the partition to route
-	 * the tuple to.  The reason we do this check here rather than in
-	 * ExecSetupPartitionTupleRouting is to avoid aborting such an UPDATE
-	 * unnecessarily due to non-routable subplan partitions that may not be
-	 * chosen for update tuple movement after all.
-	 */
-	if (!partrel->ri_PartitionReadyForRouting)
-	{
-		/* Verify the partition is a valid target for INSERT. */
-		CheckValidResultRel(partrel, CMD_INSERT);
-
-		/* Set up information needed for routing tuples to the partition. */
-		ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
-	}
-
-	/*
 	 * Make it look like we are inserting into the partition.
 	 */
 	estate->es_result_relation_info = partrel;
@@ -1758,8 +1734,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 * to be ready to convert their result back to tuplestore format.
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
-			mtstate->mt_transition_capture->tcs_map =
-									PartitionChildToParentMap(proute, partidx);
+			mtstate->mt_transition_capture->tcs_map = child_to_parent_map;
 		}
 		else
 		{
@@ -1772,23 +1747,17 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 		}
 	}
 	if (mtstate->mt_oc_transition_capture != NULL)
-	{
-		mtstate->mt_oc_transition_capture->tcs_map =
-								PartitionChildToParentMap(proute, partidx);
-	}
+		mtstate->mt_oc_transition_capture->tcs_map = child_to_parent_map;
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = PartitionParentToChildMap(proute, partidx);
-	if (map != NULL)
+	if (parent_to_child_map != NULL)
 	{
-		TupleTableSlot *new_slot;
 
-		Assert(proute->partition_tuple_slots != NULL &&
-			   proute->partition_tuple_slots[partidx] != NULL);
-		new_slot = proute->partition_tuple_slots[partidx];
-		slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
+		Assert(partition_slot != NULL);
+		slot = execute_attr_map_slot(parent_to_child_map->attrMap, slot,
+									 partition_slot);
 	}
 
 	/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 45d5f6a8d0..7c8314362c 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -122,20 +122,6 @@ typedef struct PartitionTupleRouting
 } PartitionTupleRouting;
 
 /*
- * Accessor macros for tuple conversion maps contained in
- * PartitionTupleRouting.  Beware of multiple evaluations of p!
- */
-#define PartitionChildToParentMap(p, i) \
-			((p)->child_parent_tupconv_maps != NULL ? \
-				(p)->child_parent_tupconv_maps[(i)] : \
-							NULL)
-
-#define PartitionParentToChildMap(p, i) \
-			((p)->parent_child_tupconv_maps != NULL ? \
-				(p)->parent_child_tupconv_maps[(i)] : \
-							NULL)
-
-/*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
  * for the topmost partition plus one for each non-leaf child partition.
@@ -223,20 +209,14 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ModifyTableState *mtstate,
+extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
 				  ResultRelInfo *resultRelInfo,
 				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
-				  EState *estate);
-extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
-					 ResultRelInfo *resultRelInfo,
-					 PartitionTupleRouting *proute,
-					 EState *estate, int partidx);
-extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx);
+				  EState *estate,
+				  TupleConversionMap **parent_to_child_map,
+				  TupleTableSlot **partition_slot,
+				  TupleConversionMap **child_to_parent_map);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
2-make-PartitionTupleRouting-private.patch (text/plain)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b45972682f..2b5e71b843 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2528,7 +2528,7 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+		proute = ExecSetupPartitionTupleRouting(mtstate, cstate->rel);
 
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 4b27874f01..0978a55f48 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -66,6 +66,98 @@ typedef struct PartitionDispatchData
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to a PartitionDispatch objects for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for new
+ *							PartitionDispatch objects which need to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to a ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							which need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps',
+ *							'child_parent_map_not_required' and
+ *							'partition_tuple_slots' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated it's set
+ *							to NULL.
+ *
+ * partition_tuple_slots	Array of TupleTableSlot objects; if non-NULL,
+ *							contains one entry for every leaf partition,
+ *							of which only those of the leaf partitions
+ *							whose attribute numbers differ from the root
+ *							parent have a non-NULL value.  NULL if all of
+ *							the partitions encountered by a given command
+ *							happen to have the same rowtype as the root parent
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	Relation	partition_root;
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	int			dispatch_allocsize;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	int			partitions_allocsize;
+	TupleConversionMap **parent_child_tupconv_maps;
+	TupleConversionMap **child_parent_tupconv_maps;
+	TupleTableSlot **partition_tuple_slots;
+	HTAB	   *subplan_resultrel_hash;
+} PartitionTupleRouting;
 
 static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 							   PartitionTupleRouting *proute);
@@ -179,12 +271,12 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	if (node && node->operation == CMD_UPDATE)
 	{
 		ExecHashSubPlanResultRelsByOid(mtstate, proute);
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
 	}
 	else
 	{
 		proute->subplan_resultrel_hash = NULL;
-		proute->root_tuple_slot = NULL;
+		mtstate->mt_root_tuple_slot = NULL;
 	}
 
 	return proute;
@@ -1175,10 +1267,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
-
-	/* Release the standalone partition tuple descriptors, if any */
-	if (proute->root_tuple_slot)
-		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
 /* ----------------
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 18c55e2b19..390191bdc8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1161,8 +1161,8 @@ lreplace:;
 			Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
 			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
 			if (tupconv_map != NULL)
-				slot = execute_attr_map_slot(tupconv_map->attrMap,
-											 slot, proute->root_tuple_slot);
+				slot = execute_attr_map_slot(tupconv_map->attrMap, slot,
+											 mtstate->mt_root_tuple_slot);
 
 			/*
 			 * Prepare for tuple routing, making it look like we're inserting
@@ -2616,6 +2616,10 @@ ExecEndModifyTable(ModifyTableState *node)
 	 */
 	for (i = 0; i < node->mt_nplans; i++)
 		ExecEndNode(node->mt_plans[i]);
+
+	/* Release the standalone partition tuple descriptors, if any */
+	if (node->mt_root_tuple_slot)
+		ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
 }
 
 void
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 7c8314362c..91886f1b19 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -18,108 +18,9 @@
 #include "nodes/plannodes.h"
 #include "partitioning/partprune.h"
 
-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
-
-/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to
- * route a tuple inserted into a partitioned table to one of its leaf
- * partitions
- *
- * partition_root			The partitioned table that's the target of the
- *							command.
- *
- * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
- *							a pointer to a PartitionDispatch objects for every
- *							partitioned table touched by tuple routing.  The
- *							entry for the target partitioned table is *always*
- *							present in the 0th element of this array.  See
- *							comment for PartitionDispatchData->indexes for
- *							details on how this array is indexed.
- *
- * num_dispatch				The current number of items stored in the
- *							'partition_dispatch_info' array.  Also serves as
- *							the index of the next free array element for new
- *							PartitionDispatch which need to be stored.
- *
- * dispatch_allocsize		The current allocated size of the
- *							'partition_dispatch_info' array.
- *
- * partitions				Array of 'partitions_allocsize' elements
- *							containing pointers to a ResultRelInfos of all
- *							leaf partitions touched by tuple routing.  Some of
- *							these are pointers to ResultRelInfos which are
- *							borrowed out of 'subplan_resultrel_hash'.  The
- *							remainder have been built especially for tuple
- *							routing.  See comment for
- *							PartitionDispatchData->indexes for details on how
- *							this array is indexed.
- *
- * num_partitions			The current number of items stored in the
- *							'partitions' array.  Also serves as the index of
- *							the next free array element for new ResultRelInfos
- *							which need to be stored.
- *
- * partitions_allocsize		The current allocated size of the 'partitions'
- *							array.  Also, if they're non-NULL, marks the size
- *							of the 'parent_child_tupconv_maps',
- *							'child_parent_tupconv_maps',
- *							'child_parent_map_not_required' and
- *							'partition_tuple_slots' arrays.
- *
- * parent_child_tupconv_maps	Array of partitions_allocsize elements
- *							containing information on how to convert tuples of
- *							partition_root's rowtype to the rowtype of the
- *							corresponding partition as stored in 'partitions',
- *							or NULL if no conversion is required.  The entire
- *							array is only allocated when the first conversion
- *							map needs to stored.  When not allocated it's set
- *							to NULL.
- *
- * partition_tuple_slots	Array of TupleTableSlot objects; if non-NULL,
- *							contains one entry for every leaf partition,
- *							of which only those of the leaf partitions
- *							whose attribute numbers differ from the root
- *							parent have a non-NULL value.  NULL if all of
- *							the partitions encountered by a given command
- *							happen to have same rowtype as the root parent
- *
- * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
- *							conversion maps to translate partition tuples into
- *							partition_root's rowtype, needed if transition
- *							capture is active
- *
- * Note: The following fields are used only when UPDATE ends up needing to
- * do tuple routing.
- *
- * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
- *							This is used to cache ResultRelInfos from subplans
- *							of a ModifyTable node.  Some of these may be
- *							useful for tuple routing to save having to build
- *							duplicates.
- *
- * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
- *							used to transiently store a tuple using the root
- *							table's rowtype after converting it from the
- *							tuple's source leaf partition's rowtype.  That is,
- *							if leaf partition's rowtype is different.
- *-----------------------
- */
-typedef struct PartitionTupleRouting
-{
-	Relation	partition_root;
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;
-	int			dispatch_allocsize;
-	ResultRelInfo **partitions;
-	int			num_partitions;
-	int			partitions_allocsize;
-	TupleConversionMap **parent_child_tupconv_maps;
-	TupleConversionMap **child_parent_tupconv_maps;
-	TupleTableSlot **partition_tuple_slots;
-	TupleTableSlot *root_tuple_slot;
-	HTAB	   *subplan_resultrel_hash;
-} PartitionTupleRouting;
+typedef struct PartitionTupleRouting PartitionTupleRouting;
 
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 880a03e4e4..73ecc20074 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1084,6 +1084,14 @@ typedef struct ModifyTableState
 
 	/* Per plan map for tuple conversion from child to root */
 	TupleConversionMap **mt_per_subplan_tupconv_maps;
+
+	/*
+	 * During UPDATE tuple routing, this tuple slot is used to transiently
+	 * store a tuple using the root table's rowtype after converting it from
+	 * the tuple's source leaf partition's rowtype.  That is, if leaf
+	 * partition's rowtype is different.
+	 */
+	TupleTableSlot *mt_root_tuple_slot;
 } ModifyTableState;
 
 /* ----------------
#47David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#46)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 1 November 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/11/01 10:30, David Rowley wrote:

It's great to know the patch is now so perfect that we've only the
macro naming left to debate ;-)

I looked over v12 again and noticed a couple minor issues.

+ *              table then we store the index into parenting
+ *              PartitionTupleRouting 'partition_dispatch_info' array.  An

s/PartitionTupleRouting/PartitionTupleRouting's/g

Also, I got a bit concerned about "parenting". Does it mean something
like "enclosing", because the PartitionDispatch is a member of
PartitionTupleRouting? I got concerned because using "parent" like this
may be confusing as this is the partitioning code we're talking about,
where "parent" is generally used to mean "parent" table.

+     * the partitioned table that's the target of the command.  If we must
+     * route tuple via some sub-partitioned table, then the PartitionDispatch
+     * for those is only built the first time it's required.

... via some sub-partitioned table"s"

Or perhaps rewrite a bit as:

If we must route the tuple via some sub-partitioned table, then its
PartitionDispatch is built the first time it's required.

I've attached v13 which hopefully addresses these.

The macro naming discussion got me thinking today about the macro itself.
It encapsulates access to the various PartitionTupleRouting arrays
containing the maps, but maybe we've got the interface of tuple routing a
bit (maybe a lot given this thread!) wrong to begin with. Instead of
ExecFindPartition returning indexes into various PartitionTupleRouting
arrays and its callers then using those indexes to fetch various objects
from those arrays, why doesn't it return those objects itself? Although
we made sure that the callers don't need to worry about the meaning of
these indexes changing with this patch, it still seems a bit odd for them
to have to go back to those arrays to get various objects.

How about we change ExecFindPartition's interface so that it returns the
ResultRelInfo, the two maps, and the partition slot? So, the arrays
simply become a cache for ExecFindPartition et al and are no longer
accessed outside execPartition.c. Although that makes the interface of
ExecFindPartition longer, I think it reduces overall complexity.

I don't really think stuffing values into a bunch of output parameters
is much of an improvement. I'd rather allow callers to fetch what they
need using the index we return. Most callers don't need to know about
the child to parent maps, so it seems nicer for those places not to
have to invent a dummy variable to pass along to ExecFindPartition()
so it can needlessly populate it for them.
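
For example, here is a sketch (based on the v13 code, slightly
abbreviated) of an INSERT-path caller that only needs the
parent-to-child map:

    partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot,
                                estate);
    partrel = proute->partitions[partidx];

    /* Convert the tuple to the partition's rowtype, if necessary. */
    map = PartitionParentToChildMap(proute, partidx);
    if (map != NULL)
        slot = execute_attr_map_slot(map->attrMap, slot,
                                     proute->partition_tuple_slots[partidx]);

A caller like this never has to mention the child-to-parent map at all.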

Perhaps a better design would be, instead of having random special
partitioned-table-only fields in ResultRelInfo, to have an extra
struct there that contains the extra partition information, including
the translation maps, and then just return the ResultRelInfo and
allow callers to look up any extra details they require (a rough
sketch of what I mean is below). I've not looked into this in detail,
but I think the committer work that's required for the patch as it is
today is already quite significant. I'm not keen on warding any
willing committer off by making the commit job any harder. I agree
that it would be good to stabilise the API for all this partitioning
code sometime, but I don't believe it needs to be done all in one
commit. My intent here is to improve performance of INSERT and UPDATE
on partitioned tables. Perhaps we can shape some API redesign later
in the release cycle. What do you think?
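
To make that idea a bit more concrete, it might look something like the
sketch below. The struct and field names here are invented purely for
illustration; nothing like this exists in the attached patches:

    /* Hypothetical sketch only -- names are made up for illustration. */
    typedef struct PartitionRoutingInfo
    {
        /* map to convert root parent tuples to this partition's rowtype */
        TupleConversionMap *pi_RootToPartitionMap;

        /* map to convert partition tuples back to the root's rowtype */
        TupleConversionMap *pi_PartitionToRootMap;

        /* slot to hold a tuple after conversion to the partition's rowtype */
        TupleTableSlot *pi_PartitionTupleSlot;
    } PartitionRoutingInfo;

ResultRelInfo would then only gain a single pointer to one of these,
left NULL for result rels that aren't targets of tuple routing, and
ExecFindPartition() could keep returning just the ResultRelInfo.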

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v13-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From 2208fb195ed64fdd6ded17b51a2b92457f829f3a Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v13] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting. The
aim here is to get rid of all the possibly large arrays that were being
allocated during ExecSetupPartitionTupleRouting().  We now allocate
small arrays to store the partitions' ResultRelInfos and only enlarge
these when we run out of space.  The partitions array is now ordered
by the order in which the partitions' ResultRelInfos are initialized
rather than in the same order as partdesc.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This patch just removes the other slow
parts.

Initialization of the parent to child and child to parent translation map
arrays is now only performed when we need to store the first translation
map.  If the column order between the parent and its child is the same,
then no map ever needs to be stored and these (possibly large) arrays
previously served no purpose.  Since we now always initialize the child
to parent map whenever transition capture is required, we no longer need
the child_parent_map_not_required array.  Previously this was only
required so we could determine if no map was required or if the map had
not yet been initialized.

For simple INSERTs hitting a single partition of a partitioned table with
many partitions, the shutdown of the executor was also slow in comparison
to the actual execution.  This was down to the loop which cleans up each
ResultRelInfo having to iterate over an array which contained mostly
NULLs that had to be skipped.  Performance here has improved as the array
we loop over no longer contains NULL values to skip.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  46 +-
 src/backend/executor/execPartition.c          | 846 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 105 +---
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 166 +++--
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 9 files changed, 663 insertions(+), 568 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..0b0696e61e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2696,17 +2689,11 @@ CopyFrom(CopyState cstate)
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
+			leaf_part_index = ExecFindPartition(mtstate, target_resultRelInfo,
+												proute, slot, estate);
 			Assert(leaf_part_index >= 0 &&
 				   leaf_part_index < proute->num_partitions);
 
@@ -2790,15 +2777,7 @@ CopyFrom(CopyState cstate)
 				 * one.
 				 */
 				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
+				Assert(resultRelInfo != NULL);
 
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2848,8 +2827,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						PartitionChildToParentMap(proute, leaf_part_index);
 				}
 				else
 				{
@@ -2866,7 +2844,7 @@ CopyFrom(CopyState cstate)
 			 * We might need to convert from the parent rowtype to the
 			 * partition rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = PartitionParentToChildMap(proute, leaf_part_index);
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1e72e9fb3f..a1d8cb3fb9 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,10 +31,13 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * hierarchy required to route a tuple to any of its partitions.  A
+ * PartitionDispatch is always encapsulated inside a PartitionTupleRouting
+ * struct and stored inside its 'partition_dispatch_info' array.
  *
  *	reldesc		Relation descriptor of the table
  *	key			Partition key information of the table
@@ -45,9 +48,14 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For leaf partitions the
+ *				index into the encapsulating PartitionTupleRouting's
+ *				'partitions' array is stored.  When the partition is itself a
+ *				partitioned table then we store the index into the
+ *				encapsulating PartitionTupleRouting's
+ *				'partition_dispatch_info' array.  An index of -1 means we've
+ *				not yet allocated anything in PartitionTupleRouting for the
+ *				partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +66,20 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecExpandRoutingArrays(PartitionTupleRouting *proute);
+static int ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +106,114 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition.  The actual ResultRelInfos are allocated lazily by that
+ * function.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
 
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * on demand, only when we actually need to route a tuple to that
+	 * partition.  The reason for this is that a common case is for INSERT to
+	 * insert a single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective
+	 * arrays. More space can be allocated later, if required via
+	 * ExecExpandRoutingArrays.
+	 *
+	 * Initially we must only set up one PartitionDispatch object; the one for
+	 * the partitioned table that's the target of the command.  If we must
+	 * route a tuple via some sub-partitioned table, then its
+	 * PartitionDispatch is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			update_rri_index++;
-		}
+	/* We only allocate these arrays when we need to store the first map */
+	proute->parent_child_tupconv_maps = NULL;
+	proute->child_parent_tupconv_maps = NULL;
+	proute->partition_tuple_slots = NULL;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+	{
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
+	}
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find a leaf partition for the tuple contained in *slot.
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message, else it returns the index of the leaf partition's
+ * ResultRelInfo in the proute->partitions array.  This index can also be used
+ * to obtain the correct tuple translation map via the
+ * PartitionChildToParentMap and PartitionParentToChildMap macros.
  */
 int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -236,9 +234,10 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +259,251 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			int			result = -1;
+
+			/*
+			 * Get this leaf partition's index in the
+			 * PartitionTupleRouting->partitions array, building a new
+			 * ResultRelInfo for it first if one doesn't exist yet.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				result = dispatch->indexes[partidx];
+			}
+			else
+			{
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or create a
+				 * fresh one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					ResultRelInfo *rri;
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						result = proute->num_partitions++;
+						dispatch->indexes[partidx] = result;
+
+
+						/* Allocate more space in the arrays, if required */
+						if (result >= proute->partitions_allocsize)
+							ExecExpandRoutingArrays(proute);
+
+						/* Save here for later use. */
+						proute->partitions[result] = rri;
+					}
+				}
+
+				/* We need to create a new one. */
+				if (result < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					result = ExecInitPartitionInfo(mtstate, resultRelInfo,
+												   proute, estate,
+												   dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					Assert(result >= 0 && result < proute->num_partitions);
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return result;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/* Hash table entry used to map a partition Oid to a subplan's ResultRelInfo */
+typedef struct SubplanResultRelHashElem
+{
+	Oid			relid;			/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
 
-	/* A partition was not found. */
-	if (result < 0)
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
+
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
+
+		if (!found)
+			elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecExpandRoutingArrays
+ *		Double the size of the allocated arrays in 'proute'
+ */
+static void
+ExecExpandRoutingArrays(PartitionTupleRouting *proute)
+{
+	int			new_size = proute->partitions_allocsize * 2;
+	int			old_size = proute->partitions_allocsize;
 
-	return result;
+	proute->partitions_allocsize = new_size;
+
+	proute->partitions = (ResultRelInfo **)
+		repalloc(proute->partitions, sizeof(ResultRelInfo *) * new_size);
+
+	if (proute->parent_child_tupconv_maps != NULL)
+	{
+		proute->parent_child_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->parent_child_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->parent_child_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->child_parent_tupconv_maps != NULL)
+	{
+		proute->child_parent_tupconv_maps = (TupleConversionMap **)
+			repalloc(proute->child_parent_tupconv_maps,
+					 sizeof(TupleConversionMap *) * new_size);
+		memset(&proute->child_parent_tupconv_maps[old_size], 0,
+			   sizeof(TupleConversionMap *) * (new_size - old_size));
+	}
+
+	if (proute->partition_tuple_slots != NULL)
+	{
+		proute->partition_tuple_slots = (TupleTableSlot **)
+			repalloc(proute->partition_tuple_slots,
+					 sizeof(TupleTableSlot *) * new_size);
+		memset(&proute->partition_tuple_slots[old_size], 0,
+			   sizeof(TupleTableSlot *) * (new_size - old_size));
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
- *
- * Returns the ResultRelInfo
+ *		and store it in the next empty slot in proute's partitions array.
+ *		Return the index of the array element.
  */
-ResultRelInfo *
+static int
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +679,25 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	/* Allocate more space in the arrays, if required */
+	if (part_result_rel_index >= proute->partitions_allocsize)
+		ExecExpandRoutingArrays(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri,
+						part_result_rel_index);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +710,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +723,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +737,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +747,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = PartitionParentToChildMap(proute, part_result_rel_index);
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +761,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,12 +852,9 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
-	return leaf_part_rri;
+	return part_result_rel_index;
 }
 
 /*
@@ -699,6 +869,7 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 					int partidx)
 {
 	MemoryContext oldContext;
+	TupleConversionMap *map;
 
 	/*
 	 * Switch into per-query memory context.
@@ -709,29 +880,42 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	map = convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								 RelationGetDescr(partRelInfo->ri_RelationDesc),
+								 gettext_noop("could not convert row type"));
 
 	/*
-	 * If a partition has a different rowtype than the root parent, initialize
-	 * a slot dedicated to storing this partition's tuples.  The slot is used
-	 * for various operations that are applied to tuples after routing, such
-	 * as checking constraints.
+	 * If a partition has a different rowtype than the root parent, store the
+	 * translation map and initialize a slot dedicated to storing this
+	 * partition's tuples.  The slot is used for various operations that are
+	 * applied to tuples after routing, such as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (map)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
+		/* Allocate the parent-to-child map array only if we need to store a map */
+		if (proute->parent_child_tupconv_maps == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
+			proute->parent_child_tupconv_maps = (TupleConversionMap **)
+				palloc0(sizeof(TupleConversionMap *) * size);
+		}
+
+		proute->parent_child_tupconv_maps[partidx] = map;
+
+		/*
+		 * Initialize the array in proute where these slots are stored, if not
+		 * already done.
+		 */
 		if (proute->partition_tuple_slots == NULL)
+		{
+			int			size = proute->partitions_allocsize;
+
 			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
+				palloc0(sizeof(TupleTableSlot *) * size);
+		}
 
 		/*
 		 * Initialize the slot itself setting its descriptor to this
@@ -741,6 +925,35 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		proute->partition_tuple_slots[partidx] =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
+	}
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the parent's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		map =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+
+		/* Allocate the child-to-parent map array only if we need to store a map */
+		if (map)
+		{
+			if (proute->child_parent_tupconv_maps == NULL)
+			{
+				int			size;
+
+				size = proute->partitions_allocsize;
+				proute->child_parent_tupconv_maps = (TupleConversionMap **)
+					palloc0(sizeof(TupleConversionMap *) * size);
+			}
+
+			proute->child_parent_tupconv_maps[partidx] = map;
+		}
 	}
 
 	/*
@@ -757,67 +970,88 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record that index in the 'partidx' element of
+ *		'parent_pd's indexes[] array so that the newly created
+ *		PartitionDispatch can be retrieved later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches.
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1064,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -864,21 +1098,19 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
 		ExecCloseIndices(resultRelInfo);
@@ -890,144 +1122,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 528f58717e..840b98811f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1665,7 +1664,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1709,21 +1708,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	 * value is to be used as an index into the arrays for the ResultRelInfo
 	 * and TupleConversionMap for the partition.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
+	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
 	Assert(partidx >= 0 && partidx < proute->num_partitions);
 
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
+	/* Get the ResultRelInfo corresponding to the selected partition. */
 	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
+	Assert(partrel != NULL);
 
 	/*
 	 * Check whether the partition is routable if we didn't yet
@@ -1768,7 +1759,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+									PartitionChildToParentMap(proute, partidx);
 		}
 		else
 		{
@@ -1783,13 +1774,13 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+								PartitionChildToParentMap(proute, partidx);
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = PartitionParentToChildMap(proute, partidx);
 	if (map != NULL)
 	{
 		TupleTableSlot *new_slot;
@@ -1834,17 +1825,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1846,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of the their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..45d5f6a8d0 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -22,71 +22,119 @@
 typedef struct PartitionDispatchData *PartitionDispatch;
 
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							pointers to PartitionDispatch objects, one for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for new
+ *							PartitionDispatch objects that need to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfos
+ *							that need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.  Also, if they're non-NULL, marks the size
+ *							of the 'parent_child_tupconv_maps',
+ *							'child_parent_tupconv_maps',
+ *							'child_parent_map_not_required' and
+ *							'partition_tuple_slots' arrays.
+ *
+ * parent_child_tupconv_maps	Array of partitions_allocsize elements
+ *							containing information on how to convert tuples of
+ *							partition_root's rowtype to the rowtype of the
+ *							corresponding partition as stored in 'partitions',
+ *							or NULL if no conversion is required.  The entire
+ *							array is only allocated when the first conversion
+ *							map needs to be stored.  When not allocated it's set
+ *							to NULL.
+ *
+ * partition_tuple_slots	Array of 'partitions_allocsize' TupleTableSlot
+ *							pointers, indexed the same way as 'partitions';
+ *							only the entries for partitions whose rowtype
+ *							differs from the root parent's have a non-NULL
+ *							value.  NULL if all of the partitions encountered
+ *							by a given command happen to have the same rowtype
+ *							as the root parent.
+ *
+ * child_parent_tupconv_maps	As 'parent_child_tupconv_maps' but stores
+ *							conversion maps to translate partition tuples into
+ *							partition_root's rowtype, needed if transition
+ *							capture is active
+ *
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype, if that
+ *							rowtype differs from the root's.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
+	int			partitions_allocsize;
 	TupleConversionMap **parent_child_tupconv_maps;
 	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
 	TupleTableSlot **partition_tuple_slots;
 	TupleTableSlot *root_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 } PartitionTupleRouting;
 
+/*
+ * Accessor macros for tuple conversion maps contained in
+ * PartitionTupleRouting.  Beware of multiple evaluations of p!
+ */
+#define PartitionChildToParentMap(p, i) \
+			((p)->child_parent_tupconv_maps != NULL ? \
+				(p)->child_parent_tupconv_maps[(i)] : \
+							NULL)
+
+#define PartitionParentToChildMap(p, i) \
+			((p)->parent_child_tupconv_maps != NULL ? \
+				(p)->parent_child_tupconv_maps[(i)] : \
+							NULL)
+
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -175,22 +223,20 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern int ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *resultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
+extern ResultRelInfo *ExecGetPartitionInfo(ModifyTableState *mtstate,
+					 ResultRelInfo *resultRelInfo,
+					 PartitionTupleRouting *proute,
+					 EState *estate, int partidx);
 extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
 					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#48Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#47)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/04 19:07, David Rowley wrote:

On 1 November 2018 at 22:39, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
I've attached v13 which hopefully addresses these.

Thank you for updating the patch.

The macro naming discussion got me thinking today about the macro itself.
It encapsulates access to the various PartitionTupleRouting arrays
containing the maps, but maybe we've got the interface of tuple routing a
bit (maybe a lot given this thread!) wrong to begin with. Instead of
ExecFindPartition returning indexes into various PartitionTupleRouting
arrays and its callers then using those indexes to fetch various objects
from those arrays, why doesn't it return those objects itself? Although
we made sure that the callers don't need to worry about the meaning of
these indexes changing with this patch, it still seems a bit odd for them
to have to go back to those arrays to get various objects.

How about we change ExecFindPartition's interface so that it returns the
ResultRelInfo, the two maps, and the partition slot? So, the arrays
simply become a cache for ExecFindPartition et al and are no longer
accessed outside execPartition.c. Although that makes the interface of
ExecFindPartition longer, I think it reduces overall complexity.

I don't really think stuffing values into a bunch of output parameters
is much of an improvement. I'd rather allow callers to fetch what they
need using the index we return. Most callers don't need to know about
the child to parent maps, so it seems nicer for those places not to
have to invent a dummy variable to pass along to ExecFindPartition()
so it can needlessly populate it for them.

Well, if a caller finds a partition using ExecFindPartition, it's going to
need to fetch those other objects anyway. Both of its callers that exist
today, CopyFrom and ExecPrepareTupleRouting, fetch both maps and the slot
in the same code block as ExecFindPartition.
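
To illustrate, this is roughly what ExecPrepareTupleRouting does with the
returned index in the v13 patch (condensed from the diff upthread):

	partidx = ExecFindPartition(mtstate, targetRelInfo, proute, slot,
								estate);
	partrel = proute->partitions[partidx];

	/* root-to-partition map for converting the routed tuple */
	map = PartitionParentToChildMap(proute, partidx);

	/* partition-to-root map, needed for transition capture */
	mtstate->mt_transition_capture->tcs_map =
		PartitionChildToParentMap(proute, partidx);

Returning the ResultRelInfo with the maps and slot reachable from it
would collapse those separate array lookups into the one call.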

Perhaps a better design would be to instead of having random special
partitioned-table-only fields in ResultRelInfo, just have an extra
struct there that contains the extra partition information which would
include the translation maps and then just return the ResultRelInfo
and allow callers to lookup any extra details they require.

IIUC, you're saying that we could introduce a new struct that contains
auxiliary information needed by tuple routing (maps, slot, etc. for
tuple conversion) and add a new ResultRelInfo member of that struct type.
That way, there is no need to return them separately or return an index to
access them from their arrays. I guess we won't even need the arrays we
have now. I think that might be a good idea and simplifies things
significantly.

It reminds me of how ResultRelInfo grew a ri_onConflict member of type
OnConflictSetState [1]. We decided to go that way, as opposed to the
earlier approach of having arrays of num_partitions length in
ModifyTableState or PartitionTupleRouting that contained ON CONFLICT
related objects for individual partitions.
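
(That commit gave ResultRelInfo a member along these lines:

	OnConflictSetState *ri_onConflict;	/* ON CONFLICT evaluation state */

so there is already precedent for hanging per-partition state off
ResultRelInfo.)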

I've not
looked into this in detail, but I think the committer work that's
required for the patch as it is today is already quite significant.
I'm not keen on warding any willing one off by making the commit job
any harder. I agree that it would be good to stabilise the API for
all this partitioning code sometime, but I don't believe it needs to
be done all in one commit. My intent here is to improve performance of
INSERT and UPDATE on partitioned tables. Perhaps we can shape some API
redesign later in the release cycle. What do you think?

I do suspect that simplifying ExecFindPartition's interface as part of
patch will make a committer's life easier, as the resulting interface is
simpler, especially if we revise it like you suggest above.

Thanks,
Amit

[1]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=555ee77a9

#49Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: David Rowley (#47)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 11/4/18 5:07 AM, David Rowley wrote:

I've attached v13 which hopefully addresses these.

I ran a test of the INSERT case using a table with 64 hash partitions.

Non-partitioned table: ~73k TPS
Master: ~25k TPS
0001: ~26k TPS
0001 + 0002: ~68k TPS

The profile of 0001 shows that almost all of the time in
ExecSetupPartitionTupleRouting() is spent in find_all_inheritors(), hence
the last test with a rebase of the original v1-0002 patch.

Best regards,
Jesper

#50David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#48)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 5 November 2018 at 20:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/11/04 19:07, David Rowley wrote:

Perhaps a better design would be to instead of having random special
partitioned-table-only fields in ResultRelInfo, just have an extra
struct there that contains the extra partition information which would
include the translation maps and then just return the ResultRelInfo
and allow callers to lookup any extra details they require.

IIUC, you're saying that we could introduce a new struct that contains
auxiliary information needed by partition pruning (maps, slot, etc. for
tuple conversion) and add a new ResultRelInfo member of that struct type.
That way, there is no need to return them separately or return an index to
access them from their arrays. I guess we won't even need the arrays we
have now. I think that might be a good idea and simplifies things
significantly.

I've attached a patch which does this. It adds a new struct named
PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
out of PartitionTupleRouting. The PartitionRoutingInfo struct has a
field for each of the things those arrays used to store.
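
To give an idea of its shape, the new struct boils down to something
like this (a condensed sketch; the field names follow the map renaming
described below, so see the attached patch for the authoritative
definition):

	typedef struct PartitionRoutingInfo
	{
		/* map to convert tuples in the root's rowtype to this
		 * partition's rowtype, or NULL if no conversion is needed */
		TupleConversionMap *pi_RootToPartitionMap;

		/* map for the reverse direction, used by transition capture */
		TupleConversionMap *pi_PartitionToRootMap;

		/* slot to hold tuples converted into the partition's rowtype */
		TupleTableSlot *pi_PartitionTupleSlot;
	} PartitionRoutingInfo;

Each routing-target ResultRelInfo points at one of these via its new
ri_PartitionInfo field, so callers can reach the maps directly from the
ResultRelInfo they are given.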

While doing this I kept glancing back over at ModifyTableState and at
the mt_per_subplan_tupconv_maps array. I wondered if it would be
better to make the PartitionRoutingInfo a bit more generic, perhaps
call it TupleConversionInfo and have fields like ti_ToGeneric and
ti_FromGeneric, with the idea that "generic" would be the root
partition or the first subplan for inheritance updates. This would
allow us to get rid of a good chunk of code inside nodeModifyTable.c.
However, I ended up not doing this and left PartitionRoutingInfo to be
specifically for Root to Partition conversion.
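
Had I gone that way, I imagine it would have looked something like the
following (illustrative only; this struct does not appear in the
attached patch):

	typedef struct TupleConversionInfo
	{
		/* this rel's rowtype -> the "generic" rowtype, i.e. the root
		 * partitioned table or the first subplan's rel */
		TupleConversionMap *ti_ToGeneric;

		/* the "generic" rowtype -> this rel's rowtype */
		TupleConversionMap *ti_FromGeneric;
	} TupleConversionInfo;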

Also, on the topic about what to name the conversion maps from a few
days ago; After looking at this a bit more I decided that having them
named ParentToChild and ChildToParent is misleading. If the child is
the child of some sub-partitioned table then the parent that the map
is talking about is not the partition's parent, but the root
partitioned table. So really RootToPartition and PartitionToRoot seem
like much more accurate names for the maps.

I made a few other changes along the way; I thought that
ExecFindPartition() would be a good candidate to take on the
responsibility of validating the partition is valid for INSERTs when
it uses a partition out of the subplan_resultrel_hash. I thought it
was better to check this once when we're in the code path of grabbing
the ResultRelInfo out that hash table rather than in a code path that
must check if it's been done already each time we route a tuple into a
partition. It also allowed me to get rid of
ri_PartitionReadyForRouting. I also moved the call to
ExecInitRoutingInfo() into ExecFindPartition() which allowed me to
make that function static, which could result in the generation of
slightly more optimal compiled code.
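
For reference, ExecFindPartition's signature in v14 boils down to the
following (parameter names approximate):

	extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
											ResultRelInfo *rootResultRelInfo,
											PartitionTupleRouting *proute,
											TupleTableSlot *slot,
											EState *estate);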

Please find attached the v14 patch.

Rather nicely, git --stat reports a net reduction in code, even with
the 48 lines of new tests included:

11 files changed, 632 insertions(+), 660 deletions(-)

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v14-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patchapplication/octet-stream; name=v14-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patchDownload
From fb335b9be5637993517678032b93f5efa332966f Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v14] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting.  The
setup phase now does far less work up front, pushing more work out to
when partitions first receive tuples.
PartitionDispatchData structs for sub-partitioned tables are only created
when a tuple gets routed through it. The possibly large arrays in the
PartitionTupleRouting struct have largely been removed.  The partitions[]
array remains but now never contains any NULL gaps.  Previously the NULLs
had to be skipped during ExecCleanupTupleRouting(), which could add a
large overhead to the cleanup when the number of partitions was large.
The partitions[] array is allocated small to start with and only enlarged
when we route tuples to enough partitions that it runs out of space. This
allows us to keep simple single-row partition INSERTs running quickly.

The arrays in PartitionTupleRouting which stored the tuple translation
maps have now been removed.  These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.

In passing, also rename the tuple translation maps from ParentToChild
and ChildToParent to RootToPartition and PartitionToRoot. The old
names mislead you into thinking that a partition of some sub-partitioned
table would translate to the rowtype of the sub-partitioned table rather
than the root partitioned table.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  86 +--
 src/backend/executor/execMain.c               |   2 +-
 src/backend/executor/execPartition.c          | 830 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 145 +----
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 156 ++---
 src/include/nodes/execnodes.h                 |   5 +-
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 11 files changed, 632 insertions(+), 660 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..523eb2f995 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate)
 	bool	   *nulls;
 	ResultRelInfo *resultRelInfo;
 	ResultRelInfo *target_resultRelInfo;
+	ResultRelInfo *prevResultRelInfo = NULL;
 	EState	   *estate = CreateExecutorState(); /* for ExecConstraints() */
 	ModifyTableState *mtstate;
 	ExprContext *econtext;
@@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate)
 	CopyInsertMethod insertMethod;
 	uint64		processed = 0;
 	int			nBufferedTuples = 0;
-	int			prev_leaf_part_index = -1;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing needs to know whether transition
+	 * capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition() below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2692,25 +2685,17 @@ CopyFrom(CopyState cstate)
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
-			int			leaf_part_index;
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found or
+			 * if the found partition is not suitable for INSERTs.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < proute->num_partitions);
-
-			if (prev_leaf_part_index != leaf_part_index)
+			resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo,
+											  proute, slot, estate);
+
+			if (prevResultRelInfo != resultRelInfo)
 			{
 				/* Check if we can multi-insert into this partition */
 				if (insertMethod == CIM_MULTI_CONDITIONAL)
@@ -2723,12 +2708,9 @@ CopyFrom(CopyState cstate)
 					if (nBufferedTuples > 0)
 					{
 						ExprContext *swapcontext;
-						ResultRelInfo *presultRelInfo;
-
-						presultRelInfo = proute->partitions[prev_leaf_part_index];
 
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-											presultRelInfo, myslot, bistate,
+											prevResultRelInfo, myslot, bistate,
 											nBufferedTuples, bufferedTuples,
 											firstBufferedLineNo);
 						nBufferedTuples = 0;
@@ -2785,21 +2767,6 @@ CopyFrom(CopyState cstate)
 					}
 				}
 
-				/*
-				 * Overwrite resultRelInfo with the corresponding partition's
-				 * one.
-				 */
-				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
-
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 											  resultRelInfo->ri_TrigDesc->trig_insert_before_row);
@@ -2825,7 +2792,7 @@ CopyFrom(CopyState cstate)
 				 * buffer when the partition being inserted into changes.
 				 */
 				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
+				prevResultRelInfo = resultRelInfo;
 			}
 
 			/*
@@ -2835,7 +2802,7 @@ CopyFrom(CopyState cstate)
 
 			/*
 			 * If we're capturing transition tuples, we might need to convert
-			 * from the partition rowtype to parent rowtype.
+			 * from the partition rowtype to root rowtype.
 			 */
 			if (cstate->transition_capture != NULL)
 			{
@@ -2848,8 +2815,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						resultRelInfo->ri_PartitionInfo->pi_PartitionToRootMap;
 				}
 				else
 				{
@@ -2863,18 +2829,18 @@ CopyFrom(CopyState cstate)
 			}
 
 			/*
-			 * We might need to convert from the parent rowtype to the
-			 * partition rowtype.
+			 * We might need to convert from the root rowtype to the partition
+			 * rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = resultRelInfo->ri_PartitionInfo->pi_RootToPartitionMap;
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
 				MemoryContext oldcontext;
 
-				Assert(proute->partition_tuple_slots != NULL &&
-					   proute->partition_tuple_slots[leaf_part_index] != NULL);
-				new_slot = proute->partition_tuple_slots[leaf_part_index];
+				new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot;
+				Assert(new_slot != NULL);
+
 				slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 
 				/*
@@ -3019,12 +2985,8 @@ CopyFrom(CopyState cstate)
 	{
 		if (insertMethod == CIM_MULTI_CONDITIONAL)
 		{
-			ResultRelInfo *presultRelInfo;
-
-			presultRelInfo = proute->partitions[prev_leaf_part_index];
-
 			CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-								presultRelInfo, myslot, bistate,
+								prevResultRelInfo, myslot, bistate,
 								nBufferedTuples, bufferedTuples,
 								firstBufferedLineNo);
 		}
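
The CopyFrom() hunks above keep the pre-existing batching rule, just keyed on
the ResultRelInfo pointer rather than an array index: tuples are buffered per
target partition, and the buffer is flushed when the routed-to partition
changes or the batch fills. The following self-contained sketch shows only
that rule; BATCH_SIZE, Part and flush_batch() are illustrative stand-ins, not
CopyFrom's real symbols:

#include <stdio.h>

#define BATCH_SIZE 4			/* stand-in for CopyFrom's real limits */

typedef struct Part { const char *name; } Part;

static void
flush_batch(Part *part, int nbuffered)
{
	if (nbuffered > 0)
		printf("multi-insert %d tuple(s) into %s\n", nbuffered, part->name);
}

int
main(void)
{
	Part		a = {"part_a"}, b = {"part_b"};
	Part	   *route[] = {&a, &a, &b, &b, &b, &a};	/* per-tuple routing */
	Part	   *prev = NULL;	/* plays the role of prevResultRelInfo */
	int			nbuffered = 0;
	int			i;

	for (i = 0; i < 6; i++)
	{
		if (prev != route[i])	/* partition changed: flush pending batch */
		{
			flush_batch(prev, nbuffered);
			nbuffered = 0;
			prev = route[i];
		}
		if (++nbuffered == BATCH_SIZE)	/* batch full: flush eagerly */
		{
			flush_batch(prev, nbuffered);
			nbuffered = 0;
		}
	}
	flush_batch(prev, nbuffered);	/* final flush, as after the COPY loop */
	return 0;
}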
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ba156f8c5f..6385ebdfd0 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1343,7 +1343,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 
 	resultRelInfo->ri_PartitionCheck = partition_check;
 	resultRelInfo->ri_PartitionRoot = partition_root;
-	resultRelInfo->ri_PartitionReadyForRouting = false;
+	resultRelInfo->ri_PartitionInfo = NULL;		/* May be set later */
 }
 
 /*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1e72e9fb3f..930349ac47 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,10 +31,13 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
+
+/* Hash table entry used by ExecHashSubPlanResultRelsByOid() */
+typedef struct SubplanResultRelHashElem
+{
+	Oid			relid;			/* hash key -- must come first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * hierarchy required to route a tuple to any of its partitions.  A
+ * PartitionDispatch is always encapsulated inside a PartitionTupleRouting
+ * struct and stored inside its 'partition_dispatch_info' array.
  *
  *	reldesc		Relation descriptor of the table
  *	key			Partition key information of the table
@@ -45,9 +48,14 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For leaf partitions the
+ *				index into the encapsulating PartitionTupleRouting's
+ *				'partitions' array is stored.  When the partition is itself a
+ *				partitioned table then we store the index into the
+ *				encapsulating PartitionTupleRouting's
+ *				'partition_dispatch_info' array.  An index of -1 means we've
+ *				not yet allocated anything in PartitionTupleRouting for the
+ *				partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +66,23 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecCheckPartitionArraySpace(PartitionTupleRouting *proute);
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static void ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					ResultRelInfo *partRelInfo);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +109,111 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition().  The actual ResultRelInfo for a partition is only
+ * allocated when the first tuple is routed there.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
 
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built
+	 * on demand, only when we actually need to route a tuple to that
+	 * partition.  The reason is that a common case is an INSERT inserting a
+	 * single tuple into a partitioned table, and that must be fast.
+	 *
+	 * We initially allocate enough memory for PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch pointers and the same number of ResultRelInfo
+	 * pointers in the 'partitions' array.
+	 * More space can be allocated later if we end up routing tuples to more
+	 * than that many partitions.
+	 *
+	 * Initially we need only set up one PartitionDispatch object; the one for
+	 * the partitioned table that's the target of the command.  If we must
+	 * route a tuple via some sub-partitioned table, then its
+	 * PartitionDispatch is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			update_rri_index++;
-		}
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 *
+	 * Also, we'll need a slot that will transiently store the tuple being
+	 * routed using the root parent's rowtype.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+	{
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
+	else
+	{
+		proute->subplan_resultrel_hash = NULL;
+		proute->root_tuple_slot = NULL;
+	}
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find and return the ResultRelInfo for the leaf
+ * partition for the tuple contained in *slot.
+ *
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.  When reusing a
+ * ResultRelInfo from the mtstate we verify that the relation is a valid
+ * target for INSERTs and then set up a PartitionRoutingInfo for it.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message.  An error may also be raised if the found target partition
+ * not a valid target for an INSERT.
  */
-int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ResultRelInfo *
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -228,17 +226,18 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate, true);
+	if (rootResultRelInfo->ri_PartitionCheck)
+		ExecPartitionCheck(rootResultRelInfo, slot, estate, true);
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +259,235 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			ResultRelInfo   *rri;
+
+			/*
+			 * Look to see if we've already got a ResultRelInfo for this
+			 * partition.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				rri = proute->partitions[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				int			rri_index = -1;
+
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or build a
+				 * new one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					Oid			partoid = partdesc->oids[partidx];
+					SubplanResultRelHashElem *elem;
+
+					elem = (SubplanResultRelHashElem *)
+						hash_search(proute->subplan_resultrel_hash,
+									&partoid, HASH_FIND, NULL);
+
+					if (elem)
+					{
+						/* Found one! */
+						rri = elem->rri;
+
+						/* Verify this ResultRelInfo allows INSERTs */
+						CheckValidResultRel(rri, CMD_INSERT);
+
+						/* This shouldn't have been set up yet */
+						Assert(rri->ri_PartitionInfo == NULL);
+
+						/* Setup the PartitionRoutingInfo for it */
+						ExecInitRoutingInfo(mtstate, estate, rri);
+
+						rri_index = proute->num_partitions++;
+						dispatch->indexes[partidx] = rri_index;
+
+						ExecCheckPartitionArraySpace(proute);
+
+						/*
+						 * Store it in the partitions array so we don't have
+						 * to look it up again.
+						 */
+						proute->partitions[rri_index] = rri;
+					}
+				}
+
+				/* We need to create a new one. */
+				if (rri_index < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					rri = ExecInitPartitionInfo(mtstate, rootResultRelInfo,
+												proute, estate,
+												dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return rri;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch.
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set each subplan ResultRelInfo's
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
+
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* A partition was not found. */
-	if (result < 0)
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
-	}
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem *elem;
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
 
-	return result;
+		if (!found)
+			elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
+	}
+}
+
+/*
+ * ExecCheckPartitionArraySpace
+ *		Ensure there's enough space in the 'partitions' array of 'proute'
+ */
+static void
+ExecCheckPartitionArraySpace(PartitionTupleRouting *proute)
+{
+	if (proute->num_partitions >= proute->partitions_allocsize)
+	{
+		proute->partitions_allocsize *= 2;
+		proute->partitions = (ResultRelInfo **)
+			repalloc(proute->partitions, sizeof(ResultRelInfo *) *
+					 proute->partitions_allocsize);
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
+ *		and store it in the next empty slot in proute's partitions array.
  *
  * Returns the ResultRelInfo
  */
-ResultRelInfo *
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +663,22 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	ExecCheckPartitionArraySpace(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, leaf_part_rri);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +691,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +704,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +718,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +728,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +742,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,9 +833,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
 	return leaf_part_rri;
@@ -689,30 +840,32 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 
 /*
  * ExecInitRoutingInfo
- *		Set up information needed for routing tuples to a leaf partition
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format.
  */
-void
+static void
 ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx)
+					ResultRelInfo *partRelInfo)
 {
 	MemoryContext oldContext;
+	PartitionRoutingInfo *partrouteinfo;
 
 	/*
 	 * Switch into per-query memory context.
 	 */
 	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
 
+	partrouteinfo = palloc(sizeof(PartitionRoutingInfo));
+
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
-		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
-							   RelationGetDescr(partRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	partrouteinfo->pi_RootToPartitionMap =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   gettext_noop("could not convert row type"));
 
 	/*
 	 * If a partition has a different rowtype than the root parent, initialize
@@ -720,28 +873,36 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * for various operations that are applied to tuples after routing, such
 	 * as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (partrouteinfo->pi_RootToPartitionMap != NULL)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
-		if (proute->partition_tuple_slots == NULL)
-			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
-
 		/*
 		 * Initialize the slot itself setting its descriptor to this
 		 * partition's TupleDesc; TupleDesc reference will be released at the
 		 * end of the command.
 		 */
-		proute->partition_tuple_slots[partidx] =
-			ExecInitExtraTupleSlot(estate,
-								   RelationGetDescr(partrel));
+		partrouteinfo->pi_PartitionTupleSlot =
+							ExecInitExtraTupleSlot(estate,
+												   RelationGetDescr(partrel));
 	}
+	else
+		partrouteinfo->pi_PartitionTupleSlot = NULL;
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from partition's rowtype to the root partition table's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		partrouteinfo->pi_PartitionToRootMap =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+	}
+	else
+		partrouteinfo->pi_PartitionToRootMap = NULL;
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -753,71 +914,92 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	MemoryContextSwitchTo(oldContext);
 
-	partRelInfo->ri_PartitionReadyForRouting = true;
+	partRelInfo->ri_PartitionInfo = partrouteinfo;
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record the index into this array in the 'parent_pd'
+ *		array.  Also, record that array index in the partidx element of
+ *		parent_pd's indexes[] array so that we can properly retrieve the
+ *		newly created PartitionDispatch later.
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches.
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1012,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -852,35 +1034,29 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
-		/* Allow any FDWs to shut down if they've been exercised */
-		if (resultRelInfo->ri_PartitionReadyForRouting &&
-			resultRelInfo->ri_FdwRoutine != NULL &&
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
-														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one of the node's subplan result rels;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
 
+		/* Allow any FDWs to shut down if they've been exercised */
+		if (resultRelInfo->ri_FdwRoutine != NULL &&
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
+														   resultRelInfo);
+
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
@@ -890,144 +1066,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
-}
-
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
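
To summarise the new control flow in ExecFindPartition(): dispatch->indexes[]
starts out all -1, and once an element is built it holds an offset into
'partitions' (for a leaf) or into 'partition_dispatch_info' (for a
sub-partitioned table). Below is a compilable toy model of the descent,
assuming everything has already been built; the real code builds missing
entries on demand. Dispatch, route[] and the arrays are illustrative
stand-ins, not the executor's types:

#include <assert.h>
#include <stdio.h>

#define NOT_BUILT (-1)

typedef struct Dispatch
{
	const int  *is_leaf;		/* 1 if partition i is a leaf */
	const int  *indexes;		/* NOT_BUILT, or an offset: leaf ->
								 * 'partitions', non-leaf -> dispatches */
} Dispatch;

int
main(void)
{
	/* root: partition 0 is a leaf, partition 1 is sub-partitioned */
	static const int root_is_leaf[] = {1, 0};
	static const int root_indexes[] = {0, 1};	/* leaf slot 0; dispatch 1 */
	/* the sub-partitioned table has a single leaf partition */
	static const int sub_is_leaf[] = {1};
	static const int sub_indexes[] = {1};	/* leaf slot 1 */
	Dispatch	root = {root_is_leaf, root_indexes};
	Dispatch	sub = {sub_is_leaf, sub_indexes};
	Dispatch   *dispatches[] = {&root, &sub};

	/* route down partition 1 of the root, then partition 0 of the sub */
	int			route[] = {1, 0};
	Dispatch   *d = dispatches[0];	/* always start at the target table */
	int			level = 0;

	for (;;)
	{
		int			partidx = route[level++];

		assert(d->indexes[partidx] != NOT_BUILT);
		if (d->is_leaf[partidx])
		{
			printf("leaf ResultRelInfo slot: %d\n", d->indexes[partidx]);
			break;
		}
		/* sub-partitioned table: descend via the dispatch offset */
		d = dispatches[d->indexes[partidx]];
	}
	return 0;
}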
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 528f58717e..aafeea3a8c 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1665,7 +1664,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1698,52 +1697,21 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						TupleTableSlot *slot)
 {
 	ModifyTable *node;
-	int			partidx;
 	ResultRelInfo *partrel;
+	PartitionRoutingInfo *partrouteinfo;
 	HeapTuple	tuple;
 	TupleConversionMap *map;
 
 	/*
-	 * Determine the target partition.  If ExecFindPartition does not find a
-	 * partition after all, it doesn't return here; otherwise, the returned
-	 * value is to be used as an index into the arrays for the ResultRelInfo
-	 * and TupleConversionMap for the partition.
-	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
-	Assert(partidx >= 0 && partidx < proute->num_partitions);
-
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
+	 * Look up the target partition's ResultRelInfo.  If ExecFindPartition
+	 * does not find a valid partition for the tuple in 'slot' then an error
+	 * is raised.  An error may also be raised if the found partition is not
+	 * a valid target for INSERTs.  This check is required since, when an
+	 * UPDATE moves a tuple to another partition, it becomes a DELETE+INSERT
+	 * on that partition.
 	 */
-	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
-
-	/*
-	 * Check whether the partition is routable if we didn't yet
-	 *
-	 * Note: an UPDATE of a partition key invokes an INSERT that moves the
-	 * tuple to a new partition.  This check would be applied to a subplan
-	 * partition of such an UPDATE that is chosen as the partition to route
-	 * the tuple to.  The reason we do this check here rather than in
-	 * ExecSetupPartitionTupleRouting is to avoid aborting such an UPDATE
-	 * unnecessarily due to non-routable subplan partitions that may not be
-	 * chosen for update tuple movement after all.
-	 */
-	if (!partrel->ri_PartitionReadyForRouting)
-	{
-		/* Verify the partition is a valid target for INSERT. */
-		CheckValidResultRel(partrel, CMD_INSERT);
-
-		/* Set up information needed for routing tuples to the partition. */
-		ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
-	}
+	partrel = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
+	partrouteinfo = partrel->ri_PartitionInfo;
+	Assert(partrouteinfo != NULL);
 
 	/*
 	 * Make it look like we are inserting into the partition.
@@ -1755,7 +1723,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 	/*
 	 * If we're capturing transition tuples, we might need to convert from the
-	 * partition rowtype to parent rowtype.
+	 * partition rowtype to root partitioned table's rowtype.
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
@@ -1768,7 +1736,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+									partrouteinfo->pi_PartitionToRootMap;
 		}
 		else
 		{
@@ -1783,20 +1751,17 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+								partrouteinfo->pi_PartitionToRootMap;
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = partrouteinfo->pi_RootToPartitionMap;
 	if (map != NULL)
 	{
-		TupleTableSlot *new_slot;
+		TupleTableSlot *new_slot = partrouteinfo->pi_PartitionTupleSlot;
 
-		Assert(proute->partition_tuple_slots != NULL &&
-			   proute->partition_tuple_slots[partidx] != NULL);
-		new_slot = proute->partition_tuple_slots[partidx];
 		slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 	}
 
@@ -1834,17 +1799,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1820,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
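
The net effect of the nodeModifyTable.c hunks is that the per-leaf map array
and the subplan-offset translation are gone; the per-subplan map array is the
only representation, built lazily on first use. A minimal sketch of that
lazy-build pattern (illustrative names only, not the real executor API):

#include <stdlib.h>

typedef struct Map Map;			/* opaque stand-in for TupleConversionMap */

static Map **subplan_maps;		/* NULL until the first lookup */
static int	nplans = 4;

static Map *
map_for_subplan(int whichplan)
{
	if (subplan_maps == NULL)	/* nobody built the array yet: do it now */
		subplan_maps = calloc(nplans, sizeof(Map *));
	return subplan_maps[whichplan];	/* NULL means no conversion needed */
}

int
main(void)
{
	return map_for_subplan(0) == NULL ? 0 : 1;
}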
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
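The partcache.c change above trades a relcache/syscache lookup per routing
step for a single lookup per partition at the time the PartitionDesc is
built. A standalone sketch of the idea follows; get_relkind() here is a fake
stand-in for the real get_rel_relkind(), and the OIDs are made up:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef unsigned int Oid;

#define RELKIND_PARTITIONED_TABLE 'p'

/* fake stand-in for get_rel_relkind()'s syscache lookup */
static char
get_relkind(Oid relid)
{
	return (relid % 2) ? 'r' : RELKIND_PARTITIONED_TABLE;
}

int
main(void)
{
	Oid			oids[] = {101, 102, 103};
	int			nparts = 3;
	bool	   *is_leaf = malloc(nparts * sizeof(bool));
	int			i;

	/* one relkind lookup per partition, done once at build time */
	for (i = 0; i < nparts; i++)
		is_leaf[i] = (get_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);

	/* tuple routing can now test a cached boolean instead */
	for (i = 0; i < nparts; i++)
		printf("partition %u: %s\n", oids[i],
			   is_leaf[i] ? "leaf" : "partitioned");
	free(is_leaf);
	return 0;
}
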
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..78b9ac85c2 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -21,70 +21,101 @@
 /* See execPartition.c for the definition. */
 typedef struct PartitionDispatchData *PartitionDispatch;
 
+/*
+ * PartitionRoutingInfo
+ *
+ * Additional result relation information specific to routing tuples to a
+ * table partition.
+ */
+typedef struct PartitionRoutingInfo
+{
+	/*
+	 * Map for converting tuples in root partitioned table format into
+	 * partition format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap	   *pi_RootToPartitionMap;
+
+	/*
+	 * Map for converting tuples in partition format into the root partitioned
+	 * table format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap	   *pi_PartitionToRootMap;
+
+	/*
+	 * Slot to store tuples in partition format, or NULL when no translation
+	 * is required between root and partition.
+	 */
+	TupleTableSlot		   *pi_PartitionTupleSlot;
+} PartitionRoutingInfo;
+
 /*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
- *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to a PartitionDispatch object for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for a
+ *							new PartitionDispatch that needs to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing pointers to the ResultRelInfos of all
+ *							leaf partitions touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for a new
+ *							ResultRelInfo that needs to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.
+ * Note: The following fields are used only when UPDATE ends up needing to
+ * do tuple routing.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of a ModifyTable node.  Some of these may be
+ *							useful for tuple routing to save having to build
+ *							duplicates.
+ *
+ * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
+ *							used to transiently store a tuple using the root
+ *							table's rowtype after converting it from the
+ *							tuple's source leaf partition's rowtype.  That is,
+ *							if the leaf partition's rowtype differs.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
 {
+	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	Oid		   *partition_oids;
+	int			dispatch_allocsize;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	TupleConversionMap **parent_child_tupconv_maps;
-	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot **partition_tuple_slots;
+	int			partitions_allocsize;
 	TupleTableSlot *root_tuple_slot;
+	HTAB	   *subplan_resultrel_hash;
 } PartitionTupleRouting;
 
 /*
@@ -175,22 +206,11 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 880a03e4e4..8efc80f710 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -33,6 +33,7 @@
 
 
 struct PlanState;				/* forward references in this file */
+struct PartitionRoutingInfo;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -469,8 +470,8 @@ typedef struct ResultRelInfo
 	/* relation descriptor for root partitioned table */
 	Relation	ri_PartitionRoot;
 
-	/* true if ready for tuple routing */
-	bool		ri_PartitionReadyForRouting;
+	/* Additional information that's specific to partition tuple routing */
+	struct PartitionRoutingInfo *ri_PartitionInfo;
 } ResultRelInfo;
 
 /* ----------------
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#51Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: David Rowley (#50)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi David,

On 11/7/18 6:46 AM, David Rowley wrote:

I've attached a patch which does this. It adds a new struct named
PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
out of PartitionTupleRouting. The PartitionRoutingInfo struct has a
field for everything those arrays used to store.

While doing this I kept glancing back over at ModifyTableState and at
the mt_per_subplan_tupconv_maps array. I wondered if it would be
better to make the PartitionRoutingInfo a bit more generic, perhaps
call it TupleConversionInfo and have fields like ti_ToGeneric and
ti_FromGeneric, with the idea that "generic" would be the root
partition or the first subplan for inheritance updates. This would
allow us to get rid of a good chunk of code inside nodeModifyTable.c.
However, I ended up not doing this and left PartitionRoutingInfo to be
specifically for Root to Partition conversion.

Yeah, it doesn't necessarily have to be part of this patch.

Also, on the topic of what to name the conversion maps from a few
days ago: after looking at this a bit more, I decided that having them
named ParentToChild and ChildToParent is misleading. If the child is
the child of some sub-partitioned table, then the parent that the map
is talking about is not the partition's parent, but the root
partitioned table. So really, RootToPartition and PartitionToRoot seem
like much more accurate names for the maps.

Agreed.
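
To make that concrete: for a hypothetical hierarchy root -> sub ->
leaf, both maps hanging off leaf's ResultRelInfo convert to and from
root's rowtype; sub's rowtype never appears. A minimal sketch using the
patch's field names (leaf_rri is a stand-in for the leaf partition's
ResultRelInfo):

    PartitionRoutingInfo *pi = leaf_rri->ri_PartitionInfo;

    /* RootToPartition: root-format tuple -> leaf format, applied just
     * before the tuple is inserted into leaf */
    if (pi->pi_RootToPartitionMap != NULL)
        slot = execute_attr_map_slot(pi->pi_RootToPartitionMap->attrMap,
                                     slot, pi->pi_PartitionTupleSlot);

    /* PartitionToRoot: leaf format -> root format, used e.g. when
     * capturing transition tuples; again, sub plays no part */
    map = pi->pi_PartitionToRootMap;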

I made a few other changes along the way; I thought that
ExecFindPartition() would be a good candidate to take on the
responsibility of validating that the partition is a valid target for
INSERTs when it uses a partition out of the subplan_resultrel_hash. I
thought it was better to check this once, in the code path that grabs
the ResultRelInfo out of that hash table, rather than in a code path
that must check whether it's been done already each time we route a
tuple into a partition. It also allowed me to get rid of
ri_PartitionReadyForRouting. I also moved the call to
ExecInitRoutingInfo() into ExecFindPartition(), which allowed me to
make that function static, which could result in the generation of
slightly more efficient compiled code.

Please find attached the v14 patch.

Passes check-world, and has detailed documentation about the changes :)

Best regards,
Jesper

#52Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#50)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/07 20:46, David Rowley wrote:

On 5 November 2018 at 20:17, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

On 2018/11/04 19:07, David Rowley wrote:

Perhaps a better design would be, instead of having random special
partitioned-table-only fields in ResultRelInfo, to just have an extra
struct there that contains the extra partition information, including
the translation maps, and then just return the ResultRelInfo and
allow callers to look up any extra details they require.

IIUC, you're saying that we could introduce a new struct that contains
auxiliary information needed by partition tuple routing (maps, slot, etc.
for tuple conversion) and add a new ResultRelInfo member of that struct
type. That way, there is no need to return them separately or return an
index to access them from their arrays. I guess we won't even need the
arrays we have now. I think that might be a good idea, and it simplifies
things significantly.

I've attached a patch which does this.

Thank you for updating the patch this way.

It adds a new struct named
PartitionRoutingInfo into ResultRelInfo and pulls 3 of the 4 arrays
out of PartitionTupleRouting. The PartitionRoutingInfo struct has a
field for everything those arrays used to store.

While doing this I kept glancing back over at ModifyTableState and at
the mt_per_subplan_tupconv_maps array. I wondered if it would be
better to make the PartitionRoutingInfo a bit more generic, perhaps
call it TupleConversionInfo and have fields like ti_ToGeneric and
ti_FromGeneric, with the idea that "generic" would be the root
partition or the first subplan for inheritance updates. This would
allow us to get rid of a good chunk of code inside nodeModifyTable.c.
However, I ended up not doing this and left PartitionRoutingInfo to be
specifically for Root to Partition conversion.

I think it's good that you left mt_per_subplan_tupconv_maps out of this.
UPDATE tuple routing can be said to have two steps: the first, a tiny
one, converts the tuple that needs to be routed from the source
partition's rowtype to the root's rowtype so that tuple routing proper
can begin; the second is the actual tuple routing, carried out using
PartitionTupleRouting. The first step is handled by nodeModifyTable.c,
so any data structures related to it should be in ModifyTableState.

Actually, as I also proposed upthread, we should move root_tuple_slot from
PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
it's part of the first step described above, which has nothing to do with
partition tuple routing proper. We can make PartitionTupleRouting private
to execPartition.c if we do that.
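
As a simplified sketch of how the two steps then fit together (names
are from the v15 patch; surrounding declarations and error handling are
elided):

    /* Step 1 (nodeModifyTable.c): convert the tuple from the source
     * partition's rowtype to the root table's rowtype, if they differ */
    tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
    if (tupconv_map != NULL)
        slot = execute_attr_map_slot(tupconv_map->attrMap, slot,
                                     mtstate->mt_root_tuple_slot);

    /* Step 2 (execPartition.c): tuple routing proper, starting from the
     * root and using the now root-format tuple */
    resultRelInfo = ExecFindPartition(mtstate, rootResultRelInfo,
                                      proute, slot, estate);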

Also, on the topic of what to name the conversion maps from a few
days ago: after looking at this a bit more, I decided that having them
named ParentToChild and ChildToParent is misleading. If the child is
the child of some sub-partitioned table, then the parent that the map
is talking about is not the partition's parent, but the root
partitioned table. So really, RootToPartition and PartitionToRoot seem
like much more accurate names for the maps.

Makes sense. :)

I made a few other changes along the way; I thought that
ExecFindPartition() would be a good candidate to take on the
responsibility of validating that the partition is a valid target for
INSERTs when it uses a partition out of the subplan_resultrel_hash. I
thought it was better to check this once, in the code path that grabs
the ResultRelInfo out of that hash table, rather than in a code path
that must check whether it's been done already each time we route a
tuple into a partition. It also allowed me to get rid of
ri_PartitionReadyForRouting.

Ah, I too had tried to remove ri_PartitionReadyForRouting, but had to give
up on that idea because I didn't think of moving the steps that need to be
performed before setting it to true into that block of code in
ExecFindPartition.
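
For anyone following along, the check-once approach boils down to this
(condensed from ExecFindPartition() in the attached v15 patch):

    rri = hash_search(proute->subplan_resultrel_hash, &partoid,
                      HASH_FIND, NULL);
    if (rri)
    {
        /* Verify once, here, that this ResultRelInfo allows INSERTs */
        CheckValidResultRel(rri, CMD_INSERT);

        /* ... and set up its routing info right away, so no per-tuple
         * readiness flag needs to be consulted later */
        ExecInitRoutingInfo(mtstate, estate, rri);
    }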

I also moved the call to
ExecInitRoutingInfo() into ExecFindPartition(), which allowed me to
make that function static, which could result in the generation of
slightly more efficient compiled code.

Please find attached the v14 patch.

Rather nicely, git --stat reports a net reduction in code (even with
the 48 lines of new tests included):

11 files changed, 632 insertions(+), 660 deletions(-)

That's nice!

I didn't find anything significant to complain about, except for just
one line:

+ * Initially we must only setup 1 PartitionDispatch object; the one for

setup -> set up

Thanks,
Amit

#53David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#52)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 8 November 2018 at 20:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Actually, as I also proposed upthread, we should move root_tuple_slot from
PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
it's part of the first step described above, which has nothing to do with
partition tuple routing proper. We can make PartitionTupleRouting private
to execPartition.c if we do that.

Okay, makes sense. I've changed things around so that
PartitionTupleRouting is now private to execPartition.c.
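
For anyone skimming, the effect is the usual opaque-struct idiom,
condensed from the attached delta:

    /* src/include/nodes/execnodes.h: only the name is visible ... */
    struct PartitionTupleRouting;

    /* ... so other files hold it purely as a pointer, e.g.: */
    struct PartitionTupleRouting *mt_partition_tuple_routing;

    /* src/backend/executor/execPartition.c: the full definition */
    typedef struct PartitionTupleRouting
    {
        Relation    partition_root;
        PartitionDispatch *partition_dispatch_info;
        int         num_dispatch;
        int         dispatch_allocsize;
        ResultRelInfo **partitions;
        int         num_partitions;
        int         partitions_allocsize;
        HTAB       *subplan_resultrel_hash;
    } PartitionTupleRouting;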

I didn't find anything significant to complain about, except for just
one line:

+ * Initially we must only setup 1 PartitionDispatch object; the one for

setup -> set up

Changed too.

I've attached v15 and a delta from v14 to ease re-review.

I also ran pgindent on this version. That's not part of the delta but
is in the main patch.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v14_v15.diff (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 523eb2f995..dee32e827e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2323,7 +2323,7 @@ CopyFrom(CopyState cstate)
 	TupleTableSlot *myslot;
 	MemoryContext oldcontext = CurrentMemoryContext;
 
-	PartitionTupleRouting *proute = NULL;
+	struct PartitionTupleRouting *proute = NULL;
 	ExprContext *secondaryExprContext = NULL;
 	ErrorContextCallback errcallback;
 	CommandId	mycid = GetCurrentCommandId(true);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 930349ac47..0e0f4e8294 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -33,6 +33,67 @@
 
 #define PARTITION_ROUTING_INITSIZE	8
 
+ /*-----------------------
+  * PartitionTupleRouting - Encapsulates all information required to
+  * route a tuple inserted into a partitioned table to one of its leaf
+  * partitions
+  *
+  * partition_root			The partitioned table that's the target of the
+  *							command.
+  *
+  * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+  *							a pointer to a PartitionDispatch object for every
+  *							partitioned table touched by tuple routing.  The
+  *							entry for the target partitioned table is *always*
+  *							present in the 0th element of this array.  See
+  *							comment for PartitionDispatchData->indexes for
+  *							details on how this array is indexed.
+  *
+  * num_dispatch				The current number of items stored in the
+  *							'partition_dispatch_info' array.  Also serves as
+  *							the index of the next free array element for a
+  *							new PartitionDispatch that needs to be stored.
+  *
+  * dispatch_allocsize		The current allocated size of the
+  *							'partition_dispatch_info' array.
+  *
+  * partitions				Array of 'partitions_allocsize' elements
+  *							containing pointers to the ResultRelInfos of all
+  *							leaf partitions touched by tuple routing.  Some of
+  *							these are pointers to ResultRelInfos which are
+  *							borrowed out of 'subplan_resultrel_hash'.  The
+  *							remainder have been built especially for tuple
+  *							routing.  See comment for
+  *							PartitionDispatchData->indexes for details on how
+  *							this array is indexed.
+  *
+  * num_partitions			The current number of items stored in the
+  *							'partitions' array.  Also serves as the index of
+  *							the next free array element for a new
+  *							ResultRelInfo that needs to be stored.
+  *
+  * partitions_allocsize		The current allocated size of the 'partitions'
+  *							array.
+  *
+  * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+  *							This is used to cache ResultRelInfos from subplans
+  *							of an UPDATE ModifyTable node.  Some of these may
+  *							be useful for tuple routing to save having to build
+  *							duplicates.
+  *-----------------------
+  */
+typedef struct PartitionTupleRouting
+{
+	Relation	partition_root;
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	int			dispatch_allocsize;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	int			partitions_allocsize;
+	HTAB	   *subplan_resultrel_hash;
+} PartitionTupleRouting;
+
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
  * hierarchy required to route a tuple to any of its partitions.  A
@@ -54,8 +115,8 @@
  *				partitioned table then we store the index into the
  *				encapsulating PartitionTupleRouting's
  *				'partition_dispatch_info' array.  An index of -1 means we've
- *				not yet allocated anything in PartitionTupleRouting for the
- *				partition.
+ *				not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -134,7 +195,7 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 * More space can be allocated later if we end up routing tuples to more
 	 * than that many partitions.
 	 *
-	 * Initially we must only setup 1 PartitionDispatch object; the one for
+	 * Initially we must only set up 1 PartitionDispatch object; the one for
 	 * the partitioned table that's the target of the command.  If we must
 	 * route a tuple via some sub-partitioned table, then its
 	 * PartitionDispatch is only built the first time it's required.
@@ -168,20 +229,11 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 * Every time a tuple is routed to a partition that we've yet to set the
 	 * ResultRelInfo for, before we go to the trouble of making one, we check
 	 * for a pre-made one in the hash table.
-	 *
-	 * Also, we'll need a slot that will transiently store the tuple being
-	 * routed using the root parent's rowtype.
 	 */
 	if (node && node->operation == CMD_UPDATE)
-	{
 		ExecHashSubPlanResultRelsByOid(mtstate, proute);
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
 	else
-	{
 		proute->subplan_resultrel_hash = NULL;
-		proute->root_tuple_slot = NULL;
-	}
 
 	return proute;
 }
@@ -1060,10 +1112,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
-
-	/* Release the standalone partition tuple descriptors, if any */
-	if (proute->root_tuple_slot)
-		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
 }
 
 /* ----------------
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index aafeea3a8c..0f704308be 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -64,7 +64,7 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 TupleTableSlot **returning);
 static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						EState *estate,
-						PartitionTupleRouting *proute,
+						struct PartitionTupleRouting *proute,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
@@ -1068,7 +1068,7 @@ lreplace:;
 			bool		tuple_deleted;
 			TupleTableSlot *ret_slot;
 			TupleTableSlot *epqslot = NULL;
-			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			struct PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
 			int			map_index;
 			TupleConversionMap *tupconv_map;
 
@@ -1162,7 +1162,8 @@ lreplace:;
 			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
 			if (tupconv_map != NULL)
 				slot = execute_attr_map_slot(tupconv_map->attrMap,
-											 slot, proute->root_tuple_slot);
+											 slot,
+											 mtstate->mt_root_tuple_slot);
 
 			/*
 			 * Prepare for tuple routing, making it look like we're inserting
@@ -1692,7 +1693,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 static TupleTableSlot *
 ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						EState *estate,
-						PartitionTupleRouting *proute,
+						struct PartitionTupleRouting *proute,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot)
 {
@@ -1845,7 +1846,7 @@ static TupleTableSlot *
 ExecModifyTable(PlanState *pstate)
 {
 	ModifyTableState *node = castNode(ModifyTableState, pstate);
-	PartitionTupleRouting *proute = node->mt_partition_tuple_routing;
+	struct PartitionTupleRouting *proute = node->mt_partition_tuple_routing;
 	EState	   *estate = node->ps.state;
 	CmdType		operation = node->operation;
 	ResultRelInfo *saved_resultRelInfo;
@@ -2254,10 +2255,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partitioned
 	 * table descriptor.  In such a case we need to convert tuples to the root
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.  Skip this setup if it's not a partition key update.
+	 * from the root.  We'll also need a slot to store these converted tuples.
+	 * We can skip this setup if it's not a partition key update.
 	 */
 	if (update_tuple_routing_needed)
+	{
 		ExecSetupChildParentMapForSubplan(mtstate);
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
 
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
@@ -2597,9 +2602,16 @@ ExecEndModifyTable(ModifyTableState *node)
 														   resultRelInfo);
 	}
 
-	/* Close all the partitioned tables, leaf partitions, and their indices */
+	/*
+	 * Close all the partitioned tables, leaf partitions, and their indices
+	 * and release the slot used for tuple routing, if set.
+	 */
 	if (node->mt_partition_tuple_routing)
+	{
 		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
+		if (node->mt_root_tuple_slot)
+			ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 78b9ac85c2..0123a38b59 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -48,76 +48,6 @@ typedef struct PartitionRoutingInfo
 	TupleTableSlot		   *pi_PartitionTupleSlot;
 } PartitionRoutingInfo;
 
-/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to
- * route a tuple inserted into a partitioned table to one of its leaf
- * partitions
- *
- * partition_root			The partitioned table that's the target of the
- *							command.
- *
- * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
- *							a pointer to a PartitionDispatch object for every
- *							partitioned table touched by tuple routing.  The
- *							entry for the target partitioned table is *always*
- *							present in the 0th element of this array.  See
- *							comment for PartitionDispatchData->indexes for
- *							details on how this array is indexed.
- *
- * num_dispatch				The current number of items stored in the
- *							'partition_dispatch_info' array.  Also serves as
- *							the index of the next free array element for a
- *							new PartitionDispatch that needs to be stored.
- *
- * dispatch_allocsize		The current allocated size of the
- *							'partition_dispatch_info' array.
- *
- * partitions				Array of 'partitions_allocsize' elements
- *							containing pointers to the ResultRelInfos of all
- *							leaf partitions touched by tuple routing.  Some of
- *							these are pointers to ResultRelInfos which are
- *							borrowed out of 'subplan_resultrel_hash'.  The
- *							remainder have been built especially for tuple
- *							routing.  See comment for
- *							PartitionDispatchData->indexes for details on how
- *							this array is indexed.
- *
- * num_partitions			The current number of items stored in the
- *							'partitions' array.  Also serves as the index of
- *							the next free array element for a new
- *							ResultRelInfo that needs to be stored.
- *
- * partitions_allocsize		The current allocated size of the 'partitions'
- *							array.
- * Note: The following fields are used only when UPDATE ends up needing to
- * do tuple routing.
- *
- * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
- *							This is used to cache ResultRelInfos from subplans
- *							of a ModifyTable node.  Some of these may be
- *							useful for tuple routing to save having to build
- *							duplicates.
- *
- * root_tuple_slot			During UPDATE tuple routing, this tuple slot is
- *							used to transiently store a tuple using the root
- *							table's rowtype after converting it from the
- *							tuple's source leaf partition's rowtype.  That is,
- *							if the leaf partition's rowtype differs.
- *-----------------------
- */
-typedef struct PartitionTupleRouting
-{
-	Relation	partition_root;
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;
-	int			dispatch_allocsize;
-	ResultRelInfo **partitions;
-	int			num_partitions;
-	int			partitions_allocsize;
-	TupleTableSlot *root_tuple_slot;
-	HTAB	   *subplan_resultrel_hash;
-} PartitionTupleRouting;
-
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
  * of partitions.  For a multilevel partitioned table, we have one of these
@@ -204,15 +134,15 @@ typedef struct PartitionPruneState
 	PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
 } PartitionPruneState;
 
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
+extern struct PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
 extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
 				  ResultRelInfo *rootResultRelInfo,
-				  PartitionTupleRouting *proute,
+				  struct PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
-						PartitionTupleRouting *proute);
+						struct PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
 							  PartitionPruneInfo *partitionpruneinfo);
 extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 8efc80f710..55c5e700b5 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -34,6 +34,7 @@
 
 struct PlanState;				/* forward references in this file */
 struct PartitionRoutingInfo;
+struct PartitionTupleRouting;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -1073,6 +1074,12 @@ typedef struct ModifyTableState
 	TupleTableSlot *mt_existing;	/* slot to store existing target tuple in */
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
+
+	/*
+	 * Slot for storing tuples in the root partitioned table's rowtype during
+	 * an UPDATE of a partitioned table.
+	 */
+	TupleTableSlot *mt_root_tuple_slot;
 
 	/* Tuple-routing support info */
 	struct PartitionTupleRouting *mt_partition_tuple_routing;
v15-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From 2431c0eaf6493f76225c4add01d8f5ba19ef2a1e Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v15] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting.  The
initial setup now does far less work, pushing more of it out to when
partitions first receive tuples.  PartitionDispatchData structs for
sub-partitioned tables are only created when a tuple gets routed through
them.  The possibly large arrays in the
PartitionTupleRouting struct have largely been removed.  The partitions[]
array remains but now never contains any NULL gaps.  Previously the NULLs
had to be skipped during ExecCleanupTupleRouting(), which could add a
large overhead to the cleanup when the number of partitions was large.
The partitions[] array is allocated small to start with and only enlarged
when we route tuples to enough partitions that it runs out of space. This
allows us to keep simple single-row partition INSERTs running quickly.

The arrays in PartitionTupleRouting which stored the tuple translation
maps have now been removed.  These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.

In passing, also rename the tuple translation maps from ParentToChild
and ChildToParent to RootToPartition and PartitionToRoot.  The old
names could mislead you into thinking that a partition of some
sub-partitioned table would translate to the rowtype of the
sub-partitioned table rather than that of the root partitioned table.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  88 +--
 src/backend/executor/execMain.c               |   2 +-
 src/backend/executor/execPartition.c          | 876 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 171 ++---
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 106 +---
 src/include/nodes/execnodes.h                 |  12 +-
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 11 files changed, 646 insertions(+), 677 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..dee32e827e 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2316,13 +2316,14 @@ CopyFrom(CopyState cstate)
 	bool	   *nulls;
 	ResultRelInfo *resultRelInfo;
 	ResultRelInfo *target_resultRelInfo;
+	ResultRelInfo *prevResultRelInfo = NULL;
 	EState	   *estate = CreateExecutorState(); /* for ExecConstraints() */
 	ModifyTableState *mtstate;
 	ExprContext *econtext;
 	TupleTableSlot *myslot;
 	MemoryContext oldcontext = CurrentMemoryContext;
 
-	PartitionTupleRouting *proute = NULL;
+	struct PartitionTupleRouting *proute = NULL;
 	ExprContext *secondaryExprContext = NULL;
 	ErrorContextCallback errcallback;
 	CommandId	mycid = GetCurrentCommandId(true);
@@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate)
 	CopyInsertMethod insertMethod;
 	uint64		processed = 0;
 	int			nBufferedTuples = 0;
-	int			prev_leaf_part_index = -1;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition() below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2692,25 +2685,17 @@ CopyFrom(CopyState cstate)
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
-			int			leaf_part_index;
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found or
+			 * if the found partition is not suitable for INSERTs.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < proute->num_partitions);
-
-			if (prev_leaf_part_index != leaf_part_index)
+			resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo,
+											  proute, slot, estate);
+
+			if (prevResultRelInfo != resultRelInfo)
 			{
 				/* Check if we can multi-insert into this partition */
 				if (insertMethod == CIM_MULTI_CONDITIONAL)
@@ -2723,12 +2708,9 @@ CopyFrom(CopyState cstate)
 					if (nBufferedTuples > 0)
 					{
 						ExprContext *swapcontext;
-						ResultRelInfo *presultRelInfo;
-
-						presultRelInfo = proute->partitions[prev_leaf_part_index];
 
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-											presultRelInfo, myslot, bistate,
+											prevResultRelInfo, myslot, bistate,
 											nBufferedTuples, bufferedTuples,
 											firstBufferedLineNo);
 						nBufferedTuples = 0;
@@ -2785,21 +2767,6 @@ CopyFrom(CopyState cstate)
 					}
 				}
 
-				/*
-				 * Overwrite resultRelInfo with the corresponding partition's
-				 * one.
-				 */
-				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
-
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 											  resultRelInfo->ri_TrigDesc->trig_insert_before_row);
@@ -2825,7 +2792,7 @@ CopyFrom(CopyState cstate)
 				 * buffer when the partition being inserted into changes.
 				 */
 				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
+				prevResultRelInfo = resultRelInfo;
 			}
 
 			/*
@@ -2835,7 +2802,7 @@ CopyFrom(CopyState cstate)
 
 			/*
 			 * If we're capturing transition tuples, we might need to convert
-			 * from the partition rowtype to parent rowtype.
+			 * from the partition rowtype to root rowtype.
 			 */
 			if (cstate->transition_capture != NULL)
 			{
@@ -2848,8 +2815,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						resultRelInfo->ri_PartitionInfo->pi_PartitionToRootMap;
 				}
 				else
 				{
@@ -2863,18 +2829,18 @@ CopyFrom(CopyState cstate)
 			}
 
 			/*
-			 * We might need to convert from the parent rowtype to the
-			 * partition rowtype.
+			 * We might need to convert from the root rowtype to the partition
+			 * rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = resultRelInfo->ri_PartitionInfo->pi_RootToPartitionMap;
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
 				MemoryContext oldcontext;
 
-				Assert(proute->partition_tuple_slots != NULL &&
-					   proute->partition_tuple_slots[leaf_part_index] != NULL);
-				new_slot = proute->partition_tuple_slots[leaf_part_index];
+				new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot;
+				Assert(new_slot != NULL);
+
 				slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 
 				/*
@@ -3019,12 +2985,8 @@ CopyFrom(CopyState cstate)
 	{
 		if (insertMethod == CIM_MULTI_CONDITIONAL)
 		{
-			ResultRelInfo *presultRelInfo;
-
-			presultRelInfo = proute->partitions[prev_leaf_part_index];
-
 			CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-								presultRelInfo, myslot, bistate,
+								prevResultRelInfo, myslot, bistate,
 								nBufferedTuples, bufferedTuples,
 								firstBufferedLineNo);
 		}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ba156f8c5f..32d2461528 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1343,7 +1343,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 
 	resultRelInfo->ri_PartitionCheck = partition_check;
 	resultRelInfo->ri_PartitionRoot = partition_root;
-	resultRelInfo->ri_PartitionReadyForRouting = false;
+	resultRelInfo->ri_PartitionInfo = NULL; /* May be set later */
 }
 
 /*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1e72e9fb3f..962db6d7f0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,10 +31,74 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
+
+ /*-----------------------
+  * PartitionTupleRouting - Encapsulates all information required to
+  * route a tuple inserted into a partitioned table to one of its leaf
+  * partitions
+  *
+  * partition_root			The partitioned table that's the target of the
+  *							command.
+  *
+  * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+  *							a pointer to a PartitionDispatch object for every
+  *							partitioned table touched by tuple routing.  The
+  *							entry for the target partitioned table is *always*
+  *							present in the 0th element of this array.  See
+  *							comment for PartitionDispatchData->indexes for
+  *							details on how this array is indexed.
+  *
+  * num_dispatch				The current number of items stored in the
+  *							'partition_dispatch_info' array.  Also serves as
+  *							the index of the next free array element for a
+  *							new PartitionDispatch that needs to be stored.
+  *
+  * dispatch_allocsize		The current allocated size of the
+  *							'partition_dispatch_info' array.
+  *
+  * partitions				Array of 'partitions_allocsize' elements
+  *							containing pointers to the ResultRelInfos of all
+  *							leaf partitions touched by tuple routing.  Some of
+  *							these are pointers to ResultRelInfos which are
+  *							borrowed out of 'subplan_resultrel_hash'.  The
+  *							remainder have been built especially for tuple
+  *							routing.  See comment for
+  *							PartitionDispatchData->indexes for details on how
+  *							this array is indexed.
+  *
+  * num_partitions			The current number of items stored in the
+  *							'partitions' array.  Also serves as the index of
+  *							the next free array element for a new
+  *							ResultRelInfo that needs to be stored.
+  *
+  * partitions_allocsize		The current allocated size of the 'partitions'
+  *							array.
+  *
+  * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+  *							This is used to cache ResultRelInfos from subplans
+  *							of an UPDATE ModifyTable node.  Some of these may
+  *							be useful for tuple routing to save having to build
+  *							duplicates.
+  *-----------------------
+  */
+typedef struct PartitionTupleRouting
+{
+	Relation	partition_root;
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	int			dispatch_allocsize;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	int			partitions_allocsize;
+	HTAB	   *subplan_resultrel_hash;
+} PartitionTupleRouting;
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * hierarchy required to route a tuple to any of its partitions.  A
+ * PartitionDispatch is always encapsulated inside a PartitionTupleRouting
+ * struct and stored inside its 'partition_dispatch_info' array.
  *
  *	reldesc		Relation descriptor of the table
  *	key			Partition key information of the table
@@ -45,9 +109,14 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For leaf partitions the
+ *				index into the encapsulating PartitionTupleRouting's
+ *				'partitions' array is stored.  When the partition is itself a
+ *				partitioned table then we store the index into the
+ *				encapsulating PartitionTupleRouting's
+ *				'partition_dispatch_info' array.  An index of -1 means we've
+ *				not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +127,23 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecCheckPartitionArraySpace(PartitionTupleRouting *proute);
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static void ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					ResultRelInfo *partRelInfo);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +170,102 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition().  The actual ResultRelInfo for a partition is only
+ * allocated when the first tuple is routed there.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
 
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
+	 * demand, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective arrays.
+	 * More space can be allocated later if we end up routing tuples to more
+	 * than that many partitions.
+	 *
+	 * Initially we must only set up 1 PartitionDispatch object; the one for
+	 * the partitioned table that's the target of the command.  If we must
+	 * route a tuple via some sub-partitioned table, then its
+	 * PartitionDispatch is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
+		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			update_rri_index++;
-		}
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+	else
+		proute->subplan_resultrel_hash = NULL;
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find and return the ResultRelInfo for the leaf
+ * partition for the tuple contained in *slot.
+ *
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.  When reusing a
+ * ResultRelInfo from the mtstate we verify that the relation is a valid
+ * target for INSERTs and then set up a PartitionRoutingInfo for it.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message.  An error may also be raised if the found target partition is
+ * not a valid target for an INSERT.
  */
-int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ResultRelInfo *
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -228,17 +278,18 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate, true);
+	if (rootResultRelInfo->ri_PartitionCheck)
+		ExecPartitionCheck(rootResultRelInfo, slot, estate, true);
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +311,235 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			ResultRelInfo *rri;
+
+			/*
+			 * Look to see if we've already got a ResultRelInfo for this
+			 * partition.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				rri = proute->partitions[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				int			rri_index = -1;
+
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or build a
+				 * new one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						/* Found one! */
+
+						/* Verify this ResultRelInfo allows INSERTs */
+						CheckValidResultRel(rri, CMD_INSERT);
+
+						/* This shouldn't have been set up yet */
+						Assert(rri->ri_PartitionInfo == NULL);
+
+						/* Setup the PartitionRoutingInfo for it */
+						ExecInitRoutingInfo(mtstate, estate, rri);
+
+						rri_index = proute->num_partitions++;
+						dispatch->indexes[partidx] = rri_index;
+
+						ExecCheckPartitionArraySpace(proute);
+
+						/*
+						 * Store it in the partitions array so we don't have
+						 * to look it up again.
+						 */
+						proute->partitions[rri_index] = rri;
+					}
+				}
+
+				/* If we didn't find one above, we need to create a new one. */
+				if (rri_index < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					rri = ExecInitPartitionInfo(mtstate, rootResultRelInfo,
+												proute, estate,
+												dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return rri;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch.
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also set each subplan ResultRelInfo's
+ *		ri_PartitionRoot field.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* A partition was not found. */
-	if (result < 0)
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
-	}
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
 
-	return result;
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
+	}
+}
+
+/*
+ * ExecCheckPartitionArraySpace
+ *		Ensure there's enough space in the 'partitions' array of 'proute'
+ */
+static void
+ExecCheckPartitionArraySpace(PartitionTupleRouting *proute)
+{
+	if (proute->num_partitions >= proute->partitions_allocsize)
+	{
+		proute->partitions_allocsize *= 2;
+		proute->partitions = (ResultRelInfo **)
+			repalloc(proute->partitions, sizeof(ResultRelInfo *) *
+					 proute->partitions_allocsize);
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
+ *		and store it in the next empty slot in proute's partitions array.
  *
  * Returns the ResultRelInfo
  */
-ResultRelInfo *
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +715,22 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	ExecCheckPartitionArraySpace(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, leaf_part_rri);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +743,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +756,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +770,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +780,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +794,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,9 +885,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
 	return leaf_part_rri;
@@ -689,27 +892,29 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 
 /*
  * ExecInitRoutingInfo
- *		Set up information needed for routing tuples to a leaf partition
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format.
  */
-void
+static void
 ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx)
+					ResultRelInfo *partRelInfo)
 {
 	MemoryContext oldContext;
+	PartitionRoutingInfo *partrouteinfo;
 
 	/*
 	 * Switch into per-query memory context.
 	 */
 	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
 
+	partrouteinfo = palloc(sizeof(PartitionRoutingInfo));
+
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
+	partrouteinfo->pi_RootToPartitionMap =
 		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
 							   RelationGetDescr(partRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
@@ -720,28 +925,36 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * for various operations that are applied to tuples after routing, such
 	 * as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (partrouteinfo->pi_RootToPartitionMap != NULL)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
-		if (proute->partition_tuple_slots == NULL)
-			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
-
 		/*
 		 * Initialize the slot itself setting its descriptor to this
 		 * partition's TupleDesc; TupleDesc reference will be released at the
 		 * end of the command.
 		 */
-		proute->partition_tuple_slots[partidx] =
+		partrouteinfo->pi_PartitionTupleSlot =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
 	}
+	else
+		partrouteinfo->pi_PartitionTupleSlot = NULL;
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from the partition's rowtype to the root partitioned table's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		partrouteinfo->pi_PartitionToRootMap =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+	}
+	else
+		partrouteinfo->pi_PartitionToRootMap = NULL;
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -753,71 +966,92 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	MemoryContextSwitchTo(oldContext);
 
-	partRelInfo->ri_PartitionReadyForRouting = true;
+	partRelInfo->ri_PartitionInfo = partrouteinfo;
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record that index in element 'partidx' of the
+ *		'parent_pd' indexes[] array so that we can properly retrieve the
+ *		newly created PartitionDispatch later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches.
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1064,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -852,179 +1086,31 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
-		/* Allow any FDWs to shut down if they've been exercised */
-		if (resultRelInfo->ri_PartitionReadyForRouting &&
-			resultRelInfo->ri_FdwRoutine != NULL &&
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
-														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
-		}
-
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
-
-	/* Release the standalone partition tuple descriptors, if any */
-	if (proute->root_tuple_slot)
-		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
-}
-
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
+			Oid			partoid;
+			bool		found;
 
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
 
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
+		/* Allow any FDWs to shut down if they've been exercised */
+		if (resultRelInfo->ri_FdwRoutine != NULL &&
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
+														   resultRelInfo);
+
+		ExecCloseIndices(resultRelInfo);
+		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 }
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 528f58717e..3d023b458f 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -64,11 +64,10 @@ static bool ExecOnConflictUpdate(ModifyTableState *mtstate,
 					 TupleTableSlot **returning);
 static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						EState *estate,
-						PartitionTupleRouting *proute,
+						struct PartitionTupleRouting *proute,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1069,7 +1068,7 @@ lreplace:;
 			bool		tuple_deleted;
 			TupleTableSlot *ret_slot;
 			TupleTableSlot *epqslot = NULL;
-			PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+			struct PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
 			int			map_index;
 			TupleConversionMap *tupconv_map;
 
@@ -1163,7 +1162,8 @@ lreplace:;
 			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
 			if (tupconv_map != NULL)
 				slot = execute_attr_map_slot(tupconv_map->attrMap,
-											 slot, proute->root_tuple_slot);
+											 slot,
+											 mtstate->mt_root_tuple_slot);
 
 			/*
 			 * Prepare for tuple routing, making it look like we're inserting
@@ -1665,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1693,57 +1693,26 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 static TupleTableSlot *
 ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						EState *estate,
-						PartitionTupleRouting *proute,
+						struct PartitionTupleRouting *proute,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot)
 {
 	ModifyTable *node;
-	int			partidx;
 	ResultRelInfo *partrel;
+	PartitionRoutingInfo *partrouteinfo;
 	HeapTuple	tuple;
 	TupleConversionMap *map;
 
 	/*
-	 * Determine the target partition.  If ExecFindPartition does not find a
-	 * partition after all, it doesn't return here; otherwise, the returned
-	 * value is to be used as an index into the arrays for the ResultRelInfo
-	 * and TupleConversionMap for the partition.
-	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
-	Assert(partidx >= 0 && partidx < proute->num_partitions);
-
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
+	 * Look up the target partition's ResultRelInfo.  If ExecFindPartition
+	 * does not find a valid partition for the tuple in 'slot' then an error
+	 * is raised.  An error may also be raised if the found partition is not
+	 * a valid target for INSERTs.  That check is required since an UPDATE
+	 * that moves a tuple to another partition becomes a DELETE+INSERT.
 	 */
-	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
-
-	/*
-	 * Check whether the partition is routable if we didn't yet
-	 *
-	 * Note: an UPDATE of a partition key invokes an INSERT that moves the
-	 * tuple to a new partition.  This check would be applied to a subplan
-	 * partition of such an UPDATE that is chosen as the partition to route
-	 * the tuple to.  The reason we do this check here rather than in
-	 * ExecSetupPartitionTupleRouting is to avoid aborting such an UPDATE
-	 * unnecessarily due to non-routable subplan partitions that may not be
-	 * chosen for update tuple movement after all.
-	 */
-	if (!partrel->ri_PartitionReadyForRouting)
-	{
-		/* Verify the partition is a valid target for INSERT. */
-		CheckValidResultRel(partrel, CMD_INSERT);
-
-		/* Set up information needed for routing tuples to the partition. */
-		ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
-	}
+	partrel = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
+	partrouteinfo = partrel->ri_PartitionInfo;
+	Assert(partrouteinfo != NULL);
 
 	/*
 	 * Make it look like we are inserting into the partition.
@@ -1755,7 +1724,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 	/*
 	 * If we're capturing transition tuples, we might need to convert from the
-	 * partition rowtype to parent rowtype.
+	 * partition rowtype to the root partitioned table's rowtype.
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
@@ -1768,7 +1737,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				partrouteinfo->pi_PartitionToRootMap;
 		}
 		else
 		{
@@ -1783,20 +1752,17 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			partrouteinfo->pi_PartitionToRootMap;
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = partrouteinfo->pi_RootToPartitionMap;
 	if (map != NULL)
 	{
-		TupleTableSlot *new_slot;
+		TupleTableSlot *new_slot = partrouteinfo->pi_PartitionTupleSlot;
 
-		Assert(proute->partition_tuple_slots != NULL &&
-			   proute->partition_tuple_slots[partidx] != NULL);
-		new_slot = proute->partition_tuple_slots[partidx];
 		slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 	}
 
@@ -1834,17 +1800,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1821,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
@@ -1952,7 +1846,7 @@ static TupleTableSlot *
 ExecModifyTable(PlanState *pstate)
 {
 	ModifyTableState *node = castNode(ModifyTableState, pstate);
-	PartitionTupleRouting *proute = node->mt_partition_tuple_routing;
+	struct PartitionTupleRouting *proute = node->mt_partition_tuple_routing;
 	EState	   *estate = node->ps.state;
 	CmdType		operation = node->operation;
 	ResultRelInfo *saved_resultRelInfo;
@@ -2361,10 +2255,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partitioned
 	 * table descriptor.  In such a case we need to convert tuples to the root
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.  Skip this setup if it's not a partition key update.
+	 * from the root.  We'll also need a slot to store these converted tuples.
+	 * We can skip this setup if it's not a partition key update.
 	 */
 	if (update_tuple_routing_needed)
+	{
 		ExecSetupChildParentMapForSubplan(mtstate);
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
 
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
@@ -2704,9 +2602,16 @@ ExecEndModifyTable(ModifyTableState *node)
 														   resultRelInfo);
 	}
 
-	/* Close all the partitioned tables, leaf partitions, and their indices */
+	/*
+	 * Close all the partitioned tables, leaf partitions, and their indices
+	 * and release the slot used for tuple routing, if set.
+	 */
 	if (node->mt_partition_tuple_routing)
+	{
 		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
+		if (node->mt_root_tuple_slot)
+			ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..6a3c04b1f4 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -21,71 +21,32 @@
 /* See execPartition.c for the definition. */
 typedef struct PartitionDispatchData *PartitionDispatch;
 
-/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
+/*
+ * PartitionRoutingInfo
  *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
- *-----------------------
+ * Additional result relation information specific to routing tuples to a
+ * table partition.
  */
-typedef struct PartitionTupleRouting
+typedef struct PartitionRoutingInfo
 {
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;
-	Oid		   *partition_oids;
-	ResultRelInfo **partitions;
-	int			num_partitions;
-	TupleConversionMap **parent_child_tupconv_maps;
-	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot **partition_tuple_slots;
-	TupleTableSlot *root_tuple_slot;
-} PartitionTupleRouting;
+	/*
+	 * Map for converting tuples in root partitioned table format into
+	 * partition format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_RootToPartitionMap;
+
+	/*
+	 * Map for converting tuples in partition format into the root partitioned
+	 * table format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_PartitionToRootMap;
+
+	/*
+	 * Slot to store tuples in partition format, or NULL when no translation
+	 * is required between root and partition.
+	 */
+	TupleTableSlot *pi_PartitionTupleSlot;
+} PartitionRoutingInfo;
 
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
@@ -173,26 +134,15 @@ typedef struct PartitionPruneState
 	PartitionPruningData *partprunedata[FLEXIBLE_ARRAY_MEMBER];
 } PartitionPruneState;
 
-extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
+extern struct PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  struct PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
-						PartitionTupleRouting *proute);
+						struct PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
 							  PartitionPruneInfo *partitionpruneinfo);
 extern Bitmapset *ExecFindMatchingSubPlans(PartitionPruneState *prunestate);
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 880a03e4e4..a6b0bf52c7 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -33,6 +33,8 @@
 
 
 struct PlanState;				/* forward references in this file */
+struct PartitionRoutingInfo;
+struct PartitionTupleRouting;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -469,8 +471,8 @@ typedef struct ResultRelInfo
 	/* relation descriptor for root partitioned table */
 	Relation	ri_PartitionRoot;
 
-	/* true if ready for tuple routing */
-	bool		ri_PartitionReadyForRouting;
+	/* Additional information that's specific to partition tuple routing */
+	struct PartitionRoutingInfo *ri_PartitionInfo;
 } ResultRelInfo;
 
 /* ----------------
@@ -1073,6 +1075,12 @@ typedef struct ModifyTableState
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
 
+	/*
+	 * Slot for storing tuples in the root partitioned table's rowtype during
+	 * an UPDATE of a partitioned table.
+	 */
+	TupleTableSlot *mt_root_tuple_slot;
+
 	/* Tuple-routing support info */
 	struct PartitionTupleRouting *mt_partition_tuple_routing;
 
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

#54Robert Haas
robertmhaas@gmail.com
In reply to: David Rowley (#53)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Thu, Nov 8, 2018 at 6:28 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

I've attached v15 and a delta from v14 to ease re-review.

I also ran pgindent on this version. That's not part of the delta but
is in the main patch.

Did you notice /messages/by-id/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#55David Rowley
david.rowley@2ndquadrant.com
In reply to: David Rowley (#53)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 9 November 2018 at 00:28, David Rowley <david.rowley@2ndquadrant.com> wrote:

I've attached v15 and a delta from v14 to ease re-review.

I just revived the 0002 patch, which is still just for testing at this
stage. There was mention over on [1] about removing the
find_all_inheritors() call.

Also some benchmarks from v14 with 0001+0002.

Setup:

DROP TABLE hashp;
CREATE TABLE hashp (a INT) PARTITION BY HASH (a);
SELECT 'CREATE TABLE hashp'||x::Text || ' PARTITION OF hashp FOR
VALUES WITH (modulus 10000, remainder ' || x::text || ');' from
generate_Series(0,9999) x;
\gexec
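
The INSERT script itself isn't included in this message; presumably it
is a single-row INSERT per transaction against the partitioned table,
along these lines (a hypothetical sketch -- the real script may
differ):

\set p random(0, 99999)
insert into hashp values (:p);

The figures below are presumably pgbench tps, as with the earlier runs
in this thread.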

(0 partitions is a non-partitioned table)

fsync=off

Partitions  Patched  Unpatched
         0    23672      23785
        10    22794      18385
       100    22392       8541
      1000    22419       1159
     10000    22195        101

[1]: /messages/by-id/CA+TgmoZGJsy-nRFnzurhZQJtHdDh5fzJKvbvhS0byN6_46pB9Q@mail.gmail.com

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v14-0002-Unsafe-locking-reduction-for-partitioned-INSERT-.patch (application/octet-stream)
From f649fc914ea0e2bc15e2f1387b4c56df9e27bec6 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 9 Nov 2018 10:20:14 +1300
Subject: [PATCH v14 2/2] Unsafe locking reduction for partitioned
 INSERT/UPDATEs

For performance demonstration purposes only.
---
 src/backend/executor/execPartition.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 962db6d7f0..f37371f561 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,9 +167,6 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * tuple routing for partitioned tables, encapsulates it in
  * PartitionTupleRouting, and returns it.
  *
- * Note that all the relations in the partition tree are locked using the
- * RowExclusiveLock mode upon return from this function.
- *
  * Callers must use the returned PartitionTupleRouting during calls to
  * ExecFindPartition().  The actual ResultRelInfo for a partition is only
  * allocated when the first tuple is routed there.
@@ -180,9 +177,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	PartitionTupleRouting *proute;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/* Lock all the partitions. */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
 	/*
 	 * Here we attempt to expend as little effort as possible in setting up
 	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
@@ -535,11 +529,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	bool		found_whole_row;
 	int			part_result_rel_index;
 
-	/*
-	 * We locked all the partitions in ExecSetupPartitionTupleRouting
-	 * including the leaf partitions.
-	 */
-	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -987,7 +977,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	int			dispatchidx;
 
 	if (partoid != RelationGetRelid(proute->partition_root))
-		rel = heap_open(partoid, NoLock);
+		rel = heap_open(partoid, RowExclusiveLock);
 	else
 		rel = proute->partition_root;
 	partdesc = RelationGetPartitionDesc(rel);
-- 
2.16.2.windows.1

#56Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#53)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/08 20:28, David Rowley wrote:

On 8 November 2018 at 20:15, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

Actually, as I also proposed upthread, we should move root_tuple_slot from
PartitionTupleRouting to ModifyTableState as mt_root_tuple_slot, because
it's part of the first step described above that has nothing to do with
partition tuple routing proper. We can make PartitionTupleRouting private
to execPartition.c if we do that.

okay. Makes sense. I've changed things around so that PartitionTupleRouting
is now private to execPartition.c.

Thank you. I have a comment regarding how you chose to make
PartitionTupleRouting private.

Using the v14_to_v15 diff, I could quickly see that there are many diffs
changing PartitionTupleRouting to struct PartitionTupleRouting, but they
would be unnecessary if you had added the following in execPartition.h, as
my patch upthread had done.

-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;
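
(For anyone unfamiliar with the idiom: a typedef to an incomplete
struct type lets every other translation unit declare and pass
pointers to it without ever seeing its members, so the full definition
can stay private to execPartition.c.  A minimal standalone sketch,
with hypothetical names:

/* router.h */
typedef struct RouteState RouteState;	/* incomplete type */
extern RouteState *route_setup(int nparts);
extern void route_free(RouteState *rs);

/* router.c -- the only file that may look inside RouteState */
#include <stdlib.h>
#include "router.h"

struct RouteState
{
	int			nparts;		/* number of partitions routed to */
};

RouteState *
route_setup(int nparts)
{
	RouteState *rs = (RouteState *) malloc(sizeof(RouteState));

	rs->nparts = nparts;
	return rs;
}

void
route_free(RouteState *rs)
{
	free(rs);
}

Files that include only router.h can hold and pass RouteState
pointers but cannot dereference them, which is exactly the property
wanted for PartitionTupleRouting here.)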

Thanks,
Amit

#57David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#56)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 9 November 2018 at 19:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:

I have a comment regarding how you chose to make
PartitionTupleRouting private.

Using the v14_to_v15 diff, I could quickly see that there are many diffs
changing PartitionTupleRouting to struct PartitionTupleRouting, but they
would be unnecessary if you had added the following in execPartition.h, as
my patch upthread had done.

-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;

Okay, done that way. v16 attached.

The 0002 patch is included again, this time with a new proposed commit
message. There was some discussion over on [1] where nobody seemed to
have any concerns about delaying the locking until we route the first
tuple to the partition.

[1]: /messages/by-id/25C1C6B2E7BE044889E4FE8643A58BA963B5796B@G01JPEXMBKW03
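
Condensed from the 0002 diff, the change amounts to dropping the
up-front pass over the whole partition tree and taking each lock at
first use instead; roughly:

/* Before: ExecSetupPartitionTupleRouting locks every partition up front */
(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
/* ... which lets later opens pass NoLock */
partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);

/* After: no up-front pass; a partition is locked when the first tuple
 * is routed to it */
partrel = heap_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);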

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v16-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (application/octet-stream)
From 71d8be052db94800e9dbbfabbbd4679e9e98a162 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v16 1/2] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting.  The
setup now does far less work up front and pushes more of it out to the
point where partitions actually receive tuples.
PartitionDispatchData structs for sub-partitioned tables are only created
when a tuple gets routed through them. The possibly large arrays in the
PartitionTupleRouting struct have largely been removed.  The partitions[]
array remains but now never contains any NULL gaps.  Previously the NULLs
had to be skipped during ExecCleanupTupleRouting(), which could add a
large overhead to the cleanup when the number of partitions was large.
The partitions[] array is allocated small to start with and only enlarged
when we route tuples to enough partitions that it runs out of space. This
allows us to keep simple single-row partition INSERTs running quickly.

The arrays in PartitionTupleRouting which stored the tuple translation
maps have now been removed.  These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.

In passing, also rename the tuple translation maps from ParentToChild
and ChildToParent to RootToPartition and PartitionToRoot.  The old
names could mislead one into thinking that a partition of some sub-partitioned
table would translate to the rowtype of the sub-partitioned table rather
than the root partitioned table.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c                   |  86 +--
 src/backend/executor/execMain.c               |   2 +-
 src/backend/executor/execPartition.c          | 876 ++++++++++++++------------
 src/backend/executor/nodeModifyTable.c        | 163 +----
 src/backend/optimizer/prep/prepunion.c        |   3 -
 src/backend/utils/cache/partcache.c           |  11 +-
 src/include/catalog/partition.h               |   6 +-
 src/include/executor/execPartition.h          | 105 +--
 src/include/nodes/execnodes.h                 |  12 +-
 src/test/regress/expected/insert_conflict.out |  22 +
 src/test/regress/sql/insert_conflict.sql      |  26 +
 11 files changed, 641 insertions(+), 671 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..523eb2f995 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate)
 	bool	   *nulls;
 	ResultRelInfo *resultRelInfo;
 	ResultRelInfo *target_resultRelInfo;
+	ResultRelInfo *prevResultRelInfo = NULL;
 	EState	   *estate = CreateExecutorState(); /* for ExecConstraints() */
 	ModifyTableState *mtstate;
 	ExprContext *econtext;
@@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate)
 	CopyInsertMethod insertMethod;
 	uint64		processed = 0;
 	int			nBufferedTuples = 0;
-	int			prev_leaf_part_index = -1;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition() below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2692,25 +2685,17 @@ CopyFrom(CopyState cstate)
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
-			int			leaf_part_index;
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found or
+			 * if the found partition is not suitable for INSERTs.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < proute->num_partitions);
-
-			if (prev_leaf_part_index != leaf_part_index)
+			resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo,
+											  proute, slot, estate);
+
+			if (prevResultRelInfo != resultRelInfo)
 			{
 				/* Check if we can multi-insert into this partition */
 				if (insertMethod == CIM_MULTI_CONDITIONAL)
@@ -2723,12 +2708,9 @@ CopyFrom(CopyState cstate)
 					if (nBufferedTuples > 0)
 					{
 						ExprContext *swapcontext;
-						ResultRelInfo *presultRelInfo;
-
-						presultRelInfo = proute->partitions[prev_leaf_part_index];
 
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-											presultRelInfo, myslot, bistate,
+											prevResultRelInfo, myslot, bistate,
 											nBufferedTuples, bufferedTuples,
 											firstBufferedLineNo);
 						nBufferedTuples = 0;
@@ -2785,21 +2767,6 @@ CopyFrom(CopyState cstate)
 					}
 				}
 
-				/*
-				 * Overwrite resultRelInfo with the corresponding partition's
-				 * one.
-				 */
-				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
-
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 											  resultRelInfo->ri_TrigDesc->trig_insert_before_row);
@@ -2825,7 +2792,7 @@ CopyFrom(CopyState cstate)
 				 * buffer when the partition being inserted into changes.
 				 */
 				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
+				prevResultRelInfo = resultRelInfo;
 			}
 
 			/*
@@ -2835,7 +2802,7 @@ CopyFrom(CopyState cstate)
 
 			/*
 			 * If we're capturing transition tuples, we might need to convert
-			 * from the partition rowtype to parent rowtype.
+			 * from the partition rowtype to root rowtype.
 			 */
 			if (cstate->transition_capture != NULL)
 			{
@@ -2848,8 +2815,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						resultRelInfo->ri_PartitionInfo->pi_PartitionToRootMap;
 				}
 				else
 				{
@@ -2863,18 +2829,18 @@ CopyFrom(CopyState cstate)
 			}
 
 			/*
-			 * We might need to convert from the parent rowtype to the
-			 * partition rowtype.
+			 * We might need to convert from the root rowtype to the partition
+			 * rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = resultRelInfo->ri_PartitionInfo->pi_RootToPartitionMap;
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
 				MemoryContext oldcontext;
 
-				Assert(proute->partition_tuple_slots != NULL &&
-					   proute->partition_tuple_slots[leaf_part_index] != NULL);
-				new_slot = proute->partition_tuple_slots[leaf_part_index];
+				new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot;
+				Assert(new_slot != NULL);
+
 				slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 
 				/*
@@ -3019,12 +2985,8 @@ CopyFrom(CopyState cstate)
 	{
 		if (insertMethod == CIM_MULTI_CONDITIONAL)
 		{
-			ResultRelInfo *presultRelInfo;
-
-			presultRelInfo = proute->partitions[prev_leaf_part_index];
-
 			CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-								presultRelInfo, myslot, bistate,
+								prevResultRelInfo, myslot, bistate,
 								nBufferedTuples, bufferedTuples,
 								firstBufferedLineNo);
 		}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ba156f8c5f..32d2461528 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1343,7 +1343,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 
 	resultRelInfo->ri_PartitionCheck = partition_check;
 	resultRelInfo->ri_PartitionRoot = partition_root;
-	resultRelInfo->ri_PartitionReadyForRouting = false;
+	resultRelInfo->ri_PartitionInfo = NULL; /* May be set later */
 }
 
 /*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1e72e9fb3f..962db6d7f0 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,10 +31,74 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
+
+ /*-----------------------
+  * PartitionTupleRouting - Encapsulates all information required to
+  * route a tuple inserted into a partitioned table to one of its leaf
+  * partitions
+  *
+  * partition_root			The partitioned table that's the target of the
+  *							command.
+  *
+  * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+  *							pointers to PartitionDispatch objects for every
+  *							partitioned table touched by tuple routing.  The
+  *							entry for the target partitioned table is *always*
+  *							present in the 0th element of this array.  See
+  *							comment for PartitionDispatchData->indexes for
+  *							details on how this array is indexed.
+  *
+  * num_dispatch				The current number of items stored in the
+  *							'partition_dispatch_info' array.  Also serves as
+  *							the index of the next free array element for new
+  *							PartitionDispatch objects which need to be stored.
+  *
+  * dispatch_allocsize		The current allocated size of the
+  *							'partition_dispatch_info' array.
+  *
+  * partitions				Array of 'partitions_allocsize' elements
+  *							containing pointers to a ResultRelInfos of all
+  *							containing pointers to the ResultRelInfos of all
+  *							these are pointers to ResultRelInfos which are
+  *							borrowed out of 'subplan_resultrel_hash'.  The
+  *							remainder have been built especially for tuple
+  *							routing.  See comment for
+  *							PartitionDispatchData->indexes for details on how
+  *							this array is indexed.
+  *
+  * num_partitions			The current number of items stored in the
+  *							'partitions' array.  Also serves as the index of
+  *							the next free array element for new ResultRelInfos
+  *							which need to be stored.
+  *
+  * partitions_allocsize		The current allocated size of the 'partitions'
+  *							array.
+  *
+  * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+  *							This is used to cache ResultRelInfos from subplans
+  *							of an UPDATE ModifyTable node.  Some of these may
+  *							be useful for tuple routing to save having to build
+  *							duplicates.
+  *-----------------------
+  */
+typedef struct PartitionTupleRouting
+{
+	Relation	partition_root;
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	int			dispatch_allocsize;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	int			partitions_allocsize;
+	HTAB	   *subplan_resultrel_hash;
+} PartitionTupleRouting;
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * hierarchy required to route a tuple to any of its partitions.  A
+ * PartitionDispatch is always encapsulated inside a PartitionTupleRouting
+ * struct and stored inside its 'partition_dispatch_info' array.
  *
  *	reldesc		Relation descriptor of the table
  *	key			Partition key information of the table
@@ -45,9 +109,14 @@
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
  *				this table's rowtype (when extracting the partition key of a
  *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *	indexes		Array of partdesc->nparts elements.  For leaf partitions the
+ *				index into the encapsulating PartitionTupleRouting's
+ *				'partitions' array is stored.  When the partition is itself a
+ *				partitioned table then we store the index into the
+ *				encapsulating PartitionTupleRouting's
+ *				'partition_dispatch_info' array.  An index of -1 means we've
+ *				not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +127,23 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecCheckPartitionArraySpace(PartitionTupleRouting *proute);
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static void ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					ResultRelInfo *partRelInfo);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +170,102 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition().  The actual ResultRelInfo for a partition is only
+ * allocated when the first tuple is routed there.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
 
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
+	 * demand, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially allocate enough memory to hold PARTITION_ROUTING_INITSIZE
+	 * PartitionDispatch and ResultRelInfo pointers in their respective arrays.
+	 * More space can be allocated later if we end up routing tuples to more
+	 * than that many partitions.
+	 *
+	 * Initially we must only set up 1 PartitionDispatch object; the one for
+	 * the partitioned table that's the target of the command.  If we must
+	 * route a tuple via some sub-partitioned table, then its
+	 * PartitionDispatch is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatch) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			update_rri_index++;
-		}
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+	else
+		proute->subplan_resultrel_hash = NULL;
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find and return the ResultRelInfo for the leaf
+ * partition for the tuple contained in *slot.
+ *
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.  When reusing a
+ * ResultRelInfo from the mtstate we verify that the relation is a valid
+ * target for INSERTs and then set up a PartitionRoutingInfo for it.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message.  An error may also be raised if the found target partition is
+ * not a valid target for an INSERT.
  */
-int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ResultRelInfo *
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -228,17 +278,18 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate, true);
+	if (rootResultRelInfo->ri_PartitionCheck)
+		ExecPartitionCheck(rootResultRelInfo, slot, estate, true);
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
@@ -260,91 +311,235 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			ResultRelInfo *rri;
+
+			/*
+			 * Look to see if we've already got a ResultRelInfo for this
+			 * partition.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				rri = proute->partitions[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				int			rri_index = -1;
+
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or build a
+				 * new one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						/* Found one! */
+
+						/* Verify this ResultRelInfo allows INSERTs */
+						CheckValidResultRel(rri, CMD_INSERT);
+
+						/* This shouldn't have been set up yet */
+						Assert(rri->ri_PartitionInfo == NULL);
+
+						/* Setup the PartitionRoutingInfo for it */
+						ExecInitRoutingInfo(mtstate, estate, rri);
+
+						rri_index = proute->num_partitions++;
+						dispatch->indexes[partidx] = rri_index;
+
+						ExecCheckPartitionArraySpace(proute);
+
+						/*
+						 * Store it in the partitions array so we don't have
+						 * to look it up again.
+						 */
+						proute->partitions[rri_index] = rri;
+					}
+				}
+
+				/* We need to create a new one. */
+				if (rri_index < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					rri = ExecInitPartitionInfo(mtstate, rootResultRelInfo,
+												proute, estate,
+												dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return rri;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* A partition was not found. */
-	if (result < 0)
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
-	}
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
 
-	return result;
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
+	}
+}
+
+/*
+ * ExecCheckPartitionArraySpace
+ *		Ensure there's enough space in the 'partitions' array of 'proute'
+ */
+static void
+ExecCheckPartitionArraySpace(PartitionTupleRouting *proute)
+{
+	if (proute->num_partitions >= proute->partitions_allocsize)
+	{
+		proute->partitions_allocsize *= 2;
+		proute->partitions = (ResultRelInfo **)
+			repalloc(proute->partitions, sizeof(ResultRelInfo *) *
+					 proute->partitions_allocsize);
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
+ *		and store it in the next empty slot in proute's partitions array.
  *
  * Returns the ResultRelInfo
  */
-ResultRelInfo *
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -520,15 +715,22 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 									&mtstate->ps, RelationGetDescr(partrel));
 	}
 
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	ExecCheckPartitionArraySpace(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
+
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, leaf_part_rri);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +743,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +756,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +770,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +780,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +794,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,9 +885,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
-
 	MemoryContextSwitchTo(oldContext);
 
 	return leaf_part_rri;
@@ -689,27 +892,29 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 
 /*
  * ExecInitRoutingInfo
- *		Set up information needed for routing tuples to a leaf partition
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format.
  */
-void
+static void
 ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx)
+					ResultRelInfo *partRelInfo)
 {
 	MemoryContext oldContext;
+	PartitionRoutingInfo *partrouteinfo;
 
 	/*
 	 * Switch into per-query memory context.
 	 */
 	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
 
+	partrouteinfo = palloc(sizeof(PartitionRoutingInfo));
+
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
+	partrouteinfo->pi_RootToPartitionMap =
 		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
 							   RelationGetDescr(partRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
@@ -720,28 +925,36 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * for various operations that are applied to tuples after routing, such
 	 * as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (partrouteinfo->pi_RootToPartitionMap != NULL)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
-		if (proute->partition_tuple_slots == NULL)
-			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
-
 		/*
 		 * Initialize the slot itself setting its descriptor to this
 		 * partition's TupleDesc; TupleDesc reference will be released at the
 		 * end of the command.
 		 */
-		proute->partition_tuple_slots[partidx] =
+		partrouteinfo->pi_PartitionTupleSlot =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
 	}
+	else
+		partrouteinfo->pi_PartitionTupleSlot = NULL;
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from the partition's rowtype to the root partitioned table's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		partrouteinfo->pi_PartitionToRootMap =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+	}
+	else
+		partrouteinfo->pi_PartitionToRootMap = NULL;
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -753,71 +966,92 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	MemoryContextSwitchTo(oldContext);
 
-	partRelInfo->ri_PartitionReadyForRouting = true;
+	partRelInfo->ri_PartitionInfo = partrouteinfo;
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the 'proute' partition_dispatch_info[]
+ *		array.  Also, record the index into this array in the 'parent_pd'
+ *		indexes[] array in the partidx element so that we can properly
+ *		retrieve the newly created PartitionDispatch later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For every partitioned table other than the root, we must store a
+		 * tuple table slot initialized with its tuple descriptor and a tuple
+		 * conversion map to convert a tuple from its parent's rowtype to its
+		 * own. That is to make sure that we are looking at the correct row
+		 * using the correct tuple descriptor when computing its partition key
+		 * for tuple routing.
+		 */
+		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	if (dispatchidx >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+					 sizeof(PartitionDispatchData *) *
+					 proute->dispatch_allocsize);
+	}
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1064,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -852,179 +1086,31 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
-		/* Allow any FDWs to shut down if they've been exercised */
-		if (resultRelInfo->ri_PartitionReadyForRouting &&
-			resultRelInfo->ri_FdwRoutine != NULL &&
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
-														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
-		}
-
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
-
-	/* Release the standalone partition tuple descriptors, if any */
-	if (proute->root_tuple_slot)
-		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
-}
-
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
+			Oid			partoid;
+			bool		found;
 
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
 
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
+		/* Allow any FDWs to shut down if they've been exercised */
+		if (resultRelInfo->ri_FdwRoutine != NULL &&
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
+														   resultRelInfo);
+
+		ExecCloseIndices(resultRelInfo);
+		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 }
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e2836b75ff..2faaea95ce 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1163,7 +1162,8 @@ lreplace:;
 			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
 			if (tupconv_map != NULL)
 				slot = execute_attr_map_slot(tupconv_map->attrMap,
-											 slot, proute->root_tuple_slot);
+											 slot,
+											 mtstate->mt_root_tuple_slot);
 
 			/*
 			 * Prepare for tuple routing, making it look like we're inserting
@@ -1665,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1698,52 +1698,21 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						TupleTableSlot *slot)
 {
 	ModifyTable *node;
-	int			partidx;
 	ResultRelInfo *partrel;
+	PartitionRoutingInfo *partrouteinfo;
 	HeapTuple	tuple;
 	TupleConversionMap *map;
 
 	/*
-	 * Determine the target partition.  If ExecFindPartition does not find a
-	 * partition after all, it doesn't return here; otherwise, the returned
-	 * value is to be used as an index into the arrays for the ResultRelInfo
-	 * and TupleConversionMap for the partition.
-	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
-	Assert(partidx >= 0 && partidx < proute->num_partitions);
-
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
+	 * Look up the target partition's ResultRelInfo.  If ExecFindPartition does
+	 * not find a valid partition for the tuple in 'slot' then an error is
+	 * raised.  An error may also be raised if the found partition is not a
+	 * valid target for INSERTs.  This is required since an UPDATE moving a
+	 * tuple into another partition becomes a DELETE+INSERT.
 	 */
-	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
-
-	/*
-	 * Check whether the partition is routable if we didn't yet
-	 *
-	 * Note: an UPDATE of a partition key invokes an INSERT that moves the
-	 * tuple to a new partition.  This check would be applied to a subplan
-	 * partition of such an UPDATE that is chosen as the partition to route
-	 * the tuple to.  The reason we do this check here rather than in
-	 * ExecSetupPartitionTupleRouting is to avoid aborting such an UPDATE
-	 * unnecessarily due to non-routable subplan partitions that may not be
-	 * chosen for update tuple movement after all.
-	 */
-	if (!partrel->ri_PartitionReadyForRouting)
-	{
-		/* Verify the partition is a valid target for INSERT. */
-		CheckValidResultRel(partrel, CMD_INSERT);
-
-		/* Set up information needed for routing tuples to the partition. */
-		ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
-	}
+	partrel = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
+	partrouteinfo = partrel->ri_PartitionInfo;
+	Assert(partrouteinfo != NULL);
 
 	/*
 	 * Make it look like we are inserting into the partition.
@@ -1755,7 +1724,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 	/*
 	 * If we're capturing transition tuples, we might need to convert from the
-	 * partition rowtype to parent rowtype.
+	 * partition rowtype to root partitioned table's rowtype.
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
@@ -1768,7 +1737,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				partrouteinfo->pi_PartitionToRootMap;
 		}
 		else
 		{
@@ -1783,20 +1752,17 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			partrouteinfo->pi_PartitionToRootMap;
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = partrouteinfo->pi_RootToPartitionMap;
 	if (map != NULL)
 	{
-		TupleTableSlot *new_slot;
+		TupleTableSlot *new_slot = partrouteinfo->pi_PartitionTupleSlot;
 
-		Assert(proute->partition_tuple_slots != NULL &&
-			   proute->partition_tuple_slots[partidx] != NULL);
-		new_slot = proute->partition_tuple_slots[partidx];
 		slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 	}
 
@@ -1834,17 +1800,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1821,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
@@ -2361,10 +2255,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partitioned
 	 * table descriptor.  In such a case we need to convert tuples to the root
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.  Skip this setup if it's not a partition key update.
+	 * from the root.  We'll also need a slot to store these converted tuples.
+	 * We can skip this setup if it's not a partition key update.
 	 */
 	if (update_tuple_routing_needed)
+	{
 		ExecSetupChildParentMapForSubplan(mtstate);
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
 
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
@@ -2704,9 +2602,16 @@ ExecEndModifyTable(ModifyTableState *node)
 														   resultRelInfo);
 	}
 
-	/* Close all the partitioned tables, leaf partitions, and their indices */
+	/*
+	 * Close all the partitioned tables, leaf partitions, and their indices
+	 * and release the slot used for tuple routing, if set.
+	 */
 	if (node->mt_partition_tuple_routing)
+	{
 		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
+		if (node->mt_root_tuple_slot)
+			ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
+	}
 
 	/*
 	 * Free the exprcontext
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 5757301d05..2afde69134 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -582,6 +582,7 @@ RelationBuildPartitionDesc(Relation rel)
 		int			next_index = 0;
 
 		result->oids = (Oid *) palloc0(nparts * sizeof(Oid));
+		result->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
 		boundinfo = (PartitionBoundInfoData *)
 			palloc0(sizeof(PartitionBoundInfoData));
@@ -770,7 +771,15 @@ RelationBuildPartitionDesc(Relation rel)
 		 * defined by canonicalized representation of the partition bounds.
 		 */
 		for (i = 0; i < nparts; i++)
-			result->oids[mapping[i]] = oids[i];
+		{
+			int			index = mapping[i];
+
+			result->oids[index] = oids[i];
+			/* Record if the partition is a leaf partition */
+			result->is_leaf[index] =
+				(get_rel_relkind(oids[i]) != RELKIND_PARTITIONED_TABLE);
+		}
+
 		pfree(mapping);
 	}
 
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..d3cfb55f9f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -18,74 +18,36 @@
 #include "nodes/plannodes.h"
 #include "partitioning/partprune.h"
 
-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;
 
-/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
+/*
+ * PartitionRoutingInfo
  *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
- *-----------------------
+ * Additional result relation information specific to routing tuples to a
+ * table partition.
  */
-typedef struct PartitionTupleRouting
+typedef struct PartitionRoutingInfo
 {
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;
-	Oid		   *partition_oids;
-	ResultRelInfo **partitions;
-	int			num_partitions;
-	TupleConversionMap **parent_child_tupconv_maps;
-	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot **partition_tuple_slots;
-	TupleTableSlot *root_tuple_slot;
-} PartitionTupleRouting;
+	/*
+	 * Map for converting tuples in root partitioned table format into
+	 * partition format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_RootToPartitionMap;
+
+	/*
+	 * Map for converting tuples in partition format into the root partitioned
+	 * table format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_PartitionToRootMap;
+
+	/*
+	 * Slot to store tuples in partition format, or NULL when no translation
+	 * is required between root and partition.
+	 */
+	TupleTableSlot *pi_PartitionTupleSlot;
+} PartitionRoutingInfo;
 
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
@@ -175,22 +137,11 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18544566f7..423118cbbc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -33,6 +33,8 @@
 
 
 struct PlanState;				/* forward references in this file */
+struct PartitionRoutingInfo;
+struct PartitionTupleRouting;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -469,8 +471,8 @@ typedef struct ResultRelInfo
 	/* relation descriptor for root partitioned table */
 	Relation	ri_PartitionRoot;
 
-	/* true if ready for tuple routing */
-	bool		ri_PartitionReadyForRouting;
+	/* Additional information that's specific to partition tuple routing */
+	struct PartitionRoutingInfo *ri_PartitionInfo;
 } ResultRelInfo;
 
 /* ----------------
@@ -1074,6 +1076,12 @@ typedef struct ModifyTableState
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
 
+	/*
+	 * Slot for storing tuples in the root partitioned table's rowtype during
+	 * an UPDATE of a partitioned table.
+	 */
+	TupleTableSlot *mt_root_tuple_slot;
+
 	/* Tuple-routing support info */
 	struct PartitionTupleRouting *mt_partition_tuple_routing;
 
diff --git a/src/test/regress/expected/insert_conflict.out b/src/test/regress/expected/insert_conflict.out
index 27cf5a01b3..6b841c7850 100644
--- a/src/test/regress/expected/insert_conflict.out
+++ b/src/test/regress/expected/insert_conflict.out
@@ -904,4 +904,26 @@ select * from parted_conflict order by a;
  50 | cincuenta | 2
 (1 row)
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+truncate parted_conflict;
+insert into parted_conflict values (0, 'cero', 1);
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+NOTICE:  a = 0, b = cero, c = 2
 drop table parted_conflict;
+drop function parted_conflict_update_func();
diff --git a/src/test/regress/sql/insert_conflict.sql b/src/test/regress/sql/insert_conflict.sql
index c677d70fb7..fe6dcfaa06 100644
--- a/src/test/regress/sql/insert_conflict.sql
+++ b/src/test/regress/sql/insert_conflict.sql
@@ -576,4 +576,30 @@ insert into parted_conflict values (50, 'cincuenta', 2)
 -- should see (50, 'cincuenta', 2)
 select * from parted_conflict order by a;
 
+-- test with statement level triggers
+create or replace function parted_conflict_update_func() returns trigger as $$
+declare
+    r record;
+begin
+ for r in select * from inserted loop
+	raise notice 'a = %, b = %, c = %', r.a, r.b, r.c;
+ end loop;
+ return new;
+end;
+$$ language plpgsql;
+
+create trigger parted_conflict_update
+    after update on parted_conflict
+    referencing new table as inserted
+    for each statement
+    execute procedure parted_conflict_update_func();
+
+truncate parted_conflict;
+
+insert into parted_conflict values (0, 'cero', 1);
+
+insert into parted_conflict values(0, 'cero', 1)
+  on conflict (a,b) do update set c = parted_conflict.c + 1;
+
 drop table parted_conflict;
+drop function parted_conflict_update_func();
-- 
2.16.2.windows.1

v16-0002-Delay-locking-of-partitions-during-INSERT-and-UP.patch
From e788b293f3770c7d89bc2156658f4bde3aba1303 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 9 Nov 2018 10:20:14 +1300
Subject: [PATCH v16 2/2] Delay locking of partitions during INSERT and UPDATE

During INSERT, even if we were inserting a single row into a partitioned
table, we would obtain a lock on every partition which was a direct or
an indirect partition of the insert target table.  This was done in order
to provide a consistent order to the locking of the partitions, which happens
to be the same order that partitions are locked during planning.  The
problem with locking all these partitions was that if a partitioned table
had many partitions and the INSERT inserted just one, or a few rows, the
overhead of the locking was significantly more than that of inserting the
actual rows.

This commit changes the locking to only lock partitions the first time we
route a tuple to them, so if you insert one row, then only 1 leaf
partition will be locked, plus any sub-partitioned tables that we search
through before we find the correct home of the tuple.  This does mean that
the locking order of partitions during INSERT becomes less well defined.
Previously, operations such as CREATE INDEX and TRUNCATE, when performed
on leaf partitions, could defend against deadlocking with a concurrent
INSERT by performing the operation in table oid order. However, to
deadlock, such DDL would have had to be performed inside a transaction
and not in table oid order.  With this commit it's now possible to get
deadlocks even if the DDL is performed in table oid order.  If required,
such transactions can defend against such deadlocks by performing a LOCK
TABLE on the partitioned table before performing the DDL.

Currently, only INSERTs are affected by this change, as UPDATEs to a
partitioned table still obtain locks on all partitions either during
planning or during AcquireExecutorLocks.  However, there are upcoming
patches which may change this too.
---
 src/backend/executor/execPartition.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 962db6d7f0..f37371f561 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -167,9 +167,6 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * tuple routing for partitioned tables, encapsulates it in
  * PartitionTupleRouting, and returns it.
  *
- * Note that all the relations in the partition tree are locked using the
- * RowExclusiveLock mode upon return from this function.
- *
  * Callers must use the returned PartitionTupleRouting during calls to
  * ExecFindPartition().  The actual ResultRelInfo for a partition is only
  * allocated when the first tuple is routed there.
@@ -180,9 +177,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	PartitionTupleRouting *proute;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/* Lock all the partitions. */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
 	/*
 	 * Here we attempt to expend as little effort as possible in setting up
 	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
@@ -535,11 +529,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	bool		found_whole_row;
 	int			part_result_rel_index;
 
-	/*
-	 * We locked all the partitions in ExecSetupPartitionTupleRouting
-	 * including the leaf partitions.
-	 */
-	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -987,7 +977,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	int			dispatchidx;
 
 	if (partoid != RelationGetRelid(proute->partition_root))
-		rel = heap_open(partoid, NoLock);
+		rel = heap_open(partoid, RowExclusiveLock);
 	else
 		rel = proute->partition_root;
 	partdesc = RelationGetPartitionDesc(rel);
-- 
2.16.2.windows.1
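
To make the deadlock caveat in the commit message above concrete, here is a
minimal sketch of the defense it recommends (the table and partition names
are made up for illustration; they are not from the patch): a transaction
performing DDL on several leaf partitions first locks the partitioned table
itself, which serializes it against concurrent INSERTs regardless of the
order in which those INSERTs now lock individual partitions.

create table parted (a int, b int) partition by range (a);
create table parted_p1 partition of parted for values from (0) to (100);
create table parted_p2 partition of parted for values from (100) to (200);

begin;
-- SHARE mode conflicts with the RowExclusiveLock taken by INSERT, so no
-- concurrent INSERT can be partway through locking partitions while the
-- DDL below works through the leaf partitions.
lock table parted in share mode;
create index on parted_p1 (a);
create index on parted_p2 (a);
commit;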

#58Jesper Pedersen
jesper.pedersen@redhat.com
In reply to: David Rowley (#57)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi,

On 11/12/18 6:17 PM, David Rowley wrote:
> On 9 November 2018 at 19:18, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>> I have a comment regarding how you chose to make
>> PartitionTupleRouting private.
>>
>> Using the v14_to_v15 diff, I could quickly see that there are many diffs
>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they
>> would be unnecessary if you had added the following in execPartition.h, as
>> my upthread had done.
>>
>> -/* See execPartition.c for the definition. */
>> +/* See execPartition.c for the definitions. */
>>  typedef struct PartitionDispatchData *PartitionDispatch;
>> +typedef struct PartitionTupleRouting PartitionTupleRouting;
>
> Okay, done that way. v16 attached.

Still passes check-world.

Best regards,
Jesper

#59Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Jesper Pedersen (#58)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/14 0:32, Jesper Pedersen wrote:
> Hi,
>
> On 11/12/18 6:17 PM, David Rowley wrote:
>> On 9 November 2018 at 19:18, Amit Langote
>> <Langote_Amit_f8@lab.ntt.co.jp> wrote:
>>> I have a comment regarding how you chose to make
>>> PartitionTupleRouting private.
>>>
>>> Using the v14_to_v15 diff, I could quickly see that there are many diffs
>>> changing PartitionTupleRouting to struct PartitionTupleRouting, but they
>>> would be unnecessary if you had added the following in execPartition.h, as
>>> my upthread had done.
>>>
>>> -/* See execPartition.c for the definition. */
>>> +/* See execPartition.c for the definitions. */
>>>   typedef struct PartitionDispatchData *PartitionDispatch;
>>> +typedef struct PartitionTupleRouting PartitionTupleRouting;
>>
>> Okay, done that way. v16 attached.

Thank you.

> Still passes check-world.

I looked at v16 and noticed a few typos:

+  * partition_dispatch_info Array of 'dispatch_allocsize' elements containing
+  *                         a pointer to a PartitionDispatch objects for

a PartitionDispatch objects -> a PartitionDispatch object

+  * partitions              Array of 'partitions_allocsize' elements
+  *                         containing pointers to a ResultRelInfos of all
+  *                         leaf partitions touched by tuple routing.

a ResultRelInfos -> ResultRelInfos

+ * PartitionDispatch and ResultRelInfo pointers the 'partitions' array.

"in" the 'partitions' array.

+ /* Setup the PartitionRoutingInfo for it */

Setup -> Set up

+ * Ensure there's enough space in the 'partitions' array of 'proute'

+ * and store it in the next empty slot in proute's partitions array.

Not a typo, but maybe just write proute->partitions instead of "partitions
array of proute" and "proute's partition array".

+ *      the next available slot in the 'proute' partition_dispatch_info[]
+ *      array.  Also, record the index into this array in the 'parent_pd'

Similarly, here: proute->partition_dispatch_info array

+ *      array.  Also, record the index into this array in the 'parent_pd'
+ *      indexes[] array in the partidx element so that we can properly

Similarly, parent_pd->indexes array

+    if (dispatchidx >= proute->dispatch_allocsize)
+    {
+        /* Expand allocated space. */
+        proute->dispatch_allocsize *= 2;
+        proute->partition_dispatch_info = (PartitionDispatchData **)
+            repalloc(proute->partition_dispatch_info,
+                     sizeof(PartitionDispatchData *) *
+                     proute->dispatch_allocsize);
+    }

Sorry, I forgot to point this out before, but can this code in
ExecInitPartitionDispatchInfo be accommodated in
ExecCheckPartitionArraySpace() for consistency?

Thanks,
Amit

#60David Rowley
david.rowley@2ndquadrant.com
In reply to: Amit Langote (#59)
2 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Thanks for looking at this again.

On 14 November 2018 at 13:47, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> +    if (dispatchidx >= proute->dispatch_allocsize)
> +    {
> +        /* Expand allocated space. */
> +        proute->dispatch_allocsize *= 2;
> +        proute->partition_dispatch_info = (PartitionDispatchData **)
> +            repalloc(proute->partition_dispatch_info,
> +                     sizeof(PartitionDispatchData *) *
> +                     proute->dispatch_allocsize);
> +    }
>
> Sorry, I forgot to point this out before, but can this code in
> ExecInitPartitionDispatchInfo be accommodated in
> ExecCheckPartitionArraySpace() for consistency?
I don't really want to put that code in ExecCheckPartitionArraySpace()
as the way the function is now, it makes quite a lot of sense for the
compiler to inline it. If we add redundant work in there, then it
makes less sense. There's never any need to check both arrays at once
as we're only adding the new item to one array at a time.

Instead, I've written a new function named
ExecCheckDispatchArraySpace() and put the resize code inside that.

I've fixed the typos you mentioned. The only other thing I changed was
to only allocate the PartitionDispatch->tupslot if a conversion is
required. The previous code allocated this regardless if it was going
to be used or not. This saves both the redundant allocation and also
very slightly reduces the cost of the if test in ExecFindPartition().
There's now no need to check if the map ! NULL as if the slot is there
then the map must exist too.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

v17-0001-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch
From 84e580474ad5b5260aa36a20607ee3e3bd5fdb87 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 26 Jul 2018 19:54:55 +1200
Subject: [PATCH v17 1/2] Speed up INSERT and UPDATE on partitioned tables

This is more or less a complete redesign of PartitionTupleRouting.  This
changes the setup so that it does far less work initially and pushes more
work out to when partitions receive tuples.
PartitionDispatchData structs for sub-partitioned tables are only created
when a tuple gets routed through them. The possibly large arrays in the
PartitionTupleRouting struct have largely been removed.  The partitions[]
array remains but now never contains any NULL gaps.  Previously the NULLs
had to be skipped during ExecCleanupTupleRouting(), which could add a
large overhead to the cleanup when the number of partitions was large.
The partitions[] array is allocated small to start with and only enlarged
when we route tuples to enough partitions that it runs out of space. This
allows us to keep simple single-row partition INSERTs running quickly.

The arrays in PartitionTupleRouting which stored the tuple translation
maps have now been removed.  These have been moved out into a
PartitionRoutingInfo struct which is an additional field in ResultRelInfo.

The find_all_inheritors() call still remains by far the slowest part of
ExecSetupPartitionTupleRouting(). This commit just removes the other slow
parts.

In passing also rename the tuple translation maps from being ParentToChild
and ChildToParent to being RootToPartition and PartitionToRoot. The old
names mislead you into thinking that a partition of some sub-partitioned
table would translate to the rowtype of the sub-partitioned table rather
than the root partitioned table.

David Rowley and Amit Langote
---
 src/backend/commands/copy.c            |  86 +---
 src/backend/executor/execMain.c        |   2 +-
 src/backend/executor/execPartition.c   | 903 ++++++++++++++++++---------------
 src/backend/executor/nodeModifyTable.c | 164 ++----
 src/backend/optimizer/prep/prepunion.c |   3 -
 src/backend/utils/cache/partcache.c    |  16 +-
 src/include/catalog/partition.h        |   6 +-
 src/include/executor/execPartition.h   | 105 +---
 src/include/nodes/execnodes.h          |  12 +-
 9 files changed, 620 insertions(+), 677 deletions(-)

diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index b58a74f4e3..523eb2f995 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate)
 	bool	   *nulls;
 	ResultRelInfo *resultRelInfo;
 	ResultRelInfo *target_resultRelInfo;
+	ResultRelInfo *prevResultRelInfo = NULL;
 	EState	   *estate = CreateExecutorState(); /* for ExecConstraints() */
 	ModifyTableState *mtstate;
 	ExprContext *econtext;
@@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate)
 	CopyInsertMethod insertMethod;
 	uint64		processed = 0;
 	int			nBufferedTuples = 0;
-	int			prev_leaf_part_index = -1;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -2513,8 +2513,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know about whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition() below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2524,19 +2528,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2692,25 +2685,17 @@ CopyFrom(CopyState cstate)
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
-			int			leaf_part_index;
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found or
+			 * if the found partition is not suitable for INSERTs.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < proute->num_partitions);
-
-			if (prev_leaf_part_index != leaf_part_index)
+			resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo,
+											  proute, slot, estate);
+
+			if (prevResultRelInfo != resultRelInfo)
 			{
 				/* Check if we can multi-insert into this partition */
 				if (insertMethod == CIM_MULTI_CONDITIONAL)
@@ -2723,12 +2708,9 @@ CopyFrom(CopyState cstate)
 					if (nBufferedTuples > 0)
 					{
 						ExprContext *swapcontext;
-						ResultRelInfo *presultRelInfo;
-
-						presultRelInfo = proute->partitions[prev_leaf_part_index];
 
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-											presultRelInfo, myslot, bistate,
+											prevResultRelInfo, myslot, bistate,
 											nBufferedTuples, bufferedTuples,
 											firstBufferedLineNo);
 						nBufferedTuples = 0;
@@ -2785,21 +2767,6 @@ CopyFrom(CopyState cstate)
 					}
 				}
 
-				/*
-				 * Overwrite resultRelInfo with the corresponding partition's
-				 * one.
-				 */
-				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
-
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 											  resultRelInfo->ri_TrigDesc->trig_insert_before_row);
@@ -2825,7 +2792,7 @@ CopyFrom(CopyState cstate)
 				 * buffer when the partition being inserted into changes.
 				 */
 				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
+				prevResultRelInfo = resultRelInfo;
 			}
 
 			/*
@@ -2835,7 +2802,7 @@ CopyFrom(CopyState cstate)
 
 			/*
 			 * If we're capturing transition tuples, we might need to convert
-			 * from the partition rowtype to parent rowtype.
+			 * from the partition rowtype to root rowtype.
 			 */
 			if (cstate->transition_capture != NULL)
 			{
@@ -2848,8 +2815,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						resultRelInfo->ri_PartitionInfo->pi_PartitionToRootMap;
 				}
 				else
 				{
@@ -2863,18 +2829,18 @@ CopyFrom(CopyState cstate)
 			}
 
 			/*
-			 * We might need to convert from the parent rowtype to the
-			 * partition rowtype.
+			 * We might need to convert from the root rowtype to the partition
+			 * rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = resultRelInfo->ri_PartitionInfo->pi_RootToPartitionMap;
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
 				MemoryContext oldcontext;
 
-				Assert(proute->partition_tuple_slots != NULL &&
-					   proute->partition_tuple_slots[leaf_part_index] != NULL);
-				new_slot = proute->partition_tuple_slots[leaf_part_index];
+				new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot;
+				Assert(new_slot != NULL);
+
 				slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 
 				/*
@@ -3019,12 +2985,8 @@ CopyFrom(CopyState cstate)
 	{
 		if (insertMethod == CIM_MULTI_CONDITIONAL)
 		{
-			ResultRelInfo *presultRelInfo;
-
-			presultRelInfo = proute->partitions[prev_leaf_part_index];
-
 			CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-								presultRelInfo, myslot, bistate,
+								prevResultRelInfo, myslot, bistate,
 								nBufferedTuples, bufferedTuples,
 								firstBufferedLineNo);
 		}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index ba156f8c5f..32d2461528 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1343,7 +1343,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 
 	resultRelInfo->ri_PartitionCheck = partition_check;
 	resultRelInfo->ri_PartitionRoot = partition_root;
-	resultRelInfo->ri_PartitionReadyForRouting = false;
+	resultRelInfo->ri_PartitionInfo = NULL; /* May be set later */
 }
 
 /*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1e72e9fb3f..9a685051cc 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -31,23 +31,94 @@
 #include "utils/rls.h"
 #include "utils/ruleutils.h"
 
+#define PARTITION_ROUTING_INITSIZE	8
+
+/*-----------------------
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions.
+ *
+ * partition_root			The partitioned table that's the target of the
+ *							command.
+ *
+ * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
+ *							a pointer to a PartitionDispatch object for every
+ *							partitioned table touched by tuple routing.  The
+ *							entry for the target partitioned table is *always*
+ *							present in the 0th element of this array.  See
+ *							comment for PartitionDispatchData->indexes for
+ *							details on how this array is indexed.
+ *
+ * num_dispatch				The current number of items stored in the
+ *							'partition_dispatch_info' array.  Also serves as
+ *							the index of the next free array element for new
+ *							PartitionDispatch objects that need to be stored.
+ *
+ * dispatch_allocsize		The current allocated size of the
+ *							'partition_dispatch_info' array.
+ *
+ * partitions				Array of 'partitions_allocsize' elements
+ *							containing a pointer to a ResultRelInfo for every
+ *							leaf partition touched by tuple routing.  Some of
+ *							these are pointers to ResultRelInfos which are
+ *							borrowed out of 'subplan_resultrel_hash'.  The
+ *							remainder have been built especially for tuple
+ *							routing.  See comment for
+ *							PartitionDispatchData->indexes for details on how
+ *							this array is indexed.
+ *
+ * num_partitions			The current number of items stored in the
+ *							'partitions' array.  Also serves as the index of
+ *							the next free array element for new ResultRelInfo
+ *							objects that need to be stored.
+ *
+ * partitions_allocsize		The current allocated size of the 'partitions'
+ *							array.
+ *
+ * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
+ *							This is used to cache ResultRelInfos from subplans
+ *							of an UPDATE ModifyTable node.  Some of these may
+ *							be useful for tuple routing to save having to build
+ *							duplicates.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	Relation	partition_root;
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	int			dispatch_allocsize;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	int			partitions_allocsize;
+	HTAB	   *subplan_resultrel_hash;
+} PartitionTupleRouting;
 
 /*-----------------------
  * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * hierarchy required to route a tuple to any of its partitions.  A
+ * PartitionDispatch is always encapsulated inside a PartitionTupleRouting
+ * struct and stored inside its 'partition_dispatch_info' array.
  *
  *	reldesc		Relation descriptor of the table
  *	key			Partition key information of the table
  *	keystate	Execution state required for expressions in the partition key
  *	partdesc	Partition descriptor of the table
  *	tupslot		A standalone TupleTableSlot initialized with this table's tuple
- *				descriptor
+ *				descriptor, or NULL if no tuple conversion from the parent
+ *				is required.
  *	tupmap		TupleConversionMap to convert from the parent's rowtype to
- *				this table's rowtype (when extracting the partition key of a
- *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ *				this table's rowtype (when extracting the partition key of
+ *				a tuple just before routing it through this table). A NULL
+ *				value is stored if no tuple conversion is required.
+ *	indexes		Array of partdesc->nparts elements.  For leaf partitions the
+ *				index into the encapsulating PartitionTupleRouting's
+ *				'partitions' array is stored.  When the partition is itself a
+ *				partitioned table then we store the index into the
+ *				encapsulating PartitionTupleRouting's
+ *				'partition_dispatch_info' array.  An index of -1 means we've
+ *				not yet allocated anything in PartitionTupleRouting for
+ *				the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +129,24 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static void ExecCheckPartitionArraySpace(PartitionTupleRouting *proute);
+static void ExecCheckDispatchArraySpace(PartitionTupleRouting *proute);
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static void ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					ResultRelInfo *partRelInfo);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,130 +173,103 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition().  The actual ResultRelInfo for a partition is only
+ * allocated when the partition is found for the first time.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
 
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
+	/*
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
+	 * demand, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
+	 *
+	 * We initially size the 'partition_dispatch_info' and 'partitions' arrays
+	 * to allow storage of PARTITION_ROUTING_INITSIZE pointers.  If we route
+	 * tuples to more than this many partitions or through more than that many
+	 * sub-partitioned tables then we'll need to increase the size of these
+	 * arrays.
+	 *
+	 * Initially we must only set up 1 PartitionDispatch object; the one for
+	 * the partitioned table that's the target of the command.  If we must
+	 * route a tuple via some sub-partitioned table, then its
+	 * PartitionDispatch is only built the first time it's required.
+	 */
+	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->partition_dispatch_info = (PartitionDispatchData **)
+		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
+	proute->num_dispatch = 0;
+	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
 
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
+	proute->partitions = (ResultRelInfo **)
+		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
 
-			update_rri_index++;
-		}
+	/* Mark that no items are yet stored in the 'partitions' array. */
+	proute->num_partitions = 0;
+	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
 
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
+										 0);
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	if (node && node->operation == CMD_UPDATE)
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
+	else
+		proute->subplan_resultrel_hash = NULL;
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find and return, or build and return the ResultRelInfo
+ * for the leaf partition that the tuple contained in *slot should belong to.
+ *
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.  When reusing a
+ * ResultRelInfo from the mtstate we verify that the relation is a valid
+ * target for INSERTs and then set up a PartitionRoutingInfo for it.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
  * partition key(s)
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message.  An error may also be raised if the found target partition is
+ * not a valid target for an INSERT.
  */
-int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ResultRelInfo *
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -228,25 +282,29 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate, true);
+	if (rootResultRelInfo->ri_PartitionCheck)
+		ExecPartitionCheck(rootResultRelInfo, slot, estate, true);
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
 		 * current relation.
 		 */
 		myslot = dispatch->tupslot;
-		if (myslot != NULL && map != NULL)
+		if (myslot != NULL)
+		{
+			Assert(map != NULL);
 			slot = execute_attr_map_slot(map, slot, myslot);
+		}
 
 		/*
 		 * Extract partition key from tuple. Expression evaluation machinery
@@ -260,91 +318,254 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, then error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
-
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
+		if (partdesc->is_leaf[partidx])
 		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			ResultRelInfo *rri;
+
+			/*
+			 * Look to see if we've already got a ResultRelInfo for this
+			 * partition.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				rri = proute->partitions[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				int			rri_index = -1;
+
+				/*
+				 * A ResultRelInfo has not been set up for this partition yet,
+				 * so either use one of the sub-plan result rels or build a
+				 * new one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					Oid			partoid = partdesc->oids[partidx];
+
+					rri = hash_search(proute->subplan_resultrel_hash,
+									  &partoid, HASH_FIND, NULL);
+
+					if (rri)
+					{
+						/* Found one! */
+
+						/* Verify this ResultRelInfo allows INSERTs */
+						CheckValidResultRel(rri, CMD_INSERT);
+
+						/* This shouldn't have been set up yet */
+						Assert(rri->ri_PartitionInfo == NULL);
+
+						/* Set up the PartitionRoutingInfo for it */
+						ExecInitRoutingInfo(mtstate, estate, rri);
+
+						rri_index = proute->num_partitions++;
+						dispatch->indexes[partidx] = rri_index;
+
+						ExecCheckPartitionArraySpace(proute);
+
+						/*
+						 * Store it in the partitions array so we don't have
+						 * to look it up again.
+						 */
+						proute->partitions[rri_index] = rri;
+					}
+				}
+
+				/* We need to create a new one. */
+				if (rri_index < 0)
+				{
+					MemoryContextSwitchTo(oldcxt);
+					rri = ExecInitPartitionInfo(mtstate, rootResultRelInfo,
+												proute, estate,
+												dispatch, partidx);
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				}
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return rri;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				MemoryContextSwitchTo(oldcxt);
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
+
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
+	ResultRelInfo *subplan_result_rels;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			nsubplans;
+	int			i;
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+	subplan_result_rels = mtstate->resultRelInfo;
+	nsubplans = list_length(node->plans);
 
-	/* A partition was not found. */
-	if (result < 0)
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
+					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < nsubplans; i++)
 	{
-		char	   *val_desc;
-
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		ResultRelInfo *rri = &subplan_result_rels[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		ResultRelInfo **subplanrri;
+
+		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
+													&found);
+
+		if (!found)
+			*subplanrri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
+}
 
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
+/*
+ * ExecCheckPartitionArraySpace
+ *		Ensure there's enough space in the proute->partitions array
+ */
+static void
+ExecCheckPartitionArraySpace(PartitionTupleRouting *proute)
+{
+	if (proute->num_partitions >= proute->partitions_allocsize)
+	{
+		proute->partitions_allocsize *= 2;
+		proute->partitions = (ResultRelInfo **)
+			repalloc(proute->partitions, sizeof(ResultRelInfo *) *
+					 proute->partitions_allocsize);
+	}
+}
 
-	return result;
+/*
+ * ExecCheckDispatchArraySpace
+ *		Ensure there's enough space in the proute->partition_dispatch_info
+ *		array.
+ */
+static void
+ExecCheckDispatchArraySpace(PartitionTupleRouting *proute)
+{
+	if (proute->num_dispatch >= proute->dispatch_allocsize)
+	{
+		/* Expand allocated space. */
+		proute->dispatch_allocsize *= 2;
+		proute->partition_dispatch_info = (PartitionDispatchData **)
+			repalloc(proute->partition_dispatch_info,
+				sizeof(PartitionDispatchData *) *
+				proute->dispatch_allocsize);
+	}
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
+ *		and store it in the next empty slot in the proute->partitions array.
  *
  * Returns the ResultRelInfo
  */
-ResultRelInfo *
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
+	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -521,14 +742,13 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	}
 
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(mtstate, estate, leaf_part_rri);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -541,7 +761,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -554,7 +774,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -568,7 +788,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -578,8 +798,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -588,7 +812,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -679,8 +903,13 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
+	part_result_rel_index = proute->num_partitions++;
+	dispatch->indexes[partidx] = part_result_rel_index;
+
+	ExecCheckPartitionArraySpace(proute);
+
+	/* Save here for later use. */
+	proute->partitions[part_result_rel_index] = leaf_part_rri;
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -689,27 +918,29 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 
 /*
  * ExecInitRoutingInfo
- *		Set up information needed for routing tuples to a leaf partition
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format.
  */
-void
+static void
 ExecInitRoutingInfo(ModifyTableState *mtstate,
 					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx)
+					ResultRelInfo *partRelInfo)
 {
 	MemoryContext oldContext;
+	PartitionRoutingInfo *partrouteinfo;
 
 	/*
 	 * Switch into per-query memory context.
 	 */
 	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
 
+	partrouteinfo = palloc(sizeof(PartitionRoutingInfo));
+
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
+	partrouteinfo->pi_RootToPartitionMap =
 		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
 							   RelationGetDescr(partRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
@@ -720,28 +951,36 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * for various operations that are applied to tuples after routing, such
 	 * as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (partrouteinfo->pi_RootToPartitionMap != NULL)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
-		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
-		if (proute->partition_tuple_slots == NULL)
-			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
-
 		/*
 		 * Initialize the slot itself setting its descriptor to this
 		 * partition's TupleDesc; TupleDesc reference will be released at the
 		 * end of the command.
 		 */
-		proute->partition_tuple_slots[partidx] =
+		partrouteinfo->pi_PartitionTupleSlot =
 			ExecInitExtraTupleSlot(estate,
 								   RelationGetDescr(partrel));
 	}
+	else
+		partrouteinfo->pi_PartitionTupleSlot = NULL;
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from the partition's rowtype to the root partitioned table's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		partrouteinfo->pi_PartitionToRootMap =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+	}
+	else
+		partrouteinfo->pi_PartitionToRootMap = NULL;
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -753,71 +992,85 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	MemoryContextSwitchTo(oldContext);
 
-	partRelInfo->ri_PartitionReadyForRouting = true;
+	partRelInfo->ri_PartitionInfo = partrouteinfo;
 }
 
 /*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the proute->partition_dispatch_info array.
+ *		Also, record the index into this array in the parent_pd->indexes[]
+ *		array in the partidx element so that we can properly retrieve the
+ *		newly created PartitionDispatch later.
  */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	Assert(proute != NULL);
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
 
-	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
-	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
-}
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
+									+ (partdesc->nparts * sizeof(int)));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-/*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
- */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
-{
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+		/*
+		 * For sub-partitioned tables where the column order differs from its
+		 * direct parent partitioned table, we must store a tuple table slot
+		 * initialized with its tuple descriptor and a tuple conversion map to
+		 * convert a tuple from its parent's rowtype to its own.  This is to
+		 * make sure that we are looking at the correct row using the correct
+		 * tuple descriptor when computing its partition key for tuple
+		 * routing.
+		 */
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+		pd->tupslot = pd->tupmap ? MakeSingleTupleTableSlot(tupdesc) : NULL;
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupslot = NULL;
+		pd->tupmap = NULL;
+	}
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	dispatchidx = proute->num_dispatch++;
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	ExecCheckDispatchArraySpace(proute);
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+	/* Save here for later use. */
+	proute->partition_dispatch_info[dispatchidx] = pd;
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install the link to allow us to descend the partition hierarchy for
+	 * future searches
+	 */
+	if (parent_pd)
+		parent_pd->indexes[partidx] = dispatchidx;
 
-	return *map;
+	return pd;
 }
 
 /*
@@ -830,8 +1083,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -845,186 +1098,40 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		PartitionDispatch pd = proute->partition_dispatch_info[i];
 
 		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
+
+		if (pd->tupslot)
+			ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
 	for (i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
-
-		/* Allow any FDWs to shut down if they've been exercised */
-		if (resultRelInfo->ri_PartitionReadyForRouting &&
-			resultRelInfo->ri_FdwRoutine != NULL &&
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
-			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
-														   resultRelInfo);
 
 		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
+		 * Check if this result rel is one belonging to the node's subplans;
+		 * if so, let ExecEndPlan() clean it up.
 		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
+		if (resultrel_hash)
 		{
-			subplan_index++;
-			continue;
-		}
+			Oid			partoid;
+			bool		found;
 
-		ExecCloseIndices(resultRelInfo);
-		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
-	}
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
 
-	/* Release the standalone partition tuple descriptors, if any */
-	if (proute->root_tuple_slot)
-		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
-}
-
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
 		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
 
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
+		/* Allow any FDWs to shut down if they've been exercised */
+		if (resultRelInfo->ri_FdwRoutine != NULL &&
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
+														   resultRelInfo);
+
+		ExecCloseIndices(resultRelInfo);
+		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
 }
 
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index e2836b75ff..9018543d2e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1163,7 +1162,8 @@ lreplace:;
 			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
 			if (tupconv_map != NULL)
 				slot = execute_attr_map_slot(tupconv_map->attrMap,
-											 slot, proute->root_tuple_slot);
+											 slot,
+											 mtstate->mt_root_tuple_slot);
 
 			/*
 			 * Prepare for tuple routing, making it look like we're inserting
@@ -1665,7 +1665,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1698,52 +1698,21 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						TupleTableSlot *slot)
 {
 	ModifyTable *node;
-	int			partidx;
 	ResultRelInfo *partrel;
+	PartitionRoutingInfo *partrouteinfo;
 	HeapTuple	tuple;
 	TupleConversionMap *map;
 
 	/*
-	 * Determine the target partition.  If ExecFindPartition does not find a
-	 * partition after all, it doesn't return here; otherwise, the returned
-	 * value is to be used as an index into the arrays for the ResultRelInfo
-	 * and TupleConversionMap for the partition.
-	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
-	Assert(partidx >= 0 && partidx < proute->num_partitions);
-
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
+	 * Lookup the target partition's ResultRelInfo.  If ExecFindPartition does
+	 * not find a valid partition for the tuple in 'slot' then an error is
+	 * raised.  An error may also be raised if the found partition is not a
+	 * valid target for INSERTs.  This is required since a partitioned table
+	 * UPDATE to another partition becomes a DELETE+INSERT.
 	 */
-	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
-
-	/*
-	 * Check whether the partition is routable if we didn't yet
-	 *
-	 * Note: an UPDATE of a partition key invokes an INSERT that moves the
-	 * tuple to a new partition.  This check would be applied to a subplan
-	 * partition of such an UPDATE that is chosen as the partition to route
-	 * the tuple to.  The reason we do this check here rather than in
-	 * ExecSetupPartitionTupleRouting is to avoid aborting such an UPDATE
-	 * unnecessarily due to non-routable subplan partitions that may not be
-	 * chosen for update tuple movement after all.
-	 */
-	if (!partrel->ri_PartitionReadyForRouting)
-	{
-		/* Verify the partition is a valid target for INSERT. */
-		CheckValidResultRel(partrel, CMD_INSERT);
-
-		/* Set up information needed for routing tuples to the partition. */
-		ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
-	}
+	partrel = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
+	partrouteinfo = partrel->ri_PartitionInfo;
+	Assert(partrouteinfo != NULL);
 
 	/*
 	 * Make it look like we are inserting into the partition.
@@ -1755,7 +1724,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 	/*
 	 * If we're capturing transition tuples, we might need to convert from the
-	 * partition rowtype to parent rowtype.
+	 * partition rowtype to root partitioned table's rowtype.
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
@@ -1768,7 +1737,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				partrouteinfo->pi_PartitionToRootMap;
 		}
 		else
 		{
@@ -1783,20 +1752,17 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			partrouteinfo->pi_PartitionToRootMap;
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = partrouteinfo->pi_RootToPartitionMap;
 	if (map != NULL)
 	{
-		TupleTableSlot *new_slot;
+		TupleTableSlot *new_slot = partrouteinfo->pi_PartitionTupleSlot;
 
-		Assert(proute->partition_tuple_slots != NULL &&
-			   proute->partition_tuple_slots[partidx] != NULL);
-		new_slot = proute->partition_tuple_slots[partidx];
 		slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 	}
 
@@ -1834,17 +1800,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			numResultRelInfos = mtstate->mt_nplans;
 	int			i;
 
-	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
 	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
@@ -1866,79 +1821,18 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	}
 }
 
-/*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
 /*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
@@ -2361,10 +2255,14 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partitioned
 	 * table descriptor.  In such a case we need to convert tuples to the root
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.  Skip this setup if it's not a partition key update.
+	 * from the root.  We'll also need a slot to store these converted tuples.
+	 * We can skip this setup if it's not a partition key update.
 	 */
 	if (update_tuple_routing_needed)
+	{
 		ExecSetupChildParentMapForSubplan(mtstate);
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
+	}
 
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
@@ -2704,10 +2602,18 @@ ExecEndModifyTable(ModifyTableState *node)
 														   resultRelInfo);
 	}
 
-	/* Close all the partitioned tables, leaf partitions, and their indices */
+	/*
+	 * Close all the partitioned tables, leaf partitions, and their indices
+	 * and release the slot used for tuple routing, if set.
+	 */
 	if (node->mt_partition_tuple_routing)
+	{
 		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
 
+		if (node->mt_root_tuple_slot)
+			ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
+	}
+
 	/*
 	 * Free the exprcontext
 	 */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 07653f312b..7856b47cdd 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -340,15 +340,23 @@ RelationBuildPartitionDesc(Relation rel)
 	oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
 	partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
 	partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
+	partdesc->is_leaf = (bool *) palloc(partdesc->nparts * sizeof(bool));
 
 	/*
 	 * Now assign OIDs from the original array into mapped indexes of the
-	 * result array.  Order of OIDs in the former is defined by the catalog
-	 * scan that retrieved them, whereas that in the latter is defined by
-	 * canonicalized representation of the partition bounds.
+	 * result array.  The order of OIDs in the former is defined by the
+	 * catalog scan that retrieved them, whereas that in the latter is defined
+	 * by canonicalized representation of the partition bounds.
 	 */
 	for (i = 0; i < partdesc->nparts; i++)
-		partdesc->oids[mapping[i]] = oids_orig[i];
+	{
+		int			index = mapping[i];
+
+		partdesc->oids[index] = oids_orig[i];
+		/* Record if the partition is a leaf partition */
+		partdesc->is_leaf[index] =
+				(get_rel_relkind(oids_orig[i]) != RELKIND_PARTITIONED_TABLE);
+	}
 	MemoryContextSwitchTo(oldcxt);
 
 	rel->rd_partdesc = partdesc;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..d3cfb55f9f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -18,74 +18,36 @@
 #include "nodes/plannodes.h"
 #include "partitioning/partprune.h"
 
-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;
 
-/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
+/*
+ * PartitionRoutingInfo
  *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
- *-----------------------
+ * Additional result relation information specific to routing tuples to a
+ * table partition.
  */
-typedef struct PartitionTupleRouting
+typedef struct PartitionRoutingInfo
 {
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;
-	Oid		   *partition_oids;
-	ResultRelInfo **partitions;
-	int			num_partitions;
-	TupleConversionMap **parent_child_tupconv_maps;
-	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot **partition_tuple_slots;
-	TupleTableSlot *root_tuple_slot;
-} PartitionTupleRouting;
+	/*
+	 * Map for converting tuples in root partitioned table format into
+	 * partition format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_RootToPartitionMap;
+
+	/*
+	 * Map for converting tuples in partition format into the root partitioned
+	 * table format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_PartitionToRootMap;
+
+	/*
+	 * Slot to store tuples in partition format, or NULL when no translation
+	 * is required between root and partition.
+	 */
+	TupleTableSlot *pi_PartitionTupleSlot;
+} PartitionRoutingInfo;
 
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
@@ -175,22 +137,11 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 18544566f7..423118cbbc 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -33,6 +33,8 @@
 
 
 struct PlanState;				/* forward references in this file */
+struct PartitionRoutingInfo;
+struct PartitionTupleRouting;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -469,8 +471,8 @@ typedef struct ResultRelInfo
 	/* relation descriptor for root partitioned table */
 	Relation	ri_PartitionRoot;
 
-	/* true if ready for tuple routing */
-	bool		ri_PartitionReadyForRouting;
+	/* Additional information that's specific to partition tuple routing */
+	struct PartitionRoutingInfo *ri_PartitionInfo;
 } ResultRelInfo;
 
 /* ----------------
@@ -1074,6 +1076,12 @@ typedef struct ModifyTableState
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
 
+	/*
+	 * Slot for storing tuples in the root partitioned table's rowtype during
+	 * an UPDATE of a partitioned table.
+	 */
+	TupleTableSlot *mt_root_tuple_slot;
+
 	/* Tuple-routing support info */
 	struct PartitionTupleRouting *mt_partition_tuple_routing;
 
-- 
2.16.2.windows.1

v17-0002-Delay-locking-of-partitions-during-INSERT-and-UP.patch
From 110cfe16f685a9529fafa8d979a1b81e6eea57e5 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Fri, 9 Nov 2018 10:20:14 +1300
Subject: [PATCH v17 2/2] Delay locking of partitions during INSERT and UPDATE

During INSERT, even if we were inserting a single row into a partitioned
table, we would obtain a lock on every partition which was a direct or
an indirect partition of the insert target table.  This was done in order
to provide a consistent order to the locking of the partitions, which happens
to be the same order that partitions are locked during planning.  The
problem with locking all these partitions was that if a partitioned table
had many partitions and the INSERT inserted one, or just a few rows, the
overhead of the locking was significantly more than that of inserting the
actual rows.

This commit changes the locking to only lock partitions the first time we
route a tuple to them, so if you insert one row, then only 1 leaf
partition will be locked, plus any sub-partitioned tables that we search
through before we find the correct home of the tuple.  This does mean that
the locking order of partitions during INSERT becomes less well defined.
Previously, operations such as CREATE INDEX and TRUNCATE, when performed on
leaf partitions, could defend against deadlocking with concurrent INSERTs by
performing the operation in table oid order.  However, to deadlock, such
DDL would have had to be performed inside a transaction and not in table
oid order.  With this commit it's now possible to get deadlocks even if
the DDL is performed in table oid order.  If required, such transactions
can defend against such deadlocks by performing a LOCK TABLE on the
partitioned table before performing the DDL.

Currently, only INSERTs are affected by this change as UPDATEs to a
partitioned table still obtain locks on all partitions either during
planning or during AcquireExecutorLocks(); however, there are upcoming
patches which may change this too.
---
 src/backend/executor/execPartition.c | 14 ++------------
 1 file changed, 2 insertions(+), 12 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 9a685051cc..c472b590f9 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -170,9 +170,6 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * tuple routing for partitioned tables, encapsulates it in
  * PartitionTupleRouting, and returns it.
  *
- * Note that all the relations in the partition tree are locked using the
- * RowExclusiveLock mode upon return from this function.
- *
  * Callers must use the returned PartitionTupleRouting during calls to
  * ExecFindPartition().  The actual ResultRelInfo for a partition is only
  * allocated when the partition is found for the first time.
@@ -183,9 +180,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	PartitionTupleRouting *proute;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/* Lock all the partitions. */
-	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-
 	/*
 	 * Here we attempt to expend as little effort as possible in setting up
 	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
@@ -561,11 +555,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	bool		found_whole_row;
 	int			part_result_rel_index;
 
-	/*
-	 * We locked all the partitions in ExecSetupPartitionTupleRouting
-	 * including the leaf partitions.
-	 */
-	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], RowExclusiveLock);
 
 	/*
 	 * Keep ResultRelInfo and other information for this partition in the
@@ -1013,7 +1003,7 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	int			dispatchidx;
 
 	if (partoid != RelationGetRelid(proute->partition_root))
-		rel = heap_open(partoid, NoLock);
+		rel = heap_open(partoid, RowExclusiveLock);
 	else
 		rel = proute->partition_root;
 	partdesc = RelationGetPartitionDesc(rel);
-- 
2.16.2.windows.1

#61Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: David Rowley (#60)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Thanks for updating the patch.

On 2018/11/14 13:16, David Rowley wrote:

Thanks for looking at this again.

On 14 November 2018 at 13:47, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:

+    if (dispatchidx >= proute->dispatch_allocsize)
+    {
+        /* Expand allocated space. */
+        proute->dispatch_allocsize *= 2;
+        proute->partition_dispatch_info = (PartitionDispatchData **)
+            repalloc(proute->partition_dispatch_info,
+                     sizeof(PartitionDispatchData *) *
+                     proute->dispatch_allocsize);
+    }

Sorry, I forgot to point this out before, but can this code in
ExecInitPartitionDispatchInfo be accommodated in
ExecCheckPartitionArraySpace() for consistency?

I don't really want to put that code in ExecCheckPartitionArraySpace()
as the way the function is now, it makes quite a lot of sense for the
compiler to inline it. If we add redundant work in there, then it
makes less sense. There's never any need to check both arrays at once
as we're only adding the new item to one array at a time.

Instead, I've written a new function named
ExecCheckDispatchArraySpace() and put the resize code inside that.

Okay, seems fine.

I've fixed the typos you mentioned. The only other thing I changed was
to only allocate the PartitionDispatch->tupslot if a conversion is
required. The previous code allocated this regardless if it was going
to be used or not. This saves both the redundant allocation and also
very slightly reduces the cost of the if test in ExecFindPartition().
There's now no need to check if the map != NULL, as if the slot is there,
the map must be too.
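
i.e., in ExecInitPartitionDispatchInfo the patch now does:

	pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
												   tupdesc,
												   gettext_noop("could not convert row type"));
	pd->tupslot = pd->tupmap ? MakeSingleTupleTableSlot(tupdesc) : NULL;

so a non-NULL tupslot implies a non-NULL tupmap.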

Also makes sense.

Although it seems that Alvaro has already started at looking at this, I'll
mark the CF entry as Ready for Committer anyway, because I don't have any
more comments. :)

Thanks,
Amit

#62Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#60)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

What's with this comment?

* Initially we must only set up 1 PartitionDispatch object; the one for
* the partitioned table that's the target of the command. If we must
* route a tuple via some sub-partitioned table, then its
* PartitionDispatch is only built the first time it's required.

You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
odds with the '1' mentioned in the comment. Which is wrong?

(I have a few edits on the patch, so please don't send a full v18 -- a
delta patch would be welcome, if you have further changes to propose.)

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#63David Rowley
david.rowley@2ndquadrant.com
In reply to: Alvaro Herrera (#62)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Thanks for picking this up.

On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What's with this comment?

* Initially we must only set up 1 PartitionDispatch object; the one for
* the partitioned table that's the target of the command. If we must
* route a tuple via some sub-partitioned table, then its
* PartitionDispatch is only built the first time it's required.

You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
odds with the '1' mentioned in the comment. Which is wrong?

I don't think either is wrong, but I guess something must be
misleading, so could perhaps be improved.

We're simply allocating enough space for PARTITION_ROUTING_INITSIZE
items, but we're only initialising 1 item. That leaves space for
PARTITION_ROUTING_INITSIZE - 1 more items before we'd need to
reallocate the array.
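
In code terms it's roughly this (a sketch, not the exact patch code):

	/* Room for PARTITION_ROUTING_INITSIZE PartitionDispatch pointers... */
	proute->partition_dispatch_info = (PartitionDispatchData **)
		palloc(sizeof(PartitionDispatchData *) * PARTITION_ROUTING_INITSIZE);
	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
	proute->num_dispatch = 0;

	/* ...but build only the root table's PartitionDispatch up front. */
	ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);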

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#64Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#63)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Nov-15, David Rowley wrote:

On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What's with this comment?

* Initially we must only set up 1 PartitionDispatch object; the one for
* the partitioned table that's the target of the command. If we must
* route a tuple via some sub-partitioned table, then its
* PartitionDispatch is only built the first time it's required.

You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
odds with the '1' mentioned in the comment. Which is wrong?

I don't think either is wrong, but I guess something must be
misleading, so could perhaps be improved.

Ah, that makes sense. Yeah, it seems a bit misleading to me. No
worries.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#65Amit Langote
Langote_Amit_f8@lab.ntt.co.jp
In reply to: Alvaro Herrera (#64)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018/11/15 8:58, Alvaro Herrera wrote:

On 2018-Nov-15, David Rowley wrote:

On 15 November 2018 at 07:10, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

What's with this comment?

* Initially we must only set up 1 PartitionDispatch object; the one for
* the partitioned table that's the target of the command. If we must
* route a tuple via some sub-partitioned table, then its
* PartitionDispatch is only built the first time it's required.

You're setting the allocsize to PARTITION_ROUTING_INITSIZE, which is at
odds with the '1' mentioned in the comment. Which is wrong?

I don't think either is wrong, but I guess something must be
misleading, so could perhaps be improved.

Ah, that makes sense. Yeah, it seems a bit misleading to me. No
worries.

Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size?

Thanks,
Amit

#66Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#65)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Nov-15, Amit Langote wrote:

Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size?

Here's a proposed delta on v17 0001. Most importantly, I noticed that
the hashed subplans stuff didn't actually work, because the hash API was
not being used correctly. So the search in the hash would never return
a hit, and we'd create RRIs for those partitions again. To fix this, I
added a new struct to hold hash entries.
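
With dynahash, the entry itself must begin with the key, and hash_search
returns a pointer to the entry rather than to the stored ResultRelInfo.
So the fix is along these lines (a sketch; names are illustrative):

	typedef struct SubplanResultRelHashElem
	{
		Oid			relid;			/* hash key -- must be first */
		ResultRelInfo *rri;
	} SubplanResultRelHashElem;

	/* in ExecHashSubPlanResultRelsByOid */
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(SubplanResultRelHashElem);

	elem = (SubplanResultRelHashElem *)
		hash_search(htab, &partoid, HASH_ENTER, &found);
	if (!found)
		elem->rri = rri;

	/* and in ExecFindPartition, the lookup becomes */
	elem = (SubplanResultRelHashElem *)
		hash_search(proute->subplan_resultrel_hash, &partoid, HASH_FIND, NULL);
	rri = elem ? elem->rri : NULL;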

I think this merits that the performance tests be redone. (Unless I
misunderstand, this shouldn't change the performance of INSERT, only
that of UPDATE.)

On the subject of the ArraySpace routines, I decided to drop them and
instead do the re-allocations in the places where they were needed.
In the original code there were two places for the partitions array, but
both did the same thing so it made sense to create a separate routine to
do it instead (ExecRememberPartitionRel), and do the allocation there.
Just out of custom I moved the palloc to appear at the same place as the
repalloc, and after doing so it became obvious that we were
over-allocating memory for the PartitionDispatchData pointer --
allocating the size for the whole struct instead of just the pointer.
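
That is, presumably the initial allocation changed along these lines
(sketch):

	/* before: allocated whole structs where only pointers are needed */
	proute->partition_dispatch_info = (PartitionDispatchData **)
		palloc(sizeof(PartitionDispatchData) * proute->max_dispatch);

	/* after: one pointer per array element */
	proute->partition_dispatch_info = (PartitionDispatchData **)
		palloc(sizeof(PartitionDispatchData *) * proute->max_dispatch);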

(I renamed the "allocsize" struct members to "max", as is customary.)

I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop. It
shouldn't be useful if the code is correct, but if there are bugs it's
better to be able to interrupt infinite loops :-)
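
Roughly, the descent loop now begins each iteration with:

	for (;;)
	{
		CHECK_FOR_INTERRUPTS();

		/* ... evaluate the partition key and descend one level ... */
	}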

I reindented the comment atop PartitionTupleRouting. The other way was
just too unwieldy.

Let me know what you think. Regression tests still pass for me.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v18-delta.patch
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 32d2461528..22a814bcbe 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1343,7 +1343,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 
 	resultRelInfo->ri_PartitionCheck = partition_check;
 	resultRelInfo->ri_PartitionRoot = partition_root;
-	resultRelInfo->ri_PartitionInfo = NULL; /* May be set later */
+	resultRelInfo->ri_PartitionInfo = NULL; /* may be set later */
 }
 
 /*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index b2d394676f..592daab1be 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -38,48 +38,46 @@
  * route a tuple inserted into a partitioned table to one of its leaf
  * partitions.
  *
- * partition_root			The partitioned table that's the target of the
- *							command.
+ * partition_root
+ *		The partitioned table that's the target of the command.
  *
- * partition_dispatch_info	Array of 'dispatch_allocsize' elements containing
- *							a pointer to a PartitionDispatch object for every
- *							partitioned table touched by tuple routing.  The
- *							entry for the target partitioned table is *always*
- *							present in the 0th element of this array.  See
- *							comment for PartitionDispatchData->indexes for
- *							details on how this array is indexed.
+ * partition_dispatch_info
+ *		Array of 'max_dispatch' elements containing a pointer to a
+ *		PartitionDispatch object for every partitioned table touched by tuple
+ *		routing.  The entry for the target partitioned table is *always*
+ *		present in the 0th element of this array.  See comment for
+ *		PartitionDispatchData->indexes for details on how this array is
+ *		indexed.
  *
- * num_dispatch				The current number of items stored in the
- *							'partition_dispatch_info' array.  Also serves as
- *							the index of the next free array element for new
- *							PartitionDispatch objects that need to be stored.
+ * num_dispatch
+ *		The current number of items stored in the 'partition_dispatch_info'
+ *		array.  Also serves as the index of the next free array element for
+ *		new PartitionDispatch objects that need to be stored.
  *
- * dispatch_allocsize		The current allocated size of the
- *							'partition_dispatch_info' array.
+ * max_dispatch
+ *		The current allocated size of the 'partition_dispatch_info' array.
  *
- * partitions				Array of 'partitions_allocsize' elements
- *							containing a pointer to a ResultRelInfo for every
- *							leaf partitions touched by tuple routing.  Some of
- *							these are pointers to ResultRelInfos which are
- *							borrowed out of 'subplan_resultrel_hash'.  The
- *							remainder have been built especially for tuple
- *							routing.  See comment for
- *							PartitionDispatchData->indexes for details on how
- *							this array is indexed.
+ * partitions
+ *		Array of 'max_partitions' elements containing a pointer to a
+ *		ResultRelInfo for every leaf partition touched by tuple routing.
+ *		Some of these are pointers to ResultRelInfos which are borrowed out of
+ *		'subplan_resultrel_hash'.  The remainder have been built especially
+ *		for tuple routing.  See comment for PartitionDispatchData->indexes for
+ *		details on how this array is indexed.
  *
- * num_partitions			The current number of items stored in the
- *							'partitions' array.  Also serves as the index of
- *							the next free array element for new ResultRelInfo
- *							objects that need to be stored.
+ * num_partitions
+ *		The current number of items stored in the 'partitions' array.  Also
+ *		serves as the index of the next free array element for new
+ *		ResultRelInfo objects that need to be stored.
  *
- * partitions_allocsize		The current allocated size of the 'partitions'
- *							array.
+ * max_partitions
+ *		The current allocated size of the 'partitions' array.
  *
- * subplan_resultrel_hash	Hash table to store subplan ResultRelInfos by Oid.
- *							This is used to cache ResultRelInfos from subplans
- *							of an UPDATE ModifyTable node.  Some of these may
- *							be useful for tuple routing to save having to build
- *							duplicates.
+ * subplan_resultrel_hash
+ *		Hash table to store subplan ResultRelInfos by Oid.  This is used to
+ *		cache ResultRelInfos from subplans of an UPDATE ModifyTable node;
+ *		NULL in other cases.  Some of these may be useful for tuple routing
+ *		to save having to build duplicates.
  *-----------------------
  */
 typedef struct PartitionTupleRouting
@@ -87,10 +85,10 @@ typedef struct PartitionTupleRouting
 	Relation	partition_root;
 	PartitionDispatch *partition_dispatch_info;
 	int			num_dispatch;
-	int			dispatch_allocsize;
+	int			max_dispatch;
 	ResultRelInfo **partitions;
 	int			num_partitions;
-	int			partitions_allocsize;
+	int			max_partitions;
 	HTAB	   *subplan_resultrel_hash;
 } PartitionTupleRouting;
 
@@ -132,11 +130,16 @@ typedef struct PartitionDispatchData
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
+/* struct to hold result relations coming from UPDATE subplans */
+typedef struct SubplanResultRelHashElem
+{
+	Oid		relid;		/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
+
 
 static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 							   PartitionTupleRouting *proute);
-static void ExecCheckPartitionArraySpace(PartitionTupleRouting *proute);
-static void ExecCheckDispatchArraySpace(PartitionTupleRouting *proute);
 static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
 					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
@@ -147,6 +150,9 @@ static void ExecInitRoutingInfo(ModifyTableState *mtstate,
 					ResultRelInfo *partRelInfo);
 static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
 							  Oid partoid, PartitionDispatch parent_pd, int partidx);
+static void ExecRememberPartitionRel(EState *estate, PartitionTupleRouting *proute,
+						 int partidx, ResultRelInfo *rri,
+						 PartitionDispatch dispatch);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -192,39 +198,17 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 * demand, only when we actually need to route a tuple to that partition.
 	 * The reason for this is that a common case is for INSERT to insert a
 	 * single tuple into a partitioned table and this must be fast.
-	 *
-	 * We initially size the 'partition_dispatch_info' and 'partitions' arrays
-	 * to allow storage of PARTITION_ROUTING_INITSIZE pointers.  If we route
-	 * tuples to more than this many partitions or through more than that many
-	 * sub-partitioned tables then we'll need to increase the size of these
-	 * arrays.
-	 *
-	 * Initially we must only set up 1 PartitionDispatch object; the one for
-	 * the partitioned table that's the target of the command.  If we must
-	 * route a tuple via some sub-partitioned table, then its
-	 * PartitionDispatch is only built the first time it's required.
 	 */
-	proute = (PartitionTupleRouting *) palloc(sizeof(PartitionTupleRouting));
+	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
 	proute->partition_root = rel;
-	proute->partition_dispatch_info = (PartitionDispatchData **)
-		palloc(sizeof(PartitionDispatchData) * PARTITION_ROUTING_INITSIZE);
-	proute->num_dispatch = 0;
-	proute->dispatch_allocsize = PARTITION_ROUTING_INITSIZE;
-
-	proute->partitions = (ResultRelInfo **)
-		palloc(sizeof(ResultRelInfo *) * PARTITION_ROUTING_INITSIZE);
-
-	/* Mark that no items are yet stored in the 'partitions' array. */
-	proute->num_partitions = 0;
-	proute->partitions_allocsize = PARTITION_ROUTING_INITSIZE;
+	/* Rest of members initialized by zeroing */
 
 	/*
 	 * Initialize this table's PartitionDispatch object.  Here we pass in the
 	 * parent as NULL as we don't need to care about any parent of the target
 	 * partitioned table.
 	 */
-	(void) ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL,
-										 0);
+	ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
 
 	/*
 	 * If performing an UPDATE with tuple routing, we can reuse partition
@@ -236,8 +220,6 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 	 */
 	if (node && node->operation == CMD_UPDATE)
 		ExecHashSubPlanResultRelsByOid(mtstate, proute);
-	else
-		proute->subplan_resultrel_hash = NULL;
 
 	return proute;
 }
@@ -292,6 +274,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 		AttrNumber *map = dispatch->tupmap;
 		int			partidx = -1;
 
+		CHECK_FOR_INTERRUPTS();
+
 		rel = dispatch->reldesc;
 		partdesc = dispatch->partdesc;
 
@@ -319,7 +303,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		/*
 		 * If this partitioned table has no partitions or no partition for
-		 * these values, then error out.
+		 * these values, error out.
 		 */
 		if (partdesc->nparts == 0 ||
 			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
@@ -333,7 +317,9 @@ ExecFindPartition(ModifyTableState *mtstate,
 					(errcode(ERRCODE_CHECK_VIOLATION),
 					 errmsg("no partition of relation \"%s\" found for row",
 							RelationGetRelationName(rel)),
-					 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+					 val_desc ?
+					 errdetail("Partition key of the failing row contains %s.",
+							   val_desc) : 0));
 		}
 
 		if (partdesc->is_leaf[partidx])
@@ -352,55 +338,39 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 			else
 			{
-				int			rri_index = -1;
+				bool		found = false;
 
 				/*
-				 * A ResultRelInfo has not been set up for this partition yet,
-				 * so either use one of the sub-plan result rels or build a
-				 * new one.
+				 * We have not yet set up a ResultRelInfo for this partition,
+				 * but if we have a subplan hash table, we might have one
+				 * there.  If not, we'll have to create one.
 				 */
 				if (proute->subplan_resultrel_hash)
 				{
 					Oid			partoid = partdesc->oids[partidx];
+					SubplanResultRelHashElem   *elem;
 
-					rri = hash_search(proute->subplan_resultrel_hash,
-									  &partoid, HASH_FIND, NULL);
-
-					if (rri)
+					elem = hash_search(proute->subplan_resultrel_hash,
+									   &partoid, HASH_FIND, NULL);
+					if (elem)
 					{
-						/* Found one! */
+						found = true;
+						rri = elem->rri;
 
 						/* Verify this ResultRelInfo allows INSERTs */
 						CheckValidResultRel(rri, CMD_INSERT);
 
-						/* This shouldn't have be set up yet */
-						Assert(rri->ri_PartitionInfo == NULL);
-
 						/* Set up the PartitionRoutingInfo for it */
 						ExecInitRoutingInfo(mtstate, estate, rri);
-
-						rri_index = proute->num_partitions++;
-						dispatch->indexes[partidx] = rri_index;
-
-						ExecCheckPartitionArraySpace(proute);
-
-						/*
-						 * Store it in the partitions array so we don't have
-						 * to look it up again.
-						 */
-						proute->partitions[rri_index] = rri;
+						ExecRememberPartitionRel(estate, proute, partidx, rri, dispatch);
 					}
 				}
 
 				/* We need to create a new one. */
-				if (rri_index < 0)
-				{
-					MemoryContextSwitchTo(oldcxt);
+				if (!found)
 					rri = ExecInitPartitionInfo(mtstate, rootResultRelInfo,
 												proute, estate,
 												dispatch, partidx);
-					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
-				}
 			}
 
 			/* Release the tuple in the lowest parent's dedicated slot. */
@@ -460,38 +430,31 @@ static void
 ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 							   PartitionTupleRouting *proute)
 {
-	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	ResultRelInfo *subplan_result_rels;
 	HASHCTL		ctl;
 	HTAB	   *htab;
-	int			nsubplans;
 	int			i;
 
-	subplan_result_rels = mtstate->resultRelInfo;
-	nsubplans = list_length(node->plans);
-
 	memset(&ctl, 0, sizeof(ctl));
 	ctl.keysize = sizeof(Oid);
-	ctl.entrysize = sizeof(ResultRelInfo **);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
 	ctl.hcxt = CurrentMemoryContext;
 
-	htab = hash_create("PartitionTupleRouting table", nsubplans, &ctl,
-					   HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	htab = hash_create("PartitionTupleRouting table", mtstate->mt_nplans,
+					   &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
 	proute->subplan_resultrel_hash = htab;
 
 	/* Hash all subplans by their Oid */
-	for (i = 0; i < nsubplans; i++)
+	for (i = 0; i < mtstate->mt_nplans; i++)
 	{
-		ResultRelInfo *rri = &subplan_result_rels[i];
+		ResultRelInfo *rri = &mtstate->resultRelInfo[i];
 		bool		found;
 		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
-		ResultRelInfo **subplanrri;
+		SubplanResultRelHashElem   *elem;
 
-		subplanrri = (ResultRelInfo **) hash_search(htab, &partoid, HASH_ENTER,
-													&found);
-
-		if (!found)
-			*subplanrri = rri;
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
+		Assert(!found);
+		elem->rri = rri;
 
 		/*
 		 * This is required in order to convert the partition's tuple to be
@@ -503,41 +466,6 @@ ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
 }
 
 /*
- * ExecCheckPartitionArraySpace
- *		Ensure there's enough space in the proute->partitions array
- */
-static void
-ExecCheckPartitionArraySpace(PartitionTupleRouting *proute)
-{
-	if (proute->num_partitions >= proute->partitions_allocsize)
-	{
-		proute->partitions_allocsize *= 2;
-		proute->partitions = (ResultRelInfo **)
-			repalloc(proute->partitions, sizeof(ResultRelInfo *) *
-					 proute->partitions_allocsize);
-	}
-}
-
-/*
- * ExecCheckDispatchArraySpace
- *		Ensure there's enough space in the proute->partition_dispatch_info
- *		array.
- */
-static void
-ExecCheckDispatchArraySpace(PartitionTupleRouting *proute)
-{
-	if (proute->num_dispatch >= proute->dispatch_allocsize)
-	{
-		/* Expand allocated space. */
-		proute->dispatch_allocsize *= 2;
-		proute->partition_dispatch_info = (PartitionDispatchData **)
-			repalloc(proute->partition_dispatch_info,
-				sizeof(PartitionDispatchData *) *
-				proute->dispatch_allocsize);
-	}
-}
-
-/*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
  *		and store it in the next empty slot in the proute->partitions array.
@@ -559,7 +487,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	MemoryContext oldContext;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
-	int			part_result_rel_index;
 
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
@@ -903,13 +830,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	part_result_rel_index = proute->num_partitions++;
-	dispatch->indexes[partidx] = part_result_rel_index;
-
-	ExecCheckPartitionArraySpace(proute);
-
-	/* Save here for later use. */
-	proute->partitions[part_result_rel_index] = leaf_part_rri;
+	ExecRememberPartitionRel(estate, proute, partidx, leaf_part_rri, dispatch);
 
 	MemoryContextSwitchTo(oldContext);
 
@@ -1018,8 +939,8 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 		rel = proute->partition_root;
 	partdesc = RelationGetPartitionDesc(rel);
 
-	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes)
-									+ (partdesc->nparts * sizeof(int)));
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
+									partdesc->nparts * sizeof(int));
 	pd->reldesc = rel;
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
@@ -1045,8 +966,8 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	else
 	{
 		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
 		pd->tupmap = NULL;
+		pd->tupslot = NULL;
 	}
 
 	/*
@@ -1055,25 +976,79 @@ ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
 	 */
 	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
+	/* Track in PartitionTupleRouting for later use */
 	dispatchidx = proute->num_dispatch++;
 
-	ExecCheckDispatchArraySpace(proute);
-
-	/* Save here for later use. */
+	/* Allocate or enlarge the array, as needed */
+	if (proute->num_dispatch >= proute->max_dispatch)
+	{
+		if (proute->max_dispatch == 0)
+		{
+			proute->max_dispatch = PARTITION_ROUTING_INITSIZE;
+			proute->partition_dispatch_info = (PartitionDispatch *)
+				palloc(sizeof(PartitionDispatch) * proute->max_dispatch);
+		}
+		else
+		{
+			proute->max_dispatch *= 2;
+			proute->partition_dispatch_info = (PartitionDispatch *)
+				repalloc(proute->partition_dispatch_info,
+						 sizeof(PartitionDispatch) * proute->max_dispatch);
+		}
+	}
 	proute->partition_dispatch_info[dispatchidx] = pd;
 
 	/*
 	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
-	 * install the link to allow us to descend the partition hierarchy for
-	 * future searches
+	 * install a downlink in the parent to allow quick descent.
 	 */
 	if (parent_pd)
+	{
+		Assert(parent_pd->indexes[partidx] == -1);
 		parent_pd->indexes[partidx] = dispatchidx;
+	}
 
 	return pd;
 }
 
 /*
+ * Store the given ResultRelInfo as corresponding to partition partidx in
+ * proute, tracking which array item was used in dispatch->indexes.
+ */
+static void
+ExecRememberPartitionRel(EState *estate, PartitionTupleRouting *proute, int partidx,
+						 ResultRelInfo *rri, PartitionDispatch dispatch)
+{
+	int		rri_index;
+
+	Assert(dispatch->indexes[partidx] == -1);
+
+	rri_index = proute->num_partitions++;
+
+	/* Allocate or enlarge the array, as needed */
+	if (proute->num_partitions >= proute->max_partitions)
+	{
+		if (proute->max_partitions == 0)
+		{
+			proute->max_partitions = PARTITION_ROUTING_INITSIZE;
+			proute->partitions = (ResultRelInfo **)
+				MemoryContextAlloc(estate->es_query_cxt,
+								   sizeof(ResultRelInfo *) * proute->max_partitions);
+		}
+		else
+		{
+			proute->max_partitions *= 2;
+			proute->partitions = (ResultRelInfo **)
+				repalloc(proute->partitions, sizeof(ResultRelInfo *) *
+						 proute->max_partitions);
+		}
+	}
+
+	proute->partitions[rri_index] = rri;
+	dispatch->indexes[partidx] = rri_index;
+}
+
+/*
  * ExecCleanupTupleRouting -- Clean up objects allocated for partition tuple
  * routing.
  *
@@ -1107,7 +1082,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-
 		/*
 		 * Check if this result rel is one belonging to the node's subplans,
 		 * if so, let ExecEndPlan() clean it up.
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 423118cbbc..de27d88e63 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -34,7 +34,6 @@
 
 struct PlanState;				/* forward references in this file */
 struct PartitionRoutingInfo;
-struct PartitionTupleRouting;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -471,7 +470,7 @@ typedef struct ResultRelInfo
 	/* relation descriptor for root partitioned table */
 	Relation	ri_PartitionRoot;
 
-	/* Additional information that's specific to partition tuple routing */
+	/* Additional information specific to partition tuple routing */
 	struct PartitionRoutingInfo *ri_PartitionInfo;
 } ResultRelInfo;
 
#67Amit Langote
amitlangote09@gmail.com
In reply to: Alvaro Herrera (#66)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Fri, Nov 16, 2018 at 11:40 AM Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

On 2018-Nov-15, Amit Langote wrote:

Maybe name it PARTITION_INIT_ALLOCSIZE (dropping the ROUTING from it), or
PROUTE_INIT_ALLOCSIZE, to make it clear that it's only allocation size?

Here's a proposed delta on v17 0001. Most importantly, I noticed that
the hashed subplans stuff didn't actually work, because the hash API was
not being used correctly. So the search in the hash would never return
a hit, and we'd create RRIs for those partitions again. To fix this, I
added a new struct to hold hash entries.

I'm a bit surprised that you found that the hash table didn't work; I
remember checking with gdb that it worked when I was hacking on my own
delta patch, but I may have been looking at too many things.

I think this merits that the performance tests be redone. (Unless I
misunderstand, this shouldn't change the performance of INSERT, only
that of UPDATE.)

Actually, I don't remember seeing performance tests done with UPDATEs
on this thread.

Since we don't needlessly scan *all* subplan result rels anymore,
maybe this removes a good deal of overhead for UPDATEs that update
the partition key.

On the subject of the ArraySpace routines, I decided to drop them and
instead do the re-allocations in the places where they were needed.
In the original code there were two places for the partitions array, but
both did the same thing so it made sense to create a separate routine to
do it instead (ExecRememberPartitionRel), and do the allocation there.
Just out of habit I moved the palloc to appear at the same place as the
repalloc, and after doing so it became obvious that we were
over-allocating memory for the PartitionDispatchData pointer --
allocating the size for the whole struct instead of just the pointer.

(I renamed the "allocsize" struct members to "max", as is customary.)

These changes look good to me.

I added CHECK_FOR_INTERRUPTS to the ExecFindPartition loop. It
shouldn't be needed if the code is correct, but if there are bugs it's
better to be able to interrupt infinite loops :-)

Good measure. :)

I reindented the comment atop PartitionTupleRouting. The other way was
just too unwieldy.

Let me know what you think. Regression tests still pass for me.

Overall, it looks good to me.

Thanks,
Amit

#68Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Amit Langote (#67)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

One thing I don't quite like is the inconsistency in handling memory
context switches in the various functions allocating stuff. It seems
rather haphazard. I'd rather have a memcxt member in
PartitionTupleRouting, which is set when the struct is created, and then
have all the other functions allocating stuff use that one.
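
In outline (a sketch; the v19 patch in the next message does it this
way), the creating context gets recorded once, and every subsidiary
allocation switches to it:

	/* at setup time */
	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
	proute->memcxt = CurrentMemoryContext;	/* typically estate->es_query_cxt */

	/* in each function that allocates subsidiary structs */
	oldcxt = MemoryContextSwitchTo(proute->memcxt);
	/* ... palloc() the routing structures here ... */
	MemoryContextSwitchTo(oldcxt);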

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#69Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#68)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Nov-16, Alvaro Herrera wrote:

One thing I don't quite like is the inconsistency in handling memory
context switches in the various functions allocating stuff. It seems
rather haphazard. I'd rather have a memcxt member in
PartitionTupleRouting, which is set when the struct is created, and then
have all the other functions allocating stuff use that one.

So while researching this I finally realized that there was a "lexical
disconnect" between setting a ResultRelInfo's ri_PartitionInfo
struct/pointer and adding it to the PartitionTupleRoute arrays.
However, if you think about it, these things are one and the same, so we
don't need to do them separately; just merge the new function I wrote
into the existing ExecInitRoutingInfo(). Patch attached.

(This version also rebases across Andres' recent conflicting
TupleTableSlot changes.)

I'll now see about the commit message and push shortly.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments:

v19-Speed-up-INSERT-and-UPDATE-on-partitioned-tables.patch (text/x-diff; charset=us-ascii)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index e62e3d8fba..6588ebd6dc 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2316,6 +2316,7 @@ CopyFrom(CopyState cstate)
 	bool	   *nulls;
 	ResultRelInfo *resultRelInfo;
 	ResultRelInfo *target_resultRelInfo;
+	ResultRelInfo *prevResultRelInfo = NULL;
 	EState	   *estate = CreateExecutorState(); /* for ExecConstraints() */
 	ModifyTableState *mtstate;
 	ExprContext *econtext;
@@ -2331,7 +2332,6 @@ CopyFrom(CopyState cstate)
 	CopyInsertMethod insertMethod;
 	uint64		processed = 0;
 	int			nBufferedTuples = 0;
-	int			prev_leaf_part_index = -1;
 	bool		has_before_insert_row_trig;
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
@@ -2515,8 +2515,12 @@ CopyFrom(CopyState cstate)
 	/*
 	 * If there are any triggers with transition tables on the named relation,
 	 * we need to be prepared to capture transition tuples.
+	 *
+	 * Because partition tuple routing would like to know about whether
+	 * transition capture is active, we also set it in mtstate, which is
+	 * passed to ExecFindPartition() below.
 	 */
-	cstate->transition_capture =
+	cstate->transition_capture = mtstate->mt_transition_capture =
 		MakeTransitionCaptureState(cstate->rel->trigdesc,
 								   RelationGetRelid(cstate->rel),
 								   CMD_INSERT);
@@ -2526,19 +2530,8 @@ CopyFrom(CopyState cstate)
 	 * CopyFrom tuple routing.
 	 */
 	if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
-	{
 		proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
 
-		/*
-		 * If we are capturing transition tuples, they may need to be
-		 * converted from partition format back to partitioned table format
-		 * (this is only ever necessary if a BEFORE trigger modifies the
-		 * tuple).
-		 */
-		if (cstate->transition_capture != NULL)
-			ExecSetupChildParentMapForLeaf(proute);
-	}
-
 	/*
 	 * It's more efficient to prepare a bunch of tuples for insertion, and
 	 * insert them in one heap_multi_insert() call, than call heap_insert()
@@ -2694,25 +2687,17 @@ CopyFrom(CopyState cstate)
 		/* Determine the partition to heap_insert the tuple into */
 		if (proute)
 		{
-			int			leaf_part_index;
 			TupleConversionMap *map;
 
 			/*
-			 * Away we go ... If we end up not finding a partition after all,
-			 * ExecFindPartition() does not return and errors out instead.
-			 * Otherwise, the returned value is to be used as an index into
-			 * arrays mt_partitions[] and mt_partition_tupconv_maps[] that
-			 * will get us the ResultRelInfo and TupleConversionMap for the
-			 * partition, respectively.
+			 * Attempt to find a partition suitable for this tuple.
+			 * ExecFindPartition() will raise an error if none can be found or
+			 * if the found partition is not suitable for INSERTs.
 			 */
-			leaf_part_index = ExecFindPartition(target_resultRelInfo,
-												proute->partition_dispatch_info,
-												slot,
-												estate);
-			Assert(leaf_part_index >= 0 &&
-				   leaf_part_index < proute->num_partitions);
+			resultRelInfo = ExecFindPartition(mtstate, target_resultRelInfo,
+											  proute, slot, estate);
 
-			if (prev_leaf_part_index != leaf_part_index)
+			if (prevResultRelInfo != resultRelInfo)
 			{
 				/* Check if we can multi-insert into this partition */
 				if (insertMethod == CIM_MULTI_CONDITIONAL)
@@ -2725,12 +2710,9 @@ CopyFrom(CopyState cstate)
 					if (nBufferedTuples > 0)
 					{
 						ExprContext *swapcontext;
-						ResultRelInfo *presultRelInfo;
-
-						presultRelInfo = proute->partitions[prev_leaf_part_index];
 
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-											presultRelInfo, myslot, bistate,
+											prevResultRelInfo, myslot, bistate,
 											nBufferedTuples, bufferedTuples,
 											firstBufferedLineNo);
 						nBufferedTuples = 0;
@@ -2787,21 +2769,6 @@ CopyFrom(CopyState cstate)
 					}
 				}
 
-				/*
-				 * Overwrite resultRelInfo with the corresponding partition's
-				 * one.
-				 */
-				resultRelInfo = proute->partitions[leaf_part_index];
-				if (unlikely(resultRelInfo == NULL))
-				{
-					resultRelInfo = ExecInitPartitionInfo(mtstate,
-														  target_resultRelInfo,
-														  proute, estate,
-														  leaf_part_index);
-					proute->partitions[leaf_part_index] = resultRelInfo;
-					Assert(resultRelInfo != NULL);
-				}
-
 				/* Determine which triggers exist on this partition */
 				has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
 											  resultRelInfo->ri_TrigDesc->trig_insert_before_row);
@@ -2827,7 +2794,7 @@ CopyFrom(CopyState cstate)
 				 * buffer when the partition being inserted into changes.
 				 */
 				ReleaseBulkInsertStatePin(bistate);
-				prev_leaf_part_index = leaf_part_index;
+				prevResultRelInfo = resultRelInfo;
 			}
 
 			/*
@@ -2837,7 +2804,7 @@ CopyFrom(CopyState cstate)
 
 			/*
 			 * If we're capturing transition tuples, we might need to convert
-			 * from the partition rowtype to parent rowtype.
+			 * from the partition rowtype to root rowtype.
 			 */
 			if (cstate->transition_capture != NULL)
 			{
@@ -2850,8 +2817,7 @@ CopyFrom(CopyState cstate)
 					 */
 					cstate->transition_capture->tcs_original_insert_tuple = NULL;
 					cstate->transition_capture->tcs_map =
-						TupConvMapForLeaf(proute, target_resultRelInfo,
-										  leaf_part_index);
+						resultRelInfo->ri_PartitionInfo->pi_PartitionToRootMap;
 				}
 				else
 				{
@@ -2865,18 +2831,18 @@ CopyFrom(CopyState cstate)
 			}
 
 			/*
-			 * We might need to convert from the parent rowtype to the
-			 * partition rowtype.
+			 * We might need to convert from the root rowtype to the partition
+			 * rowtype.
 			 */
-			map = proute->parent_child_tupconv_maps[leaf_part_index];
+			map = resultRelInfo->ri_PartitionInfo->pi_RootToPartitionMap;
 			if (map != NULL)
 			{
 				TupleTableSlot *new_slot;
 				MemoryContext oldcontext;
 
-				Assert(proute->partition_tuple_slots != NULL &&
-					   proute->partition_tuple_slots[leaf_part_index] != NULL);
-				new_slot = proute->partition_tuple_slots[leaf_part_index];
+				new_slot = resultRelInfo->ri_PartitionInfo->pi_PartitionTupleSlot;
+				Assert(new_slot != NULL);
+
 				slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 
 				/*
@@ -3021,12 +2987,8 @@ CopyFrom(CopyState cstate)
 	{
 		if (insertMethod == CIM_MULTI_CONDITIONAL)
 		{
-			ResultRelInfo *presultRelInfo;
-
-			presultRelInfo = proute->partitions[prev_leaf_part_index];
-
 			CopyFromInsertBatch(cstate, estate, mycid, hi_options,
-								presultRelInfo, myslot, bistate,
+								prevResultRelInfo, myslot, bistate,
 								nBufferedTuples, bufferedTuples,
 								firstBufferedLineNo);
 		}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 74398eb464..757df0705d 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1345,7 +1345,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
 
 	resultRelInfo->ri_PartitionCheck = partition_check;
 	resultRelInfo->ri_PartitionRoot = partition_root;
-	resultRelInfo->ri_PartitionReadyForRouting = false;
+	resultRelInfo->ri_PartitionInfo = NULL; /* may be set later */
 }
 
 /*
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e11fe68712..5216d0f93b 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -33,21 +33,98 @@
 
 
 /*-----------------------
- * PartitionDispatch - information about one partitioned table in a partition
- * hierarchy required to route a tuple to one of its partitions
+ * PartitionTupleRouting - Encapsulates all information required to
+ * route a tuple inserted into a partitioned table to one of its leaf
+ * partitions.
  *
- *	reldesc		Relation descriptor of the table
- *	key			Partition key information of the table
- *	keystate	Execution state required for expressions in the partition key
- *	partdesc	Partition descriptor of the table
- *	tupslot		A standalone TupleTableSlot initialized with this table's tuple
- *				descriptor
- *	tupmap		TupleConversionMap to convert from the parent's rowtype to
- *				this table's rowtype (when extracting the partition key of a
- *				tuple just before routing it through this table)
- *	indexes		Array with partdesc->nparts members (for details on what
- *				individual members represent, see how they are set in
- *				get_partition_dispatch_recurse())
+ * partition_root
+ *		The partitioned table that's the target of the command.
+ *
+ * partition_dispatch_info
+ *		Array of 'max_dispatch' elements containing a pointer to a
+ *		PartitionDispatch object for every partitioned table touched by tuple
+ *		routing.  The entry for the target partitioned table is *always*
+ *		present in the 0th element of this array.  See comment for
+ *		PartitionDispatchData->indexes for details on how this array is
+ *		indexed.
+ *
+ * num_dispatch
+ *		The current number of items stored in the 'partition_dispatch_info'
+ *		array.  Also serves as the index of the next free array element for
+ *		new PartitionDispatch objects that need to be stored.
+ *
+ * max_dispatch
+ *		The current allocated size of the 'partition_dispatch_info' array.
+ *
+ * partitions
+ *		Array of 'max_partitions' elements containing a pointer to a
+ *		ResultRelInfo for every leaf partition touched by tuple routing.
+ *		Some of these are pointers to ResultRelInfos which are borrowed out of
+ *		'subplan_resultrel_hash'.  The remainder have been built especially
+ *		for tuple routing.  See comment for PartitionDispatchData->indexes for
+ *		details on how this array is indexed.
+ *
+ * num_partitions
+ *		The current number of items stored in the 'partitions' array.  Also
+ *		serves as the index of the next free array element for new
+ *		ResultRelInfo objects that need to be stored.
+ *
+ * max_partitions
+ *		The current allocated size of the 'partitions' array.
+ *
+ * subplan_resultrel_hash
+ *		Hash table to store subplan ResultRelInfos by Oid.  This is used to
+ *		cache ResultRelInfos from subplans of an UPDATE ModifyTable node;
+ *		NULL in other cases.  Some of these may be useful for tuple routing
+ *		to save having to build duplicates.
+ *
+ * memcxt
+ *		Memory context used to allocate subsidiary structs.
+ *-----------------------
+ */
+typedef struct PartitionTupleRouting
+{
+	Relation	partition_root;
+	PartitionDispatch *partition_dispatch_info;
+	int			num_dispatch;
+	int			max_dispatch;
+	ResultRelInfo **partitions;
+	int			num_partitions;
+	int			max_partitions;
+	HTAB	   *subplan_resultrel_hash;
+	MemoryContext memcxt;
+} PartitionTupleRouting;
+
+/*-----------------------
+ * PartitionDispatch - information about one partitioned table in a partition
+ * hierarchy required to route a tuple to any of its partitions.  A
+ * PartitionDispatch is always encapsulated inside a PartitionTupleRouting
+ * struct and stored inside its 'partition_dispatch_info' array.
+ *
+ * reldesc
+ *		Relation descriptor of the table
+ * key
+ *		Partition key information of the table
+ * keystate
+ *		Execution state required for expressions in the partition key
+ * partdesc
+ *		Partition descriptor of the table
+ * tupslot
+ *		A standalone TupleTableSlot initialized with this table's tuple
+ *		descriptor, or NULL if no tuple conversion between the parent is
+ *		descriptor, or NULL if no tuple conversion from the parent is
+ * tupmap
+ *		TupleConversionMap to convert from the parent's rowtype to this table's
+ *		rowtype  (when extracting the partition key of a tuple just before
+ *		routing it through this table). A NULL value is stored if no tuple
+ *		conversion is required.
+ * indexes
+ *		Array of partdesc->nparts elements.  For leaf partitions the index
+ *		corresponds to the partition's ResultRelInfo in the encapsulating
+ *		PartitionTupleRouting's partitions array.  For partitioned partitions,
+ *		the index corresponds to the PartitionDispatch for it in its
+ *		partition_dispatch_info array.  -1 indicates we've not yet allocated
+ *		anything in PartitionTupleRouting for the partition.
  *-----------------------
  */
 typedef struct PartitionDispatchData
@@ -58,14 +135,32 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrNumber *tupmap;
-	int		   *indexes;
+	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 } PartitionDispatchData;
 
+/* struct to hold result relations coming from UPDATE subplans */
+typedef struct SubplanResultRelHashElem
+{
+	Oid		relid;		/* hash key -- must be first */
+	ResultRelInfo *rri;
+} SubplanResultRelHashElem;
 
-static PartitionDispatch *RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids);
-static void get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids);
+
+static void ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute);
+static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
+					  ResultRelInfo *rootResultRelInfo,
+					  PartitionTupleRouting *proute,
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx);
+static void ExecInitRoutingInfo(PartitionTupleRouting *proute,
+					PartitionDispatch dispatch,
+					ModifyTableState *mtstate,
+					EState *estate,
+					ResultRelInfo *partRelInfo,
+					int partidx);
+static PartitionDispatch ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute,
+							  Oid partoid, PartitionDispatch parent_pd, int partidx);
 static void FormPartitionKeyDatum(PartitionDispatch pd,
 					  TupleTableSlot *slot,
 					  EState *estate,
@@ -92,131 +187,84 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
  * Note that all the relations in the partition tree are locked using the
  * RowExclusiveLock mode upon return from this function.
  *
- * While we allocate the arrays of pointers of ResultRelInfo and
- * TupleConversionMap for all partitions here, actual objects themselves are
- * lazily allocated for a given partition if a tuple is actually routed to it;
- * see ExecInitPartitionInfo.  However, if the function is invoked for update
- * tuple routing, caller would already have initialized ResultRelInfo's for
- * some of the partitions, which are reused and assigned to their respective
- * slot in the aforementioned array.  For such partitions, we delay setting
- * up objects such as TupleConversionMap until those are actually chosen as
- * the partitions to route tuples to.  See ExecPrepareTupleRouting.
+ * Callers must use the returned PartitionTupleRouting during calls to
+ * ExecFindPartition().  The actual ResultRelInfo for a partition is only
+ * allocated when the partition is found for the first time.
+ *
+ * The current memory context is used to allocate this struct and all
+ * subsidiary structs that will be allocated from it later on.  Typically
+ * it should be estate->es_query_cxt.
  */
 PartitionTupleRouting *
 ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
 {
-	List	   *leaf_parts;
-	ListCell   *cell;
-	int			i;
-	ResultRelInfo *update_rri = NULL;
-	int			num_update_rri = 0,
-				update_rri_index = 0;
 	PartitionTupleRouting *proute;
-	int			nparts;
 	ModifyTable *node = mtstate ? (ModifyTable *) mtstate->ps.plan : NULL;
 
-	/*
-	 * Get the information about the partition tree after locking all the
-	 * partitions.
-	 */
+	/* Lock all the partitions. */
 	(void) find_all_inheritors(RelationGetRelid(rel), RowExclusiveLock, NULL);
-	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
-	proute->partition_dispatch_info =
-		RelationGetPartitionDispatchInfo(rel, &proute->num_dispatch,
-										 &leaf_parts);
-	proute->num_partitions = nparts = list_length(leaf_parts);
-	proute->partitions =
-		(ResultRelInfo **) palloc(nparts * sizeof(ResultRelInfo *));
-	proute->parent_child_tupconv_maps =
-		(TupleConversionMap **) palloc0(nparts * sizeof(TupleConversionMap *));
-	proute->partition_oids = (Oid *) palloc(nparts * sizeof(Oid));
-
-	/* Set up details specific to the type of tuple routing we are doing. */
-	if (node && node->operation == CMD_UPDATE)
-	{
-		update_rri = mtstate->resultRelInfo;
-		num_update_rri = list_length(node->plans);
-		proute->subplan_partition_offsets =
-			palloc(num_update_rri * sizeof(int));
-		proute->num_subplan_partition_offsets = num_update_rri;
-
-		/*
-		 * We need an additional tuple slot for storing transient tuples that
-		 * are converted to the root table descriptor.
-		 */
-		proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel),
-													 &TTSOpsHeapTuple);
-	}
-
-	i = 0;
-	foreach(cell, leaf_parts)
-	{
-		ResultRelInfo *leaf_part_rri = NULL;
-		Oid			leaf_oid = lfirst_oid(cell);
-
-		proute->partition_oids[i] = leaf_oid;
-
-		/*
-		 * If the leaf partition is already present in the per-subplan result
-		 * rels, we re-use that rather than initialize a new result rel. The
-		 * per-subplan resultrels and the resultrels of the leaf partitions
-		 * are both in the same canonical order. So while going through the
-		 * leaf partition oids, we need to keep track of the next per-subplan
-		 * result rel to be looked for in the leaf partition resultrels.
-		 */
-		if (update_rri_index < num_update_rri &&
-			RelationGetRelid(update_rri[update_rri_index].ri_RelationDesc) == leaf_oid)
-		{
-			leaf_part_rri = &update_rri[update_rri_index];
-
-			/*
-			 * This is required in order to convert the partition's tuple to
-			 * be compatible with the root partitioned table's tuple
-			 * descriptor.  When generating the per-subplan result rels, this
-			 * was not set.
-			 */
-			leaf_part_rri->ri_PartitionRoot = rel;
-
-			/* Remember the subplan offset for this ResultRelInfo */
-			proute->subplan_partition_offsets[update_rri_index] = i;
-
-			update_rri_index++;
-		}
-
-		proute->partitions[i] = leaf_part_rri;
-		i++;
-	}
 
 	/*
-	 * For UPDATE, we should have found all the per-subplan resultrels in the
-	 * leaf partitions.  (If this is an INSERT, both values will be zero.)
+	 * Here we attempt to expend as little effort as possible in setting up
+	 * the PartitionTupleRouting.  Each partition's ResultRelInfo is built on
+	 * demand, only when we actually need to route a tuple to that partition.
+	 * The reason for this is that a common case is for INSERT to insert a
+	 * single tuple into a partitioned table and this must be fast.
 	 */
-	Assert(update_rri_index == num_update_rri);
+	proute = (PartitionTupleRouting *) palloc0(sizeof(PartitionTupleRouting));
+	proute->partition_root = rel;
+	proute->memcxt = CurrentMemoryContext;
+	/* Rest of members initialized by zeroing */
+
+	/*
+	 * Initialize this table's PartitionDispatch object.  Here we pass in the
+	 * parent as NULL as we don't need to care about any parent of the target
+	 * partitioned table.
+	 */
+	ExecInitPartitionDispatchInfo(proute, RelationGetRelid(rel), NULL, 0);
+
+	/*
+	 * If performing an UPDATE with tuple routing, we can reuse partition
+	 * sub-plan result rels.  We build a hash table to map the OIDs of
+	 * partitions present in mtstate->resultRelInfo to their ResultRelInfos.
+	 * Every time a tuple is routed to a partition that we've yet to set the
+	 * ResultRelInfo for, before we go to the trouble of making one, we check
+	 * for a pre-made one in the hash table.
+	 */
+	if (node && node->operation == CMD_UPDATE)
+		ExecHashSubPlanResultRelsByOid(mtstate, proute);
 
 	return proute;
 }
 
 /*
- * ExecFindPartition -- Find a leaf partition in the partition tree rooted
- * at parent, for the heap tuple contained in *slot
+ * ExecFindPartition -- Find and return, or build and return the ResultRelInfo
+ * for the leaf partition that the tuple contained in *slot should belong to.
+ *
+ * If the partition's ResultRelInfo does not yet exist in 'proute' then we set
+ * one up or reuse one from mtstate's resultRelInfo array.  When reusing a
+ * ResultRelInfo from the mtstate we verify that the relation is a valid
+ * target for INSERTs and then set up a PartitionRoutingInfo for it.
  *
  * estate must be non-NULL; we'll need it to compute any expressions in the
- * partition key(s)
+ * partition keys.  Also, its per-tuple context is used.
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message, else it returns the leaf partition sequence number
- * as an index into the array of (ResultRelInfos of) all leaf partitions in
- * the partition tree.
+ * error message.  An error may also be raised if the found target partition is
+ * not a valid target for an INSERT.
  */
-int
-ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
+ResultRelInfo *
+ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot, EState *estate)
 {
-	int			result;
+	PartitionDispatch *pd = proute->partition_dispatch_info;
 	Datum		values[PARTITION_MAX_KEYS];
 	bool		isnull[PARTITION_MAX_KEYS];
 	Relation	rel;
 	PartitionDispatch dispatch;
+	PartitionDesc partdesc;
 	ExprContext *ecxt = GetPerTupleExprContext(estate);
 	TupleTableSlot *ecxt_scantuple_old = ecxt->ecxt_scantuple;
 	TupleTableSlot *myslot = NULL;
@@ -229,25 +277,31 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 	 * First check the root table's partition constraint, if any.  No point in
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
-	if (resultRelInfo->ri_PartitionCheck)
-		ExecPartitionCheck(resultRelInfo, slot, estate, true);
+	if (rootResultRelInfo->ri_PartitionCheck)
+		ExecPartitionCheck(rootResultRelInfo, slot, estate, true);
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
 	while (true)
 	{
 		AttrNumber *map = dispatch->tupmap;
-		int			cur_index = -1;
+		int			partidx = -1;
+
+		CHECK_FOR_INTERRUPTS();
 
 		rel = dispatch->reldesc;
+		partdesc = dispatch->partdesc;
 
 		/*
 		 * Convert the tuple to this parent's layout, if different from the
 		 * current relation.
 		 */
 		myslot = dispatch->tupslot;
-		if (myslot != NULL && map != NULL)
+		if (myslot != NULL)
+		{
+			Assert(map != NULL);
 			slot = execute_attr_map_slot(map, slot, myslot);
+		}
 
 		/*
 		 * Extract partition key from tuple. Expression evaluation machinery
@@ -261,97 +315,196 @@ ExecFindPartition(ResultRelInfo *resultRelInfo, PartitionDispatch *pd,
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
 		/*
-		 * Nothing for get_partition_for_tuple() to do if there are no
-		 * partitions to begin with.
+		 * If this partitioned table has no partitions or no partition for
+		 * these values, error out.
 		 */
-		if (dispatch->partdesc->nparts == 0)
+		if (partdesc->nparts == 0 ||
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
-			result = -1;
-			break;
+			char	   *val_desc;
+
+			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+															values, isnull, 64);
+			Assert(OidIsValid(RelationGetRelid(rel)));
+			ereport(ERROR,
+					(errcode(ERRCODE_CHECK_VIOLATION),
+					 errmsg("no partition of relation \"%s\" found for row",
+							RelationGetRelationName(rel)),
+					 val_desc ?
+					 errdetail("Partition key of the failing row contains %s.",
+							   val_desc) : 0));
 		}
 
-		cur_index = get_partition_for_tuple(dispatch, values, isnull);
+		if (partdesc->is_leaf[partidx])
+		{
+			ResultRelInfo *rri;
 
-		/*
-		 * cur_index < 0 means we failed to find a partition of this parent.
-		 * cur_index >= 0 means we either found the leaf partition, or the
-		 * next parent to find a partition of.
-		 */
-		if (cur_index < 0)
-		{
-			result = -1;
-			break;
-		}
-		else if (dispatch->indexes[cur_index] >= 0)
-		{
-			result = dispatch->indexes[cur_index];
-			/* success! */
-			break;
+			/*
+			 * Look to see if we've already got a ResultRelInfo for this
+			 * partition.
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* ResultRelInfo already built */
+				Assert(dispatch->indexes[partidx] < proute->num_partitions);
+				rri = proute->partitions[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				bool		found = false;
+
+				/*
+				 * We have not yet set up a ResultRelInfo for this partition,
+				 * but if we have a subplan hash table, we might have one
+				 * there.  If not, we'll have to create one.
+				 */
+				if (proute->subplan_resultrel_hash)
+				{
+					Oid			partoid = partdesc->oids[partidx];
+					SubplanResultRelHashElem   *elem;
+
+					elem = hash_search(proute->subplan_resultrel_hash,
+									   &partoid, HASH_FIND, NULL);
+					if (elem)
+					{
+						found = true;
+						rri = elem->rri;
+
+						/* Verify this ResultRelInfo allows INSERTs */
+						CheckValidResultRel(rri, CMD_INSERT);
+
+						/* Set up the PartitionRoutingInfo for it */
+						ExecInitRoutingInfo(proute, dispatch, mtstate, estate,
+											rri, partidx);
+					}
+				}
+
+				/* We need to create a new one. */
+				if (!found)
+					rri = ExecInitPartitionInfo(mtstate, rootResultRelInfo,
+												proute, estate,
+												dispatch, partidx);
+			}
+
+			/* Release the tuple in the lowest parent's dedicated slot. */
+			if (slot == myslot)
+				ExecClearTuple(myslot);
+
+			MemoryContextSwitchTo(oldcxt);
+			ecxt->ecxt_scantuple = ecxt_scantuple_old;
+			return rri;
 		}
 		else
 		{
-			/* move down one level */
-			dispatch = pd[-dispatch->indexes[cur_index]];
+			/*
+			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 */
+			if (likely(dispatch->indexes[partidx] >= 0))
+			{
+				/* Already built. */
+				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
+
+				/*
+				 * Move down to the next partition level and search again
+				 * until we find a leaf partition that matches this tuple
+				 */
+				dispatch = pd[dispatch->indexes[partidx]];
+			}
+			else
+			{
+				/* Not yet built. Do that now. */
+				PartitionDispatch subdispatch;
+
+				/*
+				 * Create the new PartitionDispatch.  We pass the current one
+				 * in as the parent PartitionDispatch
+				 */
+				subdispatch = ExecInitPartitionDispatchInfo(proute,
+															partdesc->oids[partidx],
+															dispatch, partidx);
+				Assert(dispatch->indexes[partidx] >= 0 &&
+					   dispatch->indexes[partidx] < proute->num_dispatch);
+				dispatch = subdispatch;
+			}
 		}
 	}
+}
 
-	/* Release the tuple in the lowest parent's dedicated slot. */
-	if (slot == myslot)
-		ExecClearTuple(myslot);
+/*
+ * ExecHashSubPlanResultRelsByOid
+ *		Build a hash table to allow fast lookups of subplan ResultRelInfos by
+ *		partition Oid.  We also populate the subplan ResultRelInfo with an
+ *		ri_PartitionRoot.
+ */
+static void
+ExecHashSubPlanResultRelsByOid(ModifyTableState *mtstate,
+							   PartitionTupleRouting *proute)
+{
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	int			i;
 
-	/* A partition was not found. */
-	if (result < 0)
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(SubplanResultRelHashElem);
+	ctl.hcxt = CurrentMemoryContext;
+
+	htab = hash_create("PartitionTupleRouting table", mtstate->mt_nplans,
+					   &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	proute->subplan_resultrel_hash = htab;
+
+	/* Hash all subplans by their Oid */
+	for (i = 0; i < mtstate->mt_nplans; i++)
 	{
-		char	   *val_desc;
+		ResultRelInfo *rri = &mtstate->resultRelInfo[i];
+		bool		found;
+		Oid			partoid = RelationGetRelid(rri->ri_RelationDesc);
+		SubplanResultRelHashElem   *elem;
 
-		val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-														values, isnull, 64);
-		Assert(OidIsValid(RelationGetRelid(rel)));
-		ereport(ERROR,
-				(errcode(ERRCODE_CHECK_VIOLATION),
-				 errmsg("no partition of relation \"%s\" found for row",
-						RelationGetRelationName(rel)),
-				 val_desc ? errdetail("Partition key of the failing row contains %s.", val_desc) : 0));
+		elem = (SubplanResultRelHashElem *)
+			hash_search(htab, &partoid, HASH_ENTER, &found);
+		Assert(!found);
+		elem->rri = rri;
+
+		/*
+		 * This is required in order to convert the partition's tuple to be
+		 * compatible with the root partitioned table's tuple descriptor. When
+		 * generating the per-subplan result rels, this was not set.
+		 */
+		rri->ri_PartitionRoot = proute->partition_root;
 	}
-
-	MemoryContextSwitchTo(oldcxt);
-	ecxt->ecxt_scantuple = ecxt_scantuple_old;
-
-	return result;
 }
 
 /*
  * ExecInitPartitionInfo
  *		Initialize ResultRelInfo and other information for a partition
+ *		and store it in the next empty slot in the proute->partitions array.
  *
  * Returns the ResultRelInfo
  */
-ResultRelInfo *
+static ResultRelInfo *
 ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
+					  ResultRelInfo *rootResultRelInfo,
 					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx)
+					  EState *estate,
+					  PartitionDispatch dispatch, int partidx)
 {
 	ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
-	Relation	rootrel = resultRelInfo->ri_RelationDesc,
+	Relation	rootrel = rootResultRelInfo->ri_RelationDesc,
 				partrel;
 	Relation	firstResultRel = mtstate->resultRelInfo[0].ri_RelationDesc;
 	ResultRelInfo *leaf_part_rri;
-	MemoryContext oldContext;
+	MemoryContext oldcxt;
 	AttrNumber *part_attnos = NULL;
 	bool		found_whole_row;
 
+	oldcxt = MemoryContextSwitchTo(proute->memcxt);
+
 	/*
 	 * We locked all the partitions in ExecSetupPartitionTupleRouting
 	 * including the leaf partitions.
 	 */
-	partrel = heap_open(proute->partition_oids[partidx], NoLock);
-
-	/*
-	 * Keep ResultRelInfo and other information for this partition in the
-	 * per-query memory context so they'll survive throughout the query.
-	 */
-	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
+	partrel = heap_open(dispatch->partdesc->oids[partidx], NoLock);
 
 	leaf_part_rri = makeNode(ResultRelInfo);
 	InitResultRelInfo(leaf_part_rri,
@@ -368,18 +521,6 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	CheckValidResultRel(leaf_part_rri, CMD_INSERT);
 
 	/*
-	 * Since we've just initialized this ResultRelInfo, it's not in any list
-	 * attached to the estate as yet.  Add it, so that it can be found later.
-	 *
-	 * Note that the entries in this list appear in no predetermined order,
-	 * because partition result rels are initialized as and when they're
-	 * needed.
-	 */
-	estate->es_tuple_routing_result_relations =
-		lappend(estate->es_tuple_routing_result_relations,
-				leaf_part_rri);
-
-	/*
 	 * Open partition indices.  The user may have asked to check for conflicts
 	 * within this leaf partition and do "nothing" instead of throwing an
 	 * error.  Be prepared in that case by initializing the index information
@@ -522,14 +663,14 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 	}
 
 	/* Set up information needed for routing tuples to the partition. */
-	ExecInitRoutingInfo(mtstate, estate, proute, leaf_part_rri, partidx);
+	ExecInitRoutingInfo(proute, dispatch, mtstate, estate,
+						leaf_part_rri, partidx);
 
 	/*
 	 * If there is an ON CONFLICT clause, initialize state for it.
 	 */
 	if (node && node->onConflictAction != ONCONFLICT_NONE)
 	{
-		TupleConversionMap *map = proute->parent_child_tupconv_maps[partidx];
 		int			firstVarno = mtstate->resultRelInfo[0].ri_RangeTableIndex;
 		TupleDesc	partrelDesc = RelationGetDescr(partrel);
 		ExprContext *econtext = mtstate->ps.ps_ExprContext;
@@ -542,7 +683,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * list and searching for ancestry relationships to each index in the
 		 * ancestor table.
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) > 0)
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) > 0)
 		{
 			List	   *childIdxs;
 
@@ -555,7 +696,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 				ListCell   *lc2;
 
 				ancestors = get_partition_ancestors(childIdx);
-				foreach(lc2, resultRelInfo->ri_onConflictArbiterIndexes)
+				foreach(lc2, rootResultRelInfo->ri_onConflictArbiterIndexes)
 				{
 					if (list_member_oid(ancestors, lfirst_oid(lc2)))
 						arbiterIndexes = lappend_oid(arbiterIndexes, childIdx);
@@ -569,7 +710,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 * (This shouldn't happen, since arbiter index selection should not
 		 * pick up an invalid index.)
 		 */
-		if (list_length(resultRelInfo->ri_onConflictArbiterIndexes) !=
+		if (list_length(rootResultRelInfo->ri_onConflictArbiterIndexes) !=
 			list_length(arbiterIndexes))
 			elog(ERROR, "invalid arbiter index list");
 		leaf_part_rri->ri_onConflictArbiterIndexes = arbiterIndexes;
@@ -579,8 +720,12 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		 */
 		if (node->onConflictAction == ONCONFLICT_UPDATE)
 		{
+			TupleConversionMap *map;
+
+			map = leaf_part_rri->ri_PartitionInfo->pi_RootToPartitionMap;
+
 			Assert(node->onConflictSet != NIL);
-			Assert(resultRelInfo->ri_onConflict != NULL);
+			Assert(rootResultRelInfo->ri_onConflict != NULL);
 
 			/*
 			 * If the partition's tuple descriptor matches exactly the root
@@ -589,7 +734,7 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 			 * need to create state specific to this partition.
 			 */
 			if (map == NULL)
-				leaf_part_rri->ri_onConflict = resultRelInfo->ri_onConflict;
+				leaf_part_rri->ri_onConflict = rootResultRelInfo->ri_onConflict;
 			else
 			{
 				List	   *onconflset;
@@ -680,37 +825,51 @@ ExecInitPartitionInfo(ModifyTableState *mtstate,
 		}
 	}
 
-	Assert(proute->partitions[partidx] == NULL);
-	proute->partitions[partidx] = leaf_part_rri;
+	/*
+	 * Since we've just initialized this ResultRelInfo, it's not in any list
+	 * attached to the estate as yet.  Add it, so that it can be found later.
+	 *
+	 * Note that the entries in this list appear in no predetermined order,
+	 * because partition result rels are initialized as and when they're
+	 * needed.
+	 */
+	MemoryContextSwitchTo(estate->es_query_cxt);
+	estate->es_tuple_routing_result_relations =
+		lappend(estate->es_tuple_routing_result_relations,
+				leaf_part_rri);
 
-	MemoryContextSwitchTo(oldContext);
+	MemoryContextSwitchTo(oldcxt);
 
 	return leaf_part_rri;
 }
 
 /*
  * ExecInitRoutingInfo
- *		Set up information needed for routing tuples to a leaf partition
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format, and keep track of it
+ *		in PartitionTupleRouting.
  */
-void
-ExecInitRoutingInfo(ModifyTableState *mtstate,
+static void
+ExecInitRoutingInfo(PartitionTupleRouting *proute,
+					PartitionDispatch dispatch,
+					ModifyTableState *mtstate,
 					EState *estate,
-					PartitionTupleRouting *proute,
 					ResultRelInfo *partRelInfo,
 					int partidx)
 {
-	MemoryContext oldContext;
+	MemoryContext oldcxt;
+	PartitionRoutingInfo *partrouteinfo;
+	int		rri_index;
 
-	/*
-	 * Switch into per-query memory context.
-	 */
-	oldContext = MemoryContextSwitchTo(estate->es_query_cxt);
+	oldcxt = MemoryContextSwitchTo(proute->memcxt);
+
+	partrouteinfo = palloc(sizeof(PartitionRoutingInfo));
 
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
 	 */
-	proute->parent_child_tupconv_maps[partidx] =
+	partrouteinfo->pi_RootToPartitionMap =
 		convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_PartitionRoot),
 							   RelationGetDescr(partRelInfo->ri_RelationDesc),
 							   gettext_noop("could not convert row type"));
@@ -721,29 +880,36 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	 * for various operations that are applied to tuples after routing, such
 	 * as checking constraints.
 	 */
-	if (proute->parent_child_tupconv_maps[partidx] != NULL)
+	if (partrouteinfo->pi_RootToPartitionMap != NULL)
 	{
 		Relation	partrel = partRelInfo->ri_RelationDesc;
 
 		/*
-		 * Initialize the array in proute where these slots are stored, if not
-		 * already done.
-		 */
-		if (proute->partition_tuple_slots == NULL)
-			proute->partition_tuple_slots = (TupleTableSlot **)
-				palloc0(proute->num_partitions *
-						sizeof(TupleTableSlot *));
-
-		/*
 		 * Initialize the slot itself setting its descriptor to this
 		 * partition's TupleDesc; TupleDesc reference will be released at the
 		 * end of the command.
 		 */
-		proute->partition_tuple_slots[partidx] =
-			ExecInitExtraTupleSlot(estate,
-								   RelationGetDescr(partrel),
+		partrouteinfo->pi_PartitionTupleSlot =
+			ExecInitExtraTupleSlot(estate, RelationGetDescr(partrel),
 								   &TTSOpsHeapTuple);
 	}
+	else
+		partrouteinfo->pi_PartitionTupleSlot = NULL;
+
+	/*
+	 * Also, if transition capture is required, store a map to convert tuples
+	 * from the partition's rowtype to the root partitioned table's.
+	 */
+	if (mtstate &&
+		(mtstate->mt_transition_capture || mtstate->mt_oc_transition_capture))
+	{
+		partrouteinfo->pi_PartitionToRootMap =
+			convert_tuples_by_name(RelationGetDescr(partRelInfo->ri_RelationDesc),
+								   RelationGetDescr(partRelInfo->ri_PartitionRoot),
+								   gettext_noop("could not convert row type"));
+	}
+	else
+		partrouteinfo->pi_PartitionToRootMap = NULL;
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -753,73 +919,138 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 		partRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
 		partRelInfo->ri_FdwRoutine->BeginForeignInsert(mtstate, partRelInfo);
 
-	MemoryContextSwitchTo(oldContext);
-
-	partRelInfo->ri_PartitionReadyForRouting = true;
-}
-
-/*
- * ExecSetupChildParentMapForLeaf -- Initialize the per-leaf-partition
- * child-to-root tuple conversion map array.
- *
- * This map is required for capturing transition tuples when the target table
- * is a partitioned table. For a tuple that is routed by an INSERT or UPDATE,
- * we need to convert it from the leaf partition to the target table
- * descriptor.
- */
-void
-ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute)
-{
-	Assert(proute != NULL);
+	partRelInfo->ri_PartitionInfo = partrouteinfo;
 
 	/*
-	 * These array elements get filled up with maps on an on-demand basis.
-	 * Initially just set all of them to NULL.
+	 * Keep track of it in the PartitionTupleRouting->partitions array.
 	 */
-	proute->child_parent_tupconv_maps =
-		(TupleConversionMap **) palloc0(sizeof(TupleConversionMap *) *
-										proute->num_partitions);
+	Assert(dispatch->indexes[partidx] == -1);
 
-	/* Same is the case for this array. All the values are set to false */
-	proute->child_parent_map_not_required =
-		(bool *) palloc0(sizeof(bool) * proute->num_partitions);
+	rri_index = proute->num_partitions++;
+
+	/* Allocate or enlarge the array, as needed */
+	if (proute->num_partitions >= proute->max_partitions)
+	{
+		if (proute->max_partitions == 0)
+		{
+			proute->max_partitions = 8;
+			proute->partitions = (ResultRelInfo **)
+				palloc(sizeof(ResultRelInfo *) * proute->max_partitions);
+		}
+		else
+		{
+			proute->max_partitions *= 2;
+			proute->partitions = (ResultRelInfo **)
+				repalloc(proute->partitions, sizeof(ResultRelInfo *) *
+						 proute->max_partitions);
+		}
+	}
+
+	proute->partitions[rri_index] = partRelInfo;
+	dispatch->indexes[partidx] = rri_index;
+
+	MemoryContextSwitchTo(oldcxt);
 }
 
 /*
- * TupConvMapForLeaf -- Get the tuple conversion map for a given leaf partition
- * index.
+ * ExecInitPartitionDispatchInfo
+ *		Initialize PartitionDispatch for a partitioned table and store it in
+ *		the next available slot in the proute->partition_dispatch_info array.
+ *		Also, record the index into this array in the parent_pd->indexes[]
+ *		array in the partidx element so that we can properly retrieve the
+ *		newly created PartitionDispatch later.
  */
-TupleConversionMap *
-TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index)
+static PartitionDispatch
+ExecInitPartitionDispatchInfo(PartitionTupleRouting *proute, Oid partoid,
+							  PartitionDispatch parent_pd, int partidx)
 {
-	ResultRelInfo **resultRelInfos = proute->partitions;
-	TupleConversionMap **map;
-	TupleDesc	tupdesc;
+	Relation	rel;
+	PartitionDesc partdesc;
+	PartitionDispatch pd;
+	int			dispatchidx;
+	MemoryContext oldcxt;
 
-	/* Don't call this if we're not supposed to be using this type of map. */
-	Assert(proute->child_parent_tupconv_maps != NULL);
+	oldcxt = MemoryContextSwitchTo(proute->memcxt);
 
-	/* If it's already known that we don't need a map, return NULL. */
-	if (proute->child_parent_map_not_required[leaf_index])
-		return NULL;
+	if (partoid != RelationGetRelid(proute->partition_root))
+		rel = heap_open(partoid, NoLock);
+	else
+		rel = proute->partition_root;
+	partdesc = RelationGetPartitionDesc(rel);
 
-	/* If we've already got a map, return it. */
-	map = &proute->child_parent_tupconv_maps[leaf_index];
-	if (*map != NULL)
-		return *map;
+	pd = (PartitionDispatch) palloc(offsetof(PartitionDispatchData, indexes) +
+									partdesc->nparts * sizeof(int));
+	pd->reldesc = rel;
+	pd->key = RelationGetPartitionKey(rel);
+	pd->keystate = NIL;
+	pd->partdesc = partdesc;
+	if (parent_pd != NULL)
+	{
+		TupleDesc	tupdesc = RelationGetDescr(rel);
 
-	/* No map yet; try to create one. */
-	tupdesc = RelationGetDescr(resultRelInfos[leaf_index]->ri_RelationDesc);
-	*map =
-		convert_tuples_by_name(tupdesc,
-							   RelationGetDescr(rootRelInfo->ri_RelationDesc),
-							   gettext_noop("could not convert row type"));
+		/*
+		 * For sub-partitioned tables where the column order differs from its
+		 * For a sub-partitioned table whose column order differs from that of
+		 * its direct parent partitioned table, we must store a tuple table slot
+		 * convert a tuple from its parent's rowtype to its own.  This is to
+		 * make sure that we are looking at the correct row using the correct
+		 * tuple descriptor when computing its partition key for tuple
+		 * routing.
+		 */
+		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent_pd->reldesc),
+													   tupdesc,
+													   gettext_noop("could not convert row type"));
+		pd->tupslot = pd->tupmap ?
+			MakeSingleTupleTableSlot(tupdesc, &TTSOpsHeapTuple) : NULL;
+	}
+	else
+	{
+		/* Not required for the root partitioned table */
+		pd->tupmap = NULL;
+		pd->tupslot = NULL;
+	}
 
-	/* If it turns out no map is needed, remember for next time. */
-	proute->child_parent_map_not_required[leaf_index] = (*map == NULL);
+	/*
+	 * Initialize with -1 to signify that the corresponding partition's
+	 * ResultRelInfo or PartitionDispatch has not been created yet.
+	 */
+	memset(pd->indexes, -1, sizeof(int) * partdesc->nparts);
 
-	return *map;
+	/* Track in PartitionTupleRouting for later use */
+	dispatchidx = proute->num_dispatch++;
+
+	/* Allocate or enlarge the array, as needed */
+	if (proute->num_dispatch >= proute->max_dispatch)
+	{
+		if (proute->max_dispatch == 0)
+		{
+			proute->max_dispatch = 4;
+			proute->partition_dispatch_info = (PartitionDispatch *)
+				palloc(sizeof(PartitionDispatch) * proute->max_dispatch);
+		}
+		else
+		{
+			proute->max_dispatch *= 2;
+			proute->partition_dispatch_info = (PartitionDispatch *)
+				repalloc(proute->partition_dispatch_info,
+						 sizeof(PartitionDispatch) * proute->max_dispatch);
+		}
+	}
+	proute->partition_dispatch_info[dispatchidx] = pd;
+
+	/*
+	 * Finally, if setting up a PartitionDispatch for a sub-partitioned table,
+	 * install a downlink in the parent to allow quick descent.
+	 */
+	if (parent_pd)
+	{
+		Assert(parent_pd->indexes[partidx] == -1);
+		parent_pd->indexes[partidx] = dispatchidx;
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+
+	return pd;
 }
 
 /*
@@ -832,8 +1063,8 @@ void
 ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute)
 {
+	HTAB	   *resultrel_hash = proute->subplan_resultrel_hash;
 	int			i;
-	int			subplan_index = 0;
 
 	/*
 	 * Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -847,187 +1078,40 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 		PartitionDispatch pd = proute->partition_dispatch_info[i];
 
 		heap_close(pd->reldesc, NoLock);
-		ExecDropSingleTupleTableSlot(pd->tupslot);
+
+		if (pd->tupslot)
+			ExecDropSingleTupleTableSlot(pd->tupslot);
 	}
 
 	for (i = 0; i < proute->num_partitions; i++)
 	{
 		ResultRelInfo *resultRelInfo = proute->partitions[i];
 
-		/* skip further processing for uninitialized partitions */
-		if (resultRelInfo == NULL)
-			continue;
+		/*
+		 * Check if this result rel is one belonging to the node's subplans,
+		 * Check if this result rel is one belonging to the node's subplans;
+		 */
+		if (resultrel_hash)
+		{
+			Oid			partoid;
+			bool		found;
+
+			partoid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+
+			(void) hash_search(resultrel_hash, &partoid, HASH_FIND, &found);
+			if (found)
+				continue;
+		}
 
 		/* Allow any FDWs to shut down if they've been exercised */
-		if (resultRelInfo->ri_PartitionReadyForRouting &&
-			resultRelInfo->ri_FdwRoutine != NULL &&
+		if (resultRelInfo->ri_FdwRoutine != NULL &&
 			resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
 			resultRelInfo->ri_FdwRoutine->EndForeignInsert(mtstate->ps.state,
 														   resultRelInfo);
 
-		/*
-		 * If this result rel is one of the UPDATE subplan result rels, let
-		 * ExecEndPlan() close it. For INSERT or COPY,
-		 * proute->subplan_partition_offsets will always be NULL. Note that
-		 * the subplan_partition_offsets array and the partitions array have
-		 * the partitions in the same order. So, while we iterate over
-		 * partitions array, we also iterate over the
-		 * subplan_partition_offsets array in order to figure out which of the
-		 * result rels are present in the UPDATE subplans.
-		 */
-		if (proute->subplan_partition_offsets &&
-			subplan_index < proute->num_subplan_partition_offsets &&
-			proute->subplan_partition_offsets[subplan_index] == i)
-		{
-			subplan_index++;
-			continue;
-		}
-
 		ExecCloseIndices(resultRelInfo);
 		heap_close(resultRelInfo->ri_RelationDesc, NoLock);
 	}
-
-	/* Release the standalone partition tuple descriptors, if any */
-	if (proute->root_tuple_slot)
-		ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
-}
-
-/*
- * RelationGetPartitionDispatchInfo
- *		Returns information necessary to route tuples down a partition tree
- *
- * The number of elements in the returned array (that is, the number of
- * PartitionDispatch objects for the partitioned tables in the partition tree)
- * is returned in *num_parted and a list of the OIDs of all the leaf
- * partitions of rel is returned in *leaf_part_oids.
- *
- * All the relations in the partition tree (including 'rel') must have been
- * locked (using at least the AccessShareLock) by the caller.
- */
-static PartitionDispatch *
-RelationGetPartitionDispatchInfo(Relation rel,
-								 int *num_parted, List **leaf_part_oids)
-{
-	List	   *pdlist = NIL;
-	PartitionDispatchData **pd;
-	ListCell   *lc;
-	int			i;
-
-	Assert(rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE);
-
-	*num_parted = 0;
-	*leaf_part_oids = NIL;
-
-	get_partition_dispatch_recurse(rel, NULL, &pdlist, leaf_part_oids);
-	*num_parted = list_length(pdlist);
-	pd = (PartitionDispatchData **) palloc(*num_parted *
-										   sizeof(PartitionDispatchData *));
-	i = 0;
-	foreach(lc, pdlist)
-	{
-		pd[i++] = lfirst(lc);
-	}
-
-	return pd;
-}
-
-/*
- * get_partition_dispatch_recurse
- *		Recursively expand partition tree rooted at rel
- *
- * As the partition tree is expanded in a depth-first manner, we maintain two
- * global lists: of PartitionDispatch objects corresponding to partitioned
- * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
- *
- * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
- * the order in which the planner's expand_partitioned_rtentry() processes
- * them.  It's not necessarily the case that the offsets match up exactly,
- * because constraint exclusion might prune away some partitions on the
- * planner side, whereas we'll always have the complete list; but unpruned
- * partitions will appear in the same order in the plan as they are returned
- * here.
- */
-static void
-get_partition_dispatch_recurse(Relation rel, Relation parent,
-							   List **pds, List **leaf_part_oids)
-{
-	TupleDesc	tupdesc = RelationGetDescr(rel);
-	PartitionDesc partdesc = RelationGetPartitionDesc(rel);
-	PartitionKey partkey = RelationGetPartitionKey(rel);
-	PartitionDispatch pd;
-	int			i;
-
-	check_stack_depth();
-
-	/* Build a PartitionDispatch for this table and add it to *pds. */
-	pd = (PartitionDispatch) palloc(sizeof(PartitionDispatchData));
-	*pds = lappend(*pds, pd);
-	pd->reldesc = rel;
-	pd->key = partkey;
-	pd->keystate = NIL;
-	pd->partdesc = partdesc;
-	if (parent != NULL)
-	{
-		/*
-		 * For every partitioned table other than the root, we must store a
-		 * tuple table slot initialized with its tuple descriptor and a tuple
-		 * conversion map to convert a tuple from its parent's rowtype to its
-		 * own. That is to make sure that we are looking at the correct row
-		 * using the correct tuple descriptor when computing its partition key
-		 * for tuple routing.
-		 */
-		pd->tupslot = MakeSingleTupleTableSlot(tupdesc, &TTSOpsHeapTuple);
-		pd->tupmap = convert_tuples_by_name_map_if_req(RelationGetDescr(parent),
-													   tupdesc,
-													   gettext_noop("could not convert row type"));
-	}
-	else
-	{
-		/* Not required for the root partitioned table */
-		pd->tupslot = NULL;
-		pd->tupmap = NULL;
-	}
-
-	/*
-	 * Go look at each partition of this table.  If it's a leaf partition,
-	 * simply add its OID to *leaf_part_oids.  If it's a partitioned table,
-	 * recursively call get_partition_dispatch_recurse(), so that its
-	 * partitions are processed as well and a corresponding PartitionDispatch
-	 * object gets added to *pds.
-	 *
-	 * The 'indexes' array is used when searching for a partition matching a
-	 * given tuple.  The actual value we store here depends on whether the
-	 * array element belongs to a leaf partition or a subpartitioned table.
-	 * For leaf partitions we store the index into *leaf_part_oids, and for
-	 * sub-partitioned tables we store a negative version of the index into
-	 * the *pds list.  Both indexes are 0-based, but the first element of the
-	 * *pds list is the root partition, so 0 always means the first leaf. When
-	 * searching, if we see a negative value, the search must continue in the
-	 * corresponding sub-partition; otherwise, we've identified the correct
-	 * partition.
-	 */
-	pd->indexes = (int *) palloc(partdesc->nparts * sizeof(int));
-	for (i = 0; i < partdesc->nparts; i++)
-	{
-		Oid			partrelid = partdesc->oids[i];
-
-		if (get_rel_relkind(partrelid) != RELKIND_PARTITIONED_TABLE)
-		{
-			*leaf_part_oids = lappend_oid(*leaf_part_oids, partrelid);
-			pd->indexes[i] = list_length(*leaf_part_oids) - 1;
-		}
-		else
-		{
-			/*
-			 * We assume all tables in the partition tree were already locked
-			 * by the caller.
-			 */
-			Relation	partrel = heap_open(partrelid, NoLock);
-
-			pd->indexes[i] = -list_length(*pds);
-			get_partition_dispatch_recurse(partrel, rel, pds, leaf_part_oids);
-		}
-	}
 }
 
 /* ----------------
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index bb344a7070..65d46c8ea8 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -68,7 +68,6 @@ static TupleTableSlot *ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						ResultRelInfo *targetRelInfo,
 						TupleTableSlot *slot);
 static ResultRelInfo *getTargetResultRelInfo(ModifyTableState *node);
-static void ExecSetupChildParentMapForTcs(ModifyTableState *mtstate);
 static void ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate);
 static TupleConversionMap *tupconv_map_for_subplan(ModifyTableState *node,
 						int whichplan);
@@ -1157,7 +1156,8 @@ lreplace:;
 			tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
 			if (tupconv_map != NULL)
 				slot = execute_attr_map_slot(tupconv_map->attrMap,
-											 slot, proute->root_tuple_slot);
+											 slot,
+											 mtstate->mt_root_tuple_slot);
 
 			/*
 			 * Prepare for tuple routing, making it look like we're inserting
@@ -1653,7 +1653,7 @@ ExecSetupTransitionCaptureState(ModifyTableState *mtstate, EState *estate)
 	if (mtstate->mt_transition_capture != NULL ||
 		mtstate->mt_oc_transition_capture != NULL)
 	{
-		ExecSetupChildParentMapForTcs(mtstate);
+		ExecSetupChildParentMapForSubplan(mtstate);
 
 		/*
 		 * Install the conversion map for the first plan for UPDATE and DELETE
@@ -1686,52 +1686,21 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 						TupleTableSlot *slot)
 {
 	ModifyTable *node;
-	int			partidx;
 	ResultRelInfo *partrel;
+	PartitionRoutingInfo *partrouteinfo;
 	HeapTuple	tuple;
 	TupleConversionMap *map;
 
 	/*
-	 * Determine the target partition.  If ExecFindPartition does not find a
-	 * partition after all, it doesn't return here; otherwise, the returned
-	 * value is to be used as an index into the arrays for the ResultRelInfo
-	 * and TupleConversionMap for the partition.
+	 * Lookup the target partition's ResultRelInfo.  If ExecFindPartition does
+	 * not find a valid partition for the tuple in 'slot' then an error is
+	 * raised.  An error may also be raised if the found partition is not a
+	 * valid target for INSERTs.  This is required since a partitioned table
+	 * UPDATE to another partition becomes a DELETE+INSERT.
 	 */
-	partidx = ExecFindPartition(targetRelInfo,
-								proute->partition_dispatch_info,
-								slot,
-								estate);
-	Assert(partidx >= 0 && partidx < proute->num_partitions);
-
-	/*
-	 * Get the ResultRelInfo corresponding to the selected partition; if not
-	 * yet there, initialize it.
-	 */
-	partrel = proute->partitions[partidx];
-	if (partrel == NULL)
-		partrel = ExecInitPartitionInfo(mtstate, targetRelInfo,
-										proute, estate,
-										partidx);
-
-	/*
-	 * Check whether the partition is routable if we didn't yet
-	 *
-	 * Note: an UPDATE of a partition key invokes an INSERT that moves the
-	 * tuple to a new partition.  This check would be applied to a subplan
-	 * partition of such an UPDATE that is chosen as the partition to route
-	 * the tuple to.  The reason we do this check here rather than in
-	 * ExecSetupPartitionTupleRouting is to avoid aborting such an UPDATE
-	 * unnecessarily due to non-routable subplan partitions that may not be
-	 * chosen for update tuple movement after all.
-	 */
-	if (!partrel->ri_PartitionReadyForRouting)
-	{
-		/* Verify the partition is a valid target for INSERT. */
-		CheckValidResultRel(partrel, CMD_INSERT);
-
-		/* Set up information needed for routing tuples to the partition. */
-		ExecInitRoutingInfo(mtstate, estate, proute, partrel, partidx);
-	}
+	partrel = ExecFindPartition(mtstate, targetRelInfo, proute, slot, estate);
+	partrouteinfo = partrel->ri_PartitionInfo;
+	Assert(partrouteinfo != NULL);
 
 	/*
 	 * Make it look like we are inserting into the partition.
@@ -1743,7 +1712,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 
 	/*
 	 * If we're capturing transition tuples, we might need to convert from the
-	 * partition rowtype to parent rowtype.
+	 * partition rowtype to root partitioned table's rowtype.
 	 */
 	if (mtstate->mt_transition_capture != NULL)
 	{
@@ -1756,7 +1725,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 			 */
 			mtstate->mt_transition_capture->tcs_original_insert_tuple = NULL;
 			mtstate->mt_transition_capture->tcs_map =
-				TupConvMapForLeaf(proute, targetRelInfo, partidx);
+				partrouteinfo->pi_PartitionToRootMap;
 		}
 		else
 		{
@@ -1771,20 +1740,17 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
 	if (mtstate->mt_oc_transition_capture != NULL)
 	{
 		mtstate->mt_oc_transition_capture->tcs_map =
-			TupConvMapForLeaf(proute, targetRelInfo, partidx);
+			partrouteinfo->pi_PartitionToRootMap;
 	}
 
 	/*
 	 * Convert the tuple, if necessary.
 	 */
-	map = proute->parent_child_tupconv_maps[partidx];
+	map = partrouteinfo->pi_RootToPartitionMap;
 	if (map != NULL)
 	{
-		TupleTableSlot *new_slot;
+		TupleTableSlot *new_slot = partrouteinfo->pi_PartitionTupleSlot;
 
-		Assert(proute->partition_tuple_slots != NULL &&
-			   proute->partition_tuple_slots[partidx] != NULL);
-		new_slot = proute->partition_tuple_slots[partidx];
 		slot = execute_attr_map_slot(map->attrMap, slot, new_slot);
 	}
 
@@ -1823,17 +1789,6 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 	int			i;
 
 	/*
-	 * First check if there is already a per-subplan array allocated. Even if
-	 * there is already a per-leaf map array, we won't require a per-subplan
-	 * one, since we will use the subplan offset array to convert the subplan
-	 * index to per-leaf index.
-	 */
-	if (mtstate->mt_per_subplan_tupconv_maps ||
-		(mtstate->mt_partition_tuple_routing &&
-		 mtstate->mt_partition_tuple_routing->child_parent_tupconv_maps))
-		return;
-
-	/*
 	 * Build array of conversion maps from each child's TupleDesc to the one
 	 * used in the target relation.  The map pointers may be NULL when no
 	 * conversion is necessary, which is hopefully a common case.
@@ -1855,78 +1810,17 @@ ExecSetupChildParentMapForSubplan(ModifyTableState *mtstate)
 }
 
 /*
- * Initialize the child-to-root tuple conversion map array required for
- * capturing transition tuples.
- *
- * The map array can be indexed either by subplan index or by leaf-partition
- * index.  For transition tables, we need a subplan-indexed access to the map,
- * and where tuple-routing is present, we also require a leaf-indexed access.
- */
-static void
-ExecSetupChildParentMapForTcs(ModifyTableState *mtstate)
-{
-	PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
-
-	/*
-	 * If partition tuple routing is set up, we will require partition-indexed
-	 * access. In that case, create the map array indexed by partition; we
-	 * will still be able to access the maps using a subplan index by
-	 * converting the subplan index to a partition index using
-	 * subplan_partition_offsets. If tuple routing is not set up, it means we
-	 * don't require partition-indexed access. In that case, create just a
-	 * subplan-indexed map.
-	 */
-	if (proute)
-	{
-		/*
-		 * If a partition-indexed map array is to be created, the subplan map
-		 * array has to be NULL.  If the subplan map array is already created,
-		 * we won't be able to access the map using a partition index.
-		 */
-		Assert(mtstate->mt_per_subplan_tupconv_maps == NULL);
-
-		ExecSetupChildParentMapForLeaf(proute);
-	}
-	else
-		ExecSetupChildParentMapForSubplan(mtstate);
-}
-
-/*
  * For a given subplan index, get the tuple conversion map.
  */
 static TupleConversionMap *
 tupconv_map_for_subplan(ModifyTableState *mtstate, int whichplan)
 {
-	/*
-	 * If a partition-index tuple conversion map array is allocated, we need
-	 * to first get the index into the partition array. Exactly *one* of the
-	 * two arrays is allocated. This is because if there is a partition array
-	 * required, we don't require subplan-indexed array since we can translate
-	 * subplan index into partition index. And, we create a subplan-indexed
-	 * array *only* if partition-indexed array is not required.
-	 */
+	/* If nobody else set the per-subplan array of maps, do so ourselves. */
 	if (mtstate->mt_per_subplan_tupconv_maps == NULL)
-	{
-		int			leaf_index;
-		PartitionTupleRouting *proute = mtstate->mt_partition_tuple_routing;
+		ExecSetupChildParentMapForSubplan(mtstate);
 
-		/*
-		 * If subplan-indexed array is NULL, things should have been arranged
-		 * to convert the subplan index to partition index.
-		 */
-		Assert(proute && proute->subplan_partition_offsets != NULL &&
-			   whichplan < proute->num_subplan_partition_offsets);
-
-		leaf_index = proute->subplan_partition_offsets[whichplan];
-
-		return TupConvMapForLeaf(proute, getTargetResultRelInfo(mtstate),
-								 leaf_index);
-	}
-	else
-	{
-		Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
-		return mtstate->mt_per_subplan_tupconv_maps[whichplan];
-	}
+	Assert(whichplan >= 0 && whichplan < mtstate->mt_nplans);
+	return mtstate->mt_per_subplan_tupconv_maps[whichplan];
 }
 
 /* ----------------------------------------------------------------
@@ -2370,10 +2264,15 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 * descriptor of a source partition does not match the root partitioned
 	 * table descriptor.  In such a case we need to convert tuples to the root
 	 * tuple descriptor, because the search for destination partition starts
-	 * from the root.  Skip this setup if it's not a partition key update.
+	 * from the root.  We'll also need a slot to store these converted tuples.
+	 * We can skip this setup if it's not a partition key update.
 	 */
 	if (update_tuple_routing_needed)
+	{
 		ExecSetupChildParentMapForSubplan(mtstate);
+		mtstate->mt_root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel),
+														 &TTSOpsHeapTuple);
+	}
 
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
@@ -2716,10 +2615,18 @@ ExecEndModifyTable(ModifyTableState *node)
 														   resultRelInfo);
 	}
 
-	/* Close all the partitioned tables, leaf partitions, and their indices */
+	/*
+	 * Close all the partitioned tables, leaf partitions, and their indices
+	 * and release the slot used for tuple routing, if set.
+	 */
 	if (node->mt_partition_tuple_routing)
+	{
 		ExecCleanupTupleRouting(node, node->mt_partition_tuple_routing);
 
+		if (node->mt_root_tuple_slot)
+			ExecDropSingleTupleTableSlot(node->mt_root_tuple_slot);
+	}
+
 	/*
 	 * Free the exprcontext
 	 */
diff --git a/src/backend/optimizer/prep/prepunion.c b/src/backend/optimizer/prep/prepunion.c
index d5720518a8..2a1c1cb2e1 100644
--- a/src/backend/optimizer/prep/prepunion.c
+++ b/src/backend/optimizer/prep/prepunion.c
@@ -1657,9 +1657,6 @@ expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte, Index rti)
 /*
  * expand_partitioned_rtentry
  *		Recursively expand an RTE for a partitioned table.
- *
- * Note that RelationGetPartitionDispatchInfo will expand partitions in the
- * same order as this code.
  */
 static void
 expand_partitioned_rtentry(PlannerInfo *root, RangeTblEntry *parentrte,
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 07653f312b..7856b47cdd 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -340,15 +340,23 @@ RelationBuildPartitionDesc(Relation rel)
 	oldcxt = MemoryContextSwitchTo(rel->rd_pdcxt);
 	partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
 	partdesc->oids = (Oid *) palloc(partdesc->nparts * sizeof(Oid));
+	partdesc->is_leaf = (bool *) palloc(partdesc->nparts * sizeof(bool));
 
 	/*
 	 * Now assign OIDs from the original array into mapped indexes of the
-	 * result array.  Order of OIDs in the former is defined by the catalog
-	 * scan that retrieved them, whereas that in the latter is defined by
-	 * canonicalized representation of the partition bounds.
+	 * result array.  The order of OIDs in the former is defined by the
+	 * catalog scan that retrieved them, whereas that in the latter is defined
+	 * by canonicalized representation of the partition bounds.
 	 */
 	for (i = 0; i < partdesc->nparts; i++)
-		partdesc->oids[mapping[i]] = oids_orig[i];
+	{
+		int			index = mapping[i];
+
+		partdesc->oids[index] = oids_orig[i];
+		/* Record if the partition is a leaf partition */
+		partdesc->is_leaf[index] =
+				(get_rel_relkind(oids_orig[i]) != RELKIND_PARTITIONED_TABLE);
+	}
 	MemoryContextSwitchTo(oldcxt);
 
 	rel->rd_partdesc = partdesc;
diff --git a/src/include/catalog/partition.h b/src/include/catalog/partition.h
index a53de2372e..59c7a6ab69 100644
--- a/src/include/catalog/partition.h
+++ b/src/include/catalog/partition.h
@@ -25,7 +25,11 @@
 typedef struct PartitionDescData
 {
 	int			nparts;			/* Number of partitions */
-	Oid		   *oids;			/* OIDs of partitions */
+	Oid		   *oids;			/* Array of 'nparts' elements containing
+								 * partition OIDs in order of their bounds */
+	bool	   *is_leaf;		/* Array of 'nparts' elements storing whether
+								 * the corresponding 'oids' element belongs to
+								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
 } PartitionDescData;
 
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 3e08104ea4..d3cfb55f9f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -18,74 +18,36 @@
 #include "nodes/plannodes.h"
 #include "partitioning/partprune.h"
 
-/* See execPartition.c for the definition. */
+/* See execPartition.c for the definitions. */
 typedef struct PartitionDispatchData *PartitionDispatch;
+typedef struct PartitionTupleRouting PartitionTupleRouting;
 
-/*-----------------------
- * PartitionTupleRouting - Encapsulates all information required to execute
- * tuple-routing between partitions.
+/*
+ * PartitionRoutingInfo
  *
- * partition_dispatch_info		Array of PartitionDispatch objects with one
- *								entry for every partitioned table in the
- *								partition tree.
- * num_dispatch					number of partitioned tables in the partition
- *								tree (= length of partition_dispatch_info[])
- * partition_oids				Array of leaf partitions OIDs with one entry
- *								for every leaf partition in the partition tree,
- *								initialized in full by
- *								ExecSetupPartitionTupleRouting.
- * partitions					Array of ResultRelInfo* objects with one entry
- *								for every leaf partition in the partition tree,
- *								initialized lazily by ExecInitPartitionInfo.
- * num_partitions				Number of leaf partitions in the partition tree
- *								(= 'partitions_oid'/'partitions' array length)
- * parent_child_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert tuple from the root table's rowtype to
- *								a leaf partition's rowtype after tuple routing
- *								is done)
- * child_parent_tupconv_maps	Array of TupleConversionMap objects with one
- *								entry for every leaf partition (required to
- *								convert an updated tuple from the leaf
- *								partition's rowtype to the root table's rowtype
- *								so that tuple routing can be done)
- * child_parent_map_not_required  Array of bool. True value means that a map is
- *								determined to be not required for the given
- *								partition. False means either we haven't yet
- *								checked if a map is required, or it was
- *								determined to be required.
- * subplan_partition_offsets	Integer array ordered by UPDATE subplans. Each
- *								element of this array has the index into the
- *								corresponding partition in partitions array.
- * num_subplan_partition_offsets  Length of 'subplan_partition_offsets' array
- * partition_tuple_slots		Array of TupleTableSlot objects; if non-NULL,
- *								contains one entry for every leaf partition,
- *								of which only those of the leaf partitions
- *								whose attribute numbers differ from the root
- *								parent have a non-NULL value.  NULL if all of
- *								the partitions encountered by a given command
- *								happen to have same rowtype as the root parent
- * root_tuple_slot				TupleTableSlot to be used to transiently hold
- *								copy of a tuple that's being moved across
- *								partitions in the root partitioned table's
- *								rowtype
- *-----------------------
+ * Additional result relation information specific to routing tuples to a
+ * table partition.
  */
-typedef struct PartitionTupleRouting
+typedef struct PartitionRoutingInfo
 {
-	PartitionDispatch *partition_dispatch_info;
-	int			num_dispatch;
-	Oid		   *partition_oids;
-	ResultRelInfo **partitions;
-	int			num_partitions;
-	TupleConversionMap **parent_child_tupconv_maps;
-	TupleConversionMap **child_parent_tupconv_maps;
-	bool	   *child_parent_map_not_required;
-	int		   *subplan_partition_offsets;
-	int			num_subplan_partition_offsets;
-	TupleTableSlot **partition_tuple_slots;
-	TupleTableSlot *root_tuple_slot;
-} PartitionTupleRouting;
+	/*
+	 * Map for converting tuples in root partitioned table format into
+	 * partition format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_RootToPartitionMap;
+
+	/*
+	 * Map for converting tuples in partition format into the root partitioned
+	 * table format, or NULL if no conversion is required.
+	 */
+	TupleConversionMap *pi_PartitionToRootMap;
+
+	/*
+	 * Slot to store tuples in partition format, or NULL when no translation
+	 * is required between root and partition.
+	 */
+	TupleTableSlot *pi_PartitionTupleSlot;
+} PartitionRoutingInfo;
 
 /*
  * PartitionedRelPruningData - Per-partitioned-table data for run-time pruning
@@ -175,22 +137,11 @@ typedef struct PartitionPruneState
 
 extern PartitionTupleRouting *ExecSetupPartitionTupleRouting(ModifyTableState *mtstate,
 							   Relation rel);
-extern int ExecFindPartition(ResultRelInfo *resultRelInfo,
-				  PartitionDispatch *pd,
+extern ResultRelInfo *ExecFindPartition(ModifyTableState *mtstate,
+				  ResultRelInfo *rootResultRelInfo,
+				  PartitionTupleRouting *proute,
 				  TupleTableSlot *slot,
 				  EState *estate);
-extern ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
-					  ResultRelInfo *resultRelInfo,
-					  PartitionTupleRouting *proute,
-					  EState *estate, int partidx);
-extern void ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					ResultRelInfo *partRelInfo,
-					int partidx);
-extern void ExecSetupChildParentMapForLeaf(PartitionTupleRouting *proute);
-extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
-				  ResultRelInfo *rootRelInfo, int leaf_index);
 extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
 						PartitionTupleRouting *proute);
 extern PartitionPruneState *ExecCreatePartitionPruneState(PlanState *planstate,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 63c871e6d0..569cc7c476 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -33,6 +33,7 @@
 
 
 struct PlanState;				/* forward references in this file */
+struct PartitionRoutingInfo;
 struct ParallelHashJoinState;
 struct ExecRowMark;
 struct ExprState;
@@ -469,8 +470,8 @@ typedef struct ResultRelInfo
 	/* relation descriptor for root partitioned table */
 	Relation	ri_PartitionRoot;
 
-	/* true if ready for tuple routing */
-	bool		ri_PartitionReadyForRouting;
+	/* Additional information specific to partition tuple routing */
+	struct PartitionRoutingInfo *ri_PartitionInfo;
 } ResultRelInfo;
 
 /* ----------------
@@ -1112,6 +1113,12 @@ typedef struct ModifyTableState
 	List	   *mt_excludedtlist;	/* the excluded pseudo relation's tlist  */
 	TupleTableSlot *mt_conflproj;	/* CONFLICT ... SET ... projection target */
 
+	/*
+	 * Slot for storing tuples in the root partitioned table's rowtype during
+	 * an UPDATE of a partitioned table.
+	 */
+	TupleTableSlot *mt_root_tuple_slot;
+
 	/* Tuple-routing support info */
 	struct PartitionTupleRouting *mt_partition_tuple_routing;
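
To make the array bookkeeping above easier to follow in isolation, here is a
minimal standalone sketch of the idiom the patch applies to
proute->partitions and proute->partition_dispatch_info: the arrays start
empty and double on demand, and a per-parent indexes[] array holds -1 until a
partition is first touched. This is only an illustration, not executor code:
the Routing/routing_get names are invented for the example, plain
malloc/realloc stands in for memory contexts, and error checks are omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Stand-in for the lazy arrays in PartitionTupleRouting.  'indexes' plays
 * the role of PartitionDispatch->indexes: -1 until a partition is used.
 */
typedef struct Routing
{
	void	  **partitions;		/* gapless array of initialized entries */
	int			num_partitions;
	int			max_partitions;
	int		   *indexes;		/* one entry per partition, -1 = unused */
	int			nparts;
} Routing;

static Routing *
routing_create(int nparts)
{
	Routing    *r = (Routing *) malloc(sizeof(Routing));

	r->partitions = NULL;
	r->num_partitions = 0;
	r->max_partitions = 0;
	r->nparts = nparts;
	r->indexes = (int *) malloc(sizeof(int) * nparts);
	/* all-ones bytes give -1 in each int element, as in the patch */
	memset(r->indexes, -1, sizeof(int) * nparts);
	return r;
}

/* Return the entry for partition 'partidx', creating it on first use. */
static void *
routing_get(Routing *r, int partidx)
{
	int			i;

	if (r->indexes[partidx] >= 0)
		return r->partitions[r->indexes[partidx]];

	i = r->num_partitions++;

	/* Allocate or enlarge the array, as needed: start small, double. */
	if (r->num_partitions >= r->max_partitions)
	{
		r->max_partitions = (r->max_partitions == 0) ?
			8 : r->max_partitions * 2;
		r->partitions = (void **)
			realloc(r->partitions, sizeof(void *) * r->max_partitions);
	}

	r->partitions[i] = malloc(1);	/* stand-in for building a ResultRelInfo */
	r->indexes[partidx] = i;
	return r->partitions[i];
}

int
main(void)
{
	Routing    *r = routing_create(10000);

	/* Routing a single tuple touches exactly one partition. */
	(void) routing_get(r, 1234);
	printf("initialized %d of %d partitions\n",
		   r->num_partitions, r->nparts);
	return 0;
}

With 10k partitions, a single-row INSERT then initializes one entry instead
of allocating and zeroing all 10,000 up front.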
 
#70Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#69)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

I repeated David's original tests not terribly rigorously[*] and got
these numbers:

* Unpatched:  72.396196
* 0001:       77.279404
* 0001+0002:  20409.415094
* 0002:       816.606613
* Control:    22969.140596 (insertion into an unpartitioned table)

So while this patch by itself gives a pretty lame increase in tps, it
removes bottlenecks that would otherwise dominate once we change the
locking scheme.

[*] On my laptop, running each test only once for 60s.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#71Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#57)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Nov-13, David Rowley wrote:

> The 0002 patch is included again, this time with a new proposed commit
> message. There was some discussion over on [1] where nobody seemed to
> have any concerns about delaying the locking until we route the first
> tuple to the partition.

Please get a new commitfest entry for this patch.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#72David Rowley
david.rowley@2ndquadrant.com
In reply to: Alvaro Herrera (#69)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Sat, 17 Nov 2018 at 04:14, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> I'll now see about the commit message and push shortly.

Many thanks for making the required adjustments and pushing this.

If I wasn't on leave late last week and early this week then the only
thing I'd have mentioned was the lack of an empty comment line in the
header comment for PartitionDispatchData. It looks a bit messy
without.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#73Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: David Rowley (#72)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On 2018-Nov-21, David Rowley wrote:

> If I wasn't on leave late last week and early this week then the only
> thing I'd have mentioned was the lack of an empty comment line in the
> header comment for PartitionDispatchData. It looks a bit messy
> without.

Absolutely. Pushed a few newlines -- I hope I understood you correctly.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#74David Rowley
david.rowley@2ndquadrant.com
In reply to: Alvaro Herrera (#73)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> On 2018-Nov-21, David Rowley wrote:
>
> > If I wasn't on leave late last week and early this week then the only
> > thing I'd have mentioned was the lack of an empty comment line in the
> > header comment for PartitionDispatchData. It looks a bit messy
> > without.
>
> Absolutely. Pushed a few newlines -- I hope I understood you correctly.

Thanks, you did. That looks better now.

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#75Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#74)
1 attachment(s)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

Hi,

On Thu, Nov 22, 2018 at 7:25 AM David Rowley
<david.rowley@2ndquadrant.com> wrote:

> On Thu, 22 Nov 2018 at 07:06, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> > On 2018-Nov-21, David Rowley wrote:
> >
> > > If I wasn't on leave late last week and early this week then the only
> > > thing I'd have mentioned was the lack of an empty comment line in the
> > > header comment for PartitionDispatchData. It looks a bit messy
> > > without.
> >
> > Absolutely. Pushed a few newlines -- I hope I understood you correctly.
>
> Thanks, you did. That looks better now.

I noticed that there's a "be" missing in the comment above
ExecFindPartition. Fixed in the attached.

Thanks,
Amit

Attachments:

ExecFindPartition-typo.patch (application/octet-stream)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 24de567a92..179a501f30 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -259,8 +259,8 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
  * scratch space.
  *
  * If no leaf partition is found, this routine errors out with the appropriate
- * error message.  An error may also raised if the found target partition is
- * not a valid target for an INSERT.
+ * error message.  An error may also be raised if the found target partition
+ * is not a valid target for an INSERT.
  */
 ResultRelInfo *
 ExecFindPartition(ModifyTableState *mtstate,
#76Michael Paquier
michael@paquier.xyz
In reply to: Amit Langote (#75)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Thu, Nov 22, 2018 at 11:32:04AM +0900, Amit Langote wrote:

> I noticed that there's a "be" missing in the comment above
> ExecFindPartition. Fixed in the attached.

Thanks Amit, I have committed this one.
--
Michael

#77David Rowley
david.rowley@2ndquadrant.com
In reply to: Alvaro Herrera (#71)
Re: Speeding up INSERTs and UPDATEs to partitioned tables

On Sat, 17 Nov 2018 at 07:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

> > The 0002 patch is included again, this time with a new proposed commit
> > message. There was some discussion over on [1] where nobody seemed to
> > have any concerns about delaying the locking until we route the first
> > tuple to the partition.
>
> Please get a new commitfest entry for this patch.

Added to Jan-fest in: https://commitfest.postgresql.org/21/1887/

--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services