Skip partition tuple routing with constant partition key

Started by houzj.fnst@fujitsu.com over 4 years ago · 68 messages
#1 houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
1 attachment(s)

Hi,

When loading some data into a partitioned table for testing purposes,

I found that even if I specified a constant value for the partition key[1], it still does

the tuple routing for each row.

[1]: ---------------------

UPDATE partitioned SET part_key = 2, …

INSERT INTO partitioned(part_key, ...) SELECT 1, …

---------------------

I have seen such SQL statements automatically generated by some programs,

so, personally, I think it would be better to skip the tuple routing in this case.

IMO, we can use the following steps to skip the tuple routing:

1) Collect the columns that are assigned constant values in the targetList.

2) Compare those constant columns with the columns used in the partition key.

3) If all the columns used in the key are constant, then we cache the routed partition

 and do not do the tuple routing again.

With this approach, I did some simple performance tests:

----For a plain single-column partition key (partition by range(col)/list(col)...):

When loading 100000000 rows into the table, I saw about a 5-7% performance gain

for both cross-partition UPDATE and INSERT when a constant is specified for the partition key.

----For a more complicated expression partition key (partition by range(UDF_func(col)+x)…):

When loading 100000000 rows into the table, the gain is larger:

more than 20%.

Besides, I did not see any noticeable performance degradation in other cases (small data sets).

Attaching a POC patch for this improvement.

Thoughts?

Best regards,

houzj

Attachments:

0001-skip-tuple-routing-for-constant-partition-key.patch (application/octet-stream)
From 6972ae859d2543225baf669363bfe6f8e70c0add Mon Sep 17 00:00:00 2001
From: houzj <houzj.fnst@cn.fujitsu.com>
Date: Mon, 17 May 2021 17:43:57 +0800
Subject: [PATCH] skip-tuple-routing-for-constant-partition-key

---
 src/backend/access/common/tupconvert.c | 37 +++++++++++++++
 src/backend/commands/copyfrom.c        |  3 ++
 src/backend/executor/execPartition.c   | 84 ++++++++++++++++++++++++++++++++--
 src/backend/executor/nodeModifyTable.c | 51 ++++++++++++++++++++-
 src/include/access/tupconvert.h        |  1 +
 src/include/nodes/execnodes.h          |  9 ++++
 6 files changed, 179 insertions(+), 6 deletions(-)

diff --git a/src/backend/access/common/tupconvert.c b/src/backend/access/common/tupconvert.c
index 64f5439..f22414f 100644
--- a/src/backend/access/common/tupconvert.c
+++ b/src/backend/access/common/tupconvert.c
@@ -278,6 +278,43 @@ execute_attr_map_cols(AttrMap *attrMap, Bitmapset *in_cols)
 }
 
 /*
+ * Perform conversion of bitmap of columns according to the map.
+ *
+ * Only convert normal user column.
+ *
+ * output column that does not correspond to any input column will
+ * still be set in bitmap.
+ */
+Bitmapset *
+execute_attr_map_cols_with_null(AttrMap *attrMap, Bitmapset *in_cols)
+{
+	Bitmapset  *out_cols;
+	int			out_attnum;
+
+	/* fast path for the common trivial case */
+	if (in_cols == NULL)
+		return NULL;
+
+	out_cols = NULL;
+
+	for (out_attnum = 1;
+		 out_attnum <= attrMap->maplen;
+		 out_attnum++)
+	{
+		int			in_attnum;
+
+		in_attnum = attrMap->attnums[out_attnum - 1];
+
+		if (in_attnum == 0 ||
+			bms_is_member(in_attnum, in_cols))
+			out_cols = bms_add_member(out_cols, out_attnum);
+	}
+
+	return out_cols;
+}
+
+
+/*
  * Free a TupleConversionMap structure.
  */
 void
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 40a54ad..ea62cfa 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -670,6 +670,9 @@ CopyFrom(CopyFromState cstate)
 	mtstate->mt_nrels = 1;
 	mtstate->resultRelInfo = resultRelInfo;
 	mtstate->rootResultRelInfo = resultRelInfo;
+	mtstate->const_bms = NULL;
+	mtstate->cache_routed_rel = false;
+	mtstate->targetpartrel = NULL;
 
 	if (resultRelInfo->ri_FdwRoutine != NULL &&
 		resultRelInfo->ri_FdwRoutine->BeginForeignInsert != NULL)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920..dca6799 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -24,6 +24,7 @@
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
 #include "nodes/makefuncs.h"
+#include "optimizer/optimizer.h"
 #include "partitioning/partbounds.h"
 #include "partitioning/partdesc.h"
 #include "partitioning/partprune.h"
@@ -191,7 +192,7 @@ static void find_matching_subplans_recurse(PartitionPruningData *prunedata,
 										   PartitionedRelPruningData *pprune,
 										   bool initial_prune,
 										   Bitmapset **validsubplans);
-
+static bool ExecSimplifyTupleRouting(PartitionDispatch pd, Bitmapset *constcols);
 
 /*
  * ExecSetupPartitionTupleRouting - sets up information needed during
@@ -271,6 +272,15 @@ ExecFindPartition(ModifyTableState *mtstate,
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
 	ResultRelInfo *rri = NULL;
+	bool		need_tuple_routing;
+	Bitmapset  *const_bms = mtstate->const_bms;
+	Bitmapset  *root_const_bms = mtstate->const_bms;
+
+	need_tuple_routing = !(RelationGetRelid(rootResultRelInfo->ri_RelationDesc) ==
+			   RelationGetRelid(mtstate->rootResultRelInfo->ri_RelationDesc));
+
+	if (mtstate->targetpartrel != NULL && !need_tuple_routing)
+		return mtstate->targetpartrel;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -327,6 +337,11 @@ ExecFindPartition(ModifyTableState *mtstate,
 					 errtable(rel)));
 		}
 
+		/* Check if we can skip tuple routing next time */
+		if (mtstate->cache_routed_rel)
+			mtstate->cache_routed_rel = ExecSimplifyTupleRouting(dispatch,
+																 const_bms);
+
 		is_leaf = partdesc->is_leaf[partidx];
 		if (is_leaf)
 		{
@@ -415,8 +430,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 
 			/*
-			 * Convert the tuple to the new parent's layout, if different from
-			 * the previous parent.
+			 * Convert the tuple and constant value bitmap to the new parent's
+			 * layout, if different from the previous parent.
 			 */
 			if (dispatch->tupslot)
 			{
@@ -424,6 +439,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 				TupleTableSlot *tempslot = myslot;
 
 				myslot = dispatch->tupslot;
+
+				if (mtstate->cache_routed_rel)
+					const_bms = execute_attr_map_cols_with_null(map, const_bms);
+
 				slot = execute_attr_map_slot(map, slot, myslot);
 
 				if (tempslot != NULL)
@@ -450,24 +469,43 @@ ExecFindPartition(ModifyTableState *mtstate,
 			 *
 			 * Note that we have a map to convert from root to current
 			 * partition, but not from immediate parent to current partition.
-			 * So if we have to convert, do it from the root slot; if not, use
-			 * the root slot as-is.
+			 * So if we have to convert, do it from the root slot and constant
+			 * value bitmap; if not, use the root slot as-is.
 			 */
 			if (is_leaf)
 			{
 				TupleConversionMap *map = rri->ri_RootToPartitionMap;
 
 				if (map)
+				{
+					if (mtstate->cache_routed_rel)
+						const_bms = execute_attr_map_cols_with_null(map->attrMap,
+																	const_bms);
+
 					slot = execute_attr_map_slot(map->attrMap, rootslot,
 												 rri->ri_PartitionTupleSlot);
+				}
 				else
+				{
+					const_bms = root_const_bms;
 					slot = rootslot;
+				}
 			}
 
 			ExecPartitionCheck(rri, slot, estate, true);
 		}
 	}
 
+	/*
+	 * If all of the columns used in partition key are constant, cache the
+	 * target partition if first time reach here.
+	 */
+	if (mtstate->cache_routed_rel)
+	{
+		mtstate->cache_routed_rel = false;
+		mtstate->targetpartrel = rri;
+	}
+
 	/* Release the tuple in the lowest parent's dedicated slot. */
 	if (myslot != NULL)
 		ExecClearTuple(myslot);
@@ -1165,6 +1203,42 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
 	}
 }
 
+/*
+ * Check whether all of the columns used in partition key are
+ * const value
+ */
+static bool
+ExecSimplifyTupleRouting(PartitionDispatch pd,
+						 Bitmapset *constcols)
+{
+	int	i;
+	List	   *expr_vars;
+	ListCell   *lc;
+
+	if (constcols == NULL)
+		return false;
+
+	/* Check plain columns */
+	for (i = 0; i < pd->key->partnatts; i++)
+	{
+		AttrNumber	keycol = pd->key->partattrs[i];
+		if (keycol != 0 && !bms_is_member(keycol, constcols))
+			return false;
+	}
+
+	/* Check columns in partition expression */
+	expr_vars = pull_var_clause((Node *) pd->key->partexprs, 0);
+	foreach(lc, expr_vars)
+	{
+		Var *var = lfirst(lc);
+
+		if (var->varno == 1 && !bms_is_member(var->varattno, constcols))
+			return false;
+	}
+
+	return true;
+}
+
 /* ----------------
  *		FormPartitionKeyDatum
  *			Construct values[] and isnull[] arrays for the partition key
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 0816027..5d7dcb2 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1445,7 +1445,11 @@ ExecCrossPartitionUpdate(ModifyTableState *mtstate,
 	/* Initialize tuple routing info if not already done. */
 	if (mtstate->mt_partition_tuple_routing == NULL)
 	{
+		ListCell   *lc, *lc2;
+		List	   *updateColnos;
+		List	   *targetlist;
 		Relation	rootRel = mtstate->rootResultRelInfo->ri_RelationDesc;
+		ModifyTable *node = (ModifyTable *) mtstate->ps.plan;
 		MemoryContext oldcxt;
 
 		/* Things built here have to last for the query duration. */
@@ -1454,6 +1458,22 @@ ExecCrossPartitionUpdate(ModifyTableState *mtstate,
 		mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(estate, rootRel);
 
+		/* Initialize constant value bitmapset */
+		Assert(mtstate->const_bms == NULL);
+		mtstate->cache_routed_rel = true;
+		updateColnos = (List *) list_nth(node->updateColnosLists,
+			resultRelInfo - mtstate->resultRelInfo);
+
+		targetlist = outerPlan(node)->targetlist;
+		forboth(lc, targetlist, lc2, updateColnos)
+		{
+			TargetEntry *tle = lfirst_node(TargetEntry, lc);
+			AttrNumber	targetattnum = lfirst_int(lc2);
+			if (!tle->resjunk && IsA(tle->expr, Const))
+				mtstate->const_bms = bms_add_member(mtstate->const_bms,
+													targetattnum);
+		}
+
 		/*
 		 * Before a partition's tuple can be re-routed, it must first be
 		 * converted to the root's format, so we'll need a slot for storing
@@ -1532,10 +1552,17 @@ ExecCrossPartitionUpdate(ModifyTableState *mtstate,
 	 */
 	tupconv_map = ExecGetChildToRootMap(resultRelInfo);
 	if (tupconv_map != NULL)
-		slot = execute_attr_map_slot(tupconv_map->attrMap,
+	{
+		AttrMap *attrMap = tupconv_map->attrMap;
+		slot = execute_attr_map_slot(attrMap,
 									 slot,
 									 mtstate->mt_root_tuple_slot);
 
+		if (mtstate->cache_routed_rel)
+			mtstate->const_bms = execute_attr_map_cols_with_null(attrMap,
+														mtstate->const_bms);
+	}
+
 	/* Tuple routing starts from the root table. */
 	*inserted_tuple = ExecInsert(mtstate, mtstate->rootResultRelInfo, slot,
 								 planSlot, estate, canSetTag);
@@ -2874,6 +2901,10 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	/* Get the root target relation */
 	rel = mtstate->rootResultRelInfo->ri_RelationDesc;
 
+	mtstate->cache_routed_rel = false;
+	mtstate->const_bms = NULL;
+	mtstate->targetpartrel = NULL;
+
 	/*
 	 * Build state for tuple routing if it's a partitioned INSERT.  An UPDATE
 	 * might need this too, but only if it actually moves tuples between
@@ -2881,9 +2912,27 @@ ExecInitModifyTable(ModifyTable *node, EState *estate, int eflags)
 	 */
 	if (rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE &&
 		operation == CMD_INSERT)
+	{
 		mtstate->mt_partition_tuple_routing =
 			ExecSetupPartitionTupleRouting(estate, rel);
 
+		mtstate->cache_routed_rel = true;
+
+		/*
+		 * Build the constant bitmap for tuple routing simplification if
+		 * it's a partitioned INSERT. Cross partition UPDATE will build it
+		 * in ExecCrossPartitionUpdate.
+		 */
+		i = 1;
+		foreach(l, subplan->targetlist)
+		{
+			TargetEntry *tc = (TargetEntry *) lfirst(l);
+			if (!tc->resjunk && IsA(tc->expr, Const))
+				mtstate->const_bms = bms_add_member(mtstate->const_bms, i);
+			i++;
+		}
+	}
+
 	/*
 	 * Initialize any WITH CHECK OPTION constraints if needed.
 	 */
diff --git a/src/include/access/tupconvert.h b/src/include/access/tupconvert.h
index a2cc4b3..ea8eeac 100644
--- a/src/include/access/tupconvert.h
+++ b/src/include/access/tupconvert.h
@@ -45,6 +45,7 @@ extern TupleTableSlot *execute_attr_map_slot(AttrMap *attrMap,
 											 TupleTableSlot *in_slot,
 											 TupleTableSlot *out_slot);
 extern Bitmapset *execute_attr_map_cols(AttrMap *attrMap, Bitmapset *inbitmap);
+extern Bitmapset *execute_attr_map_cols_with_null(AttrMap *attrMap, Bitmapset *inbitmap);
 
 extern void free_conversion_map(TupleConversionMap *map);
 
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69..5ba392b 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -1213,6 +1213,15 @@ typedef struct ModifyTableState
 	HTAB	   *mt_resultOidHash;	/* optional hash table to speed lookups */
 
 	/*
+	 * Cached target partition if all the columns in partition key are
+	 * constant, otherwise NULL
+	 */
+	ResultRelInfo *targetpartrel;
+	Bitmapset  *const_bms;			/* column numbers of constant value */
+	bool		cache_routed_rel;	/* do we need to cache the target
+									 * partition after tuple routing */
+
+	/*
 	 * Slot for storing tuples in the root partitioned table's rowtype during
 	 * an UPDATE of a partitioned table.
 	 */
-- 
2.7.2.windows.1

#2 Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#1)
Re: Skip partition tuple routing with constant partition key

On Mon, May 17, 2021 at 8:37 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

When loading some data into a partitioned table for testing purpose,

I found even if I specified constant value for the partition key[1], it still do

the tuple routing for each row.

[1]---------------------

UPDATE partitioned set part_key = 2 , …

INSERT into partitioned(part_key, ...) select 1, …

---------------------

I saw such SQLs automatically generated by some programs,

Hmm, does this seem common enough for the added complexity to be worthwhile?

For an example of what's previously been considered worthwhile for a
project like this, see what 0d5f05cde0 did. The cases it addressed
are common enough -- a file being loaded into a (time range-)
partitioned table using COPY FROM tends to have lines belonging to the
same partition consecutively placed.

--
Amit Langote
EDB: http://www.enterprisedb.com

#3 David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#2)
Re: Skip partition tuple routing with constant partition key

On Tue, 18 May 2021 at 01:31, Amit Langote <amitlangote09@gmail.com> wrote:

Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For
testing purposes does not seem to be quite a good enough reason.

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call. I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types. LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

IIRC, the patch did something like call ExecPartitionCheck() on the
new tuple with the previously routed to ResultRelInfo. I think the
last used partition was cached somewhere like relcache (which seems a
bit questionable). Likely this would speed up the example case here
a bit. Not as much as the proposed patch, but it would likely apply in
many more cases.

I don't think I ever posted the patch to the list, and if so I no
longer have access to it, so it would need to be done again.

David

#4 houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: David Rowley (#3)
RE: Skip partition tuple routing with constant partition key

Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For testing
purposes does not seem to be quite a good enough reason.

Thanks for the response.

For some big data scenarios, we sometimes transfer data from one table (storing only unexpired data)
to another table (historical data) for later analysis.
In this case, we import data into the historical table regularly (perhaps every day or half a day),
and the data is likely to be imported with a date label specified, so all of the data to be
imported at one time belongs to the same partition of a table partitioned by time range.

So, personally, it would be nice if Postgres could skip the tuple routing for each row in this scenario.

A slightly different optimization that I have considered and even written
patches before was to have ExecFindPartition() cache the last routed to
partition and have it check if the new row can go into that one on the next call.
I imagined there might be a use case for speeding that up for RANGE
partitioned tables since it seems fairly likely that most use cases, at least for
time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it seems most
likely that we'll be routing to a different partition each time and wouldn't save
much since routing to hash partitions are cheaper than other types. LIST
partitioning I'm not so sure about. It seems much less likely than RANGE to hit
the same partition twice in a row.

I think your approach looks good too,
and it seems it does not conflict with the approach proposed here.

Best regards,
houzj

#5 Michael Paquier
michael@paquier.xyz
In reply to: David Rowley (#3)
Re: Skip partition tuple routing with constant partition key

On Tue, May 18, 2021 at 01:27:48PM +1200, David Rowley wrote:

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call. I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types. LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

It depends a lot on the schema used and the load pattern, but I'd like
to think that a similar argument can be made in favor of LIST
partitioning here.
--
Michael

#6 Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#3)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Tue, May 18, 2021 at 10:28 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Tue, 18 May 2021 at 01:31, Amit Langote <amitlangote09@gmail.com> wrote:

Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For
testing purposes does not seem to be quite a good enough reason.

A slightly different optimization that I have considered and even
written patches before was to have ExecFindPartition() cache the last
routed to partition and have it check if the new row can go into that
one on the next call. I imagined there might be a use case for
speeding that up for RANGE partitioned tables since it seems fairly
likely that most use cases, at least for time series ranges will
always hit the same partition most of the time. Since RANGE requires
a binary search there might be some savings there. I imagine that
optimisation would never be useful for HASH partitioning since it
seems most likely that we'll be routing to a different partition each
time and wouldn't save much since routing to hash partitions are
cheaper than other types. LIST partitioning I'm not so sure about. It
seems much less likely than RANGE to hit the same partition twice in a
row.

IIRC, the patch did something like call ExecPartitionCheck() on the
new tuple with the previously routed to ResultRelInfo. I think the
last used partition was cached somewhere like relcache (which seems a
bit questionable). Likely this would speed up the example case here
a bit. Not as much as the proposed patch, but it would likely apply in
many more cases.

I don't think I ever posted the patch to the list, and if so I no
longer have access to it, so it would need to be done again.

I gave a shot to implementing your idea and ended up with the attached
PoC patch, which does pass make check-world.

I do see some speedup:

-- creates a range-partitioned table with 1000 partitions
create unlogged table foo (a int) partition by range (a);
select 'create unlogged table foo_' || i || ' partition of foo for
values from (' || (i-1)*100000+1 || ') to (' || i*100000+1 || ');'
from generate_series(1, 1000) i;
\gexec

-- generates a 100 million record file
copy (select generate_series(1, 100000000)) to '/tmp/100m.csv' csv;

Times for loading that file compare as follows:

HEAD:

postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 31813.964 ms (00:31.814)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 31972.942 ms (00:31.973)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 32049.046 ms (00:32.049)

Patched:

postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 26151.158 ms (00:26.151)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 28161.082 ms (00:28.161)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 26700.908 ms (00:26.701)

I guess it would be nice if we could fit in a solution for the use
case that houzj mentioned as a special case. BTW, houzj, could you
please check if a patch like this one helps the case you mentioned?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

ExecFindPartition-cache-partition-PoC.patch (application/octet-stream)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..2735945c6c 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -150,6 +150,7 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+	ResultRelInfo *lastPartInfo;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -291,6 +292,35 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Check if the saved partition accepts this tuple by evaluating its
+		 * partition constraint against the tuple.  If it does, we save a trip
+		 * to get_partition_for_tuple(), which can be a slightly more expensive
+		 * way to get the same partition, especially if there are many
+		 * partitions to search through.
+		 */
+		if (dispatch->lastPartInfo)
+		{
+			TupleTableSlot *tmpslot;
+			TupleConversionMap *map;
+
+			rri = dispatch->lastPartInfo;
+			map = rri->ri_RootToPartitionMap;
+			if (map)
+				tmpslot = execute_attr_map_slot(map->attrMap, rootslot,
+												rri->ri_PartitionTupleSlot);
+			else
+				tmpslot = rootslot;
+			if (ExecPartitionCheck(rri, tmpslot, estate, false))
+			{
+				/* and restore ecxt's scantuple */
+				ecxt->ecxt_scantuple = ecxt_scantuple_saved;
+				MemoryContextSwitchTo(oldcxt);
+				return rri;
+			}
+			dispatch->lastPartInfo = rri = NULL;
+		}
+
 		rel = dispatch->reldesc;
 		partdesc = dispatch->partdesc;
 
@@ -372,6 +402,19 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 			Assert(rri != NULL);
 
+			/*
+			 * Remember this partition for the next tuple inserted into this
+			 * parent; see at the top of this loop how it's decided whether
+			 * the next tuple can indeed reuse this partition.
+			 *
+			 * Do this only if we have range/list partitions, because only
+			 * in that case it's conceivable that consecutively inserted rows
+			 * tend to go into the same partition.
+			 */
+			if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+				 dispatch->key->strategy == PARTITION_STRATEGY_RANGE))
+				dispatch->lastPartInfo = rri;
+
 			/* Signal to terminate the loop */
 			dispatch = NULL;
 		}
@@ -1051,6 +1094,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		pd->tupslot = NULL;
 	}
 
+	pd->lastPartInfo = NULL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
#7 Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#4)
Re: Skip partition tuple routing with constant partition key

On Tue, May 18, 2021 at 11:11 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

Hmm, does this seem common enough for the added complexity to be worthwhile?

I'd also like to know if there's some genuine use case for this. For testing
purposes does not seem to be quite a good enough reason.

Thanks for the response.

For some big data scenario, we sometimes transfer data from one table(only store not expired data)
to another table(historical data) for future analysis.
In this case, we import data into historical table regularly(could be one day or half a day),
And the data is likely to be imported with date label specified, then all of the data to be
imported this time belong to the same partition which partition by time range.

Is directing that data directly into the appropriate partition not an
acceptable solution to address this particular use case? Yeah, I know
we should avoid encouraging users to perform DML directly on
partitions, but...

--
Amit Langote
EDB: http://www.enterprisedb.com

#8 tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Amit Langote (#7)
RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>

On Tue, May 18, 2021 at 11:11 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

For some big data scenario, we sometimes transfer data from one table(only store not expired data)

to another table(historical data) for future analysis.

In this case, we import data into historical table regularly(could be one day or half a day),

And the data is likely to be imported with date label specified, then all of the data to be

imported this time belong to the same partition which partition by time range.

Is directing that data directly into the appropriate partition not an
acceptable solution to address this particular use case? Yeah, I know
we should avoid encouraging users to perform DML directly on
partitions, but...

Yes, I want to make/keep it possible for application developers to be unaware of partitions. I believe that's why David-san, Alvaro-san, and you have made great efforts to improve partitioning performance. So, I'm +1 for what Hou-san is trying to achieve.

Is there something you're concerned about? The amount and/or complexity of added code?

Regards
Takayuki Tsunakawa

#9 David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#6)
Re: Skip partition tuple routing with constant partition key

On Thu, 20 May 2021 at 01:17, Amit Langote <amitlangote09@gmail.com> wrote:

I gave a shot to implementing your idea and ended up with the attached
PoC patch, which does pass make check-world.

I only had a quick look at this.

+ if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+ dispatch->key->strategy == PARTITION_STRATEGY_RANGE))
+ dispatch->lastPartInfo = rri;

I think you must have meant to have one of these as PARTITION_STRATEGY_LIST?

Wondering what your thoughts are on, instead of caching the last used
ResultRelInfo from the last call to ExecFindPartition(), to instead
cache the last looked-up partition index in PartitionDescData? That
way we could cache lookups between statements. Right now your caching
is not going to help for single-row INSERTs, for example.

For multi-level partition hierarchies that would still require looping
and checking the cached value at each level.

I've not studied the code that builds and rebuilds PartitionDescData,
so there may be some reason that we shouldn't do that. I know that's
changed a bit recently with DETACH CONCURRENTLY. However, providing
the cached index is not outside the bounds of the oids array, it
shouldn't really matter if the cached value happens to end up pointing
to some other partition. If that happens, we'll just fail the
ExecPartitionCheck() and have to look for the correct partition.

David

#10 David Rowley
dgrowleyml@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#8)
Re: Skip partition tuple routing with constant partition key

On Thu, 20 May 2021 at 12:20, tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

Yes, I want to make/keep it possible that application developers can be unaware of partitions. I believe that's why David-san, Alvaro-san, and you have made great efforts to improve partitioning performance. So, I'm +1 for what Hou-san is trying to achieve.

Is there something you're concerned about? The amount and/or complexity of added code?

It would be good to see how close Amit's patch gets to the performance
of the original patch on this thread. As far as I can see, the
difference is, aside from the setup code to determine if the partition
is constant, that Amit's patch just requires an additional
ExecPartitionCheck() call per row. That should be pretty cheap when
compared to the binary search to find the partition for a RANGE or
LIST partitioned table.

Houzj didn't mention how the table in the test was partitioned, so
it's hard to speculate how many comparisons would be done during a
binary search. Or maybe it was HASH partitioned and there was no
binary search.

David

#11 Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#9)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Thu, May 20, 2021 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 20 May 2021 at 01:17, Amit Langote <amitlangote09@gmail.com> wrote:

I gave a shot to implementing your idea and ended up with the attached
PoC patch, which does pass make check-world.

I only had a quick look at this.

+ if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+ dispatch->key->strategy == PARTITION_STRATEGY_RANGE))
+ dispatch->lastPartInfo = rri;

I think you must have meant to have one of these as PARTITION_STRATEGY_LIST?

Oops, of course. Fixed in the attached.

Wondering what your thoughts are on, instead of caching the last used
ResultRelInfo from the last call to ExecFindPartition(), to instead
cache the last looked-up partition index in PartitionDescData? That
way we could cache lookups between statements. Right now your caching
is not going to help for single-row INSERTs, for example.

Hmm, addressing single-row INSERTs with something like you suggest
might help time-range partitioning setups, because each of those
INSERTs are likely to be targeting the same partition most of the
time. Is that case what you had in mind? Although, in the cases
where that doesn't help, we'd end up making a ResultRelInfo for the
cached partition to check the partition constraint, only then to be
thrown away because the new row belongs to a different partition.
That overhead would not be free for sure.

For multi-level partition hierarchies that would still require looping
and checking the cached value at each level.

Yeah, there's no getting around that, though maybe that's not a big problem.

I've not studied the code that builds and rebuilds PartitionDescData,
so there may be some reason that we shouldn't do that. I know that's
changed a bit recently with DETACH CONCURRENTLY. However, providing
the cached index is not outside the bounds of the oids array, it
shouldn't really matter if the cached value happens to end up pointing
to some other partition. If that happens, we'll just fail the
ExecPartitionCheck() and have to look for the correct partition.

Yeah, as long as ExecFindPartition() performs ExecPartitionCheck()
before returning a given cached partition, there's no need to worry
about the cached index getting stale for whatever reason.
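The cache-then-verify scheme being agreed on here can be illustrated with a toy model. The names are hypothetical; in the real patch the verification step is ExecPartitionCheck() on the cached ResultRelInfo rather than an inline bounds test, but the safety argument is the same: a wrong or stale cached index simply fails the check and falls back to the full search.

```c
#include <assert.h>

typedef struct RouteState
{
    const int *lower_bounds;    /* sorted lower bounds, one per partition */
    int        nparts;
    int        last_part;       /* cached partition index, -1 if none */
    int        full_searches;   /* instrumentation for this sketch */
} RouteState;

/* full lookup: scan for the greatest lower bound <= key */
static int
find_partition_slow(RouteState *rs, int key)
{
    rs->full_searches++;
    for (int i = rs->nparts - 1; i >= 0; i--)
        if (rs->lower_bounds[i] <= key)
            return i;
    return -1;
}

static int
find_partition(RouteState *rs, int key)
{
    int p = rs->last_part;

    /* Does the cached partition still accept this key? */
    if (p >= 0 && rs->lower_bounds[p] <= key &&
        (p == rs->nparts - 1 || key < rs->lower_bounds[p + 1]))
        return p;

    /* Cache miss (or no cache yet): do the full lookup and remember it. */
    rs->last_part = find_partition_slow(rs, key);
    return rs->last_part;
}
```

When consecutive rows route to the same partition, the full search runs only once; correctness never depends on the cached value being right.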

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

ExecFindPartition-cache-partition-PoC_v2.patchapplication/octet-stream; name=ExecFindPartition-cache-partition-PoC_v2.patchDownload
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..2348eb3154 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,12 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * lastPartInfo
+ * 		If non-NULL, ResultRelInfo for the partition that was most recently
+ * 		chosen as the routing target; ExecFindPartition() checks if the
+ * 		same one can be used for the current row before applying the tuple-
+ * 		routing algorithm to it.
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +156,7 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+	ResultRelInfo *lastPartInfo;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -291,6 +298,35 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		CHECK_FOR_INTERRUPTS();
 
+		/*
+		 * Check if the saved partition accepts this tuple by evaluating its
+		 * partition constraint against the tuple.  If it does, we save a trip
+		 * to get_partition_for_tuple(), which can be a slightly more expensive
+		 * way to get the same partition, especially if there are many
+		 * partitions to search through.
+		 */
+		if (dispatch->lastPartInfo)
+		{
+			TupleTableSlot *tmpslot;
+			TupleConversionMap *map;
+
+			rri = dispatch->lastPartInfo;
+			map = rri->ri_RootToPartitionMap;
+			if (map)
+				tmpslot = execute_attr_map_slot(map->attrMap, rootslot,
+												rri->ri_PartitionTupleSlot);
+			else
+				tmpslot = rootslot;
+			if (ExecPartitionCheck(rri, tmpslot, estate, false))
+			{
+				/* and restore ecxt's scantuple */
+				ecxt->ecxt_scantuple = ecxt_scantuple_saved;
+				MemoryContextSwitchTo(oldcxt);
+				return rri;
+			}
+			dispatch->lastPartInfo = rri = NULL;
+		}
+
 		rel = dispatch->reldesc;
 		partdesc = dispatch->partdesc;
 
@@ -372,6 +408,19 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 			Assert(rri != NULL);
 
+			/*
+			 * Remember this partition for the next tuple inserted into this
+			 * parent; see at the top of this loop how it's decided whether
+			 * the next tuple can indeed reuse this partition.
+			 *
+			 * Do this only if we have range/list partitions, because only
+			 * in that case it's conceivable that consecutively inserted rows
+			 * tend to go into the same partition.
+			 */
+			if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+				 dispatch->key->strategy == PARTITION_STRATEGY_LIST))
+				dispatch->lastPartInfo = rri;
+
 			/* Signal to terminate the loop */
 			dispatch = NULL;
 		}
@@ -1051,6 +1100,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		pd->tupslot = NULL;
 	}
 
+	pd->lastPartInfo = NULL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
#12Amit Langote
amitlangote09@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#8)
Re: Skip partition tuple routing with constant partition key

On Thu, May 20, 2021 at 9:20 AM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>

On Tue, May 18, 2021 at 11:11 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

For some big data scenario, we sometimes transfer data from one table(only

store not expired data)

to another table(historical data) for future analysis.
In this case, we import data into historical table regularly(could be one day or

half a day),

And the data is likely to be imported with date label specified, then all of the

data to be

imported this time belong to the same partition which partition by time range.

Is directing that data directly into the appropriate partition not an
acceptable solution to address this particular use case? Yeah, I know
we should avoid encouraging users to perform DML directly on
partitions, but...

Yes, I want to make/keep it possible that application developers can be unaware of partitions. I believe that's why David-san, Alvaro-san, and you have made great efforts to improve partitioning performance. So, I'm +1 for what Hou-san is trying to achieve.

I'm very glad to see such discussions on the list, because it means
the partitioning feature is being stretched to cover a wider set of
use cases.

Is there something you're concerned about? The amount and/or complexity of added code?

IMHO, a patch that implements caching more generally would be better
even if it adds some complexity. Hou-san's patch seemed centered
around the use case where all rows being loaded in a given command
route to the same partition, a very specialized case I'd say.

Maybe we can extract the logic in Hou-san's patch to check the
constant-ness of the targetlist producing the rows to insert and find
a way to add it to the patch I posted such that the generality of the
latter's implementation is not lost.

--
Amit Langote
EDB: http://www.enterprisedb.com

#13David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#11)
Re: Skip partition tuple routing with constant partition key

On Thu, 20 May 2021 at 20:49, Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, May 20, 2021 at 9:31 AM David Rowley <dgrowleyml@gmail.com> wrote:

Wondering what your thoughts are on, instead of caching the last used
ResultRelInfo from the last call to ExecFindPartition(), to instead
cached the last looked up partition index in PartitionDescData? That
way we could cache lookups between statements. Right now your caching
is not going to help for single-row INSERTs, for example.

Hmm, addressing single-row INSERTs with something like you suggest
might help time-range partitioning setups, because each of those
INSERTs are likely to be targeting the same partition most of the
time. Is that case what you had in mind?

Yeah, I thought it would possibly be useful for RANGE partitioning. I
was a bit undecided with LIST. There seemed to be bigger risk there
that the usage pattern would route to a different partition each time.
In my imagination, RANGE partitioning seems more likely to see
subsequent tuples heading to the same partition as the last tuple.

Although, in the cases
where that doesn't help, we'd end up making a ResultRelInfo for the
cached partition to check the partition constraint, only then to be
thrown away because the new row belongs to a different partition.
That overhead would not be free for sure.

Yeah, there's certainly above zero overhead to getting it wrong. It
would be good to see benchmarks to find out what that overhead is.

David

#14houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#6)
RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>
Sent: Wednesday, May 19, 2021 9:17 PM

I gave a shot to implementing your idea and ended up with the attached PoC
patch, which does pass make check-world.

I do see some speedup:

-- creates a range-partitioned table with 1000 partitions
create unlogged table foo (a int) partition by range (a);
select 'create unlogged table foo_' || i || ' partition of foo for values from (' || (i-1)*100000+1 || ') to (' || i*100000+1 || ');' from generate_series(1, 1000) i;
\gexec

-- generates a 100 million record file
copy (select generate_series(1, 100000000)) to '/tmp/100m.csv' csv;

Times for loading that file compare as follows:

HEAD:

postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 31813.964 ms (00:31.814)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 31972.942 ms (00:31.973)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 32049.046 ms (00:32.049)

Patched:

postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 26151.158 ms (00:26.151)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 28161.082 ms (00:28.161)
postgres=# copy foo from '/tmp/100m.csv' csv;
COPY 100000000
Time: 26700.908 ms (00:26.701)

I guess it would be nice if we could fit in a solution for the use case that houjz
mentioned as a special case. BTW, houjz, could you please check if a patch like
this one helps the case you mentioned?

Thanks for the patch!
I did some test on it(using the table you provided above):

1): Test plain column in partition key.
SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD:
Time: 5493.392 ms (00:05.493)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 4484.492 ms (00:04.484)

The test results of your patch in this case look good.
It can fit many more cases and the performance gain is nice.

-----------
2) Test expression in partition key

create or replace function partition_func(i int) returns int as $$
begin
return i;
end;
$$ language plpgsql immutable parallel restricted;
create unlogged table foo (a int) partition by range (partition_func(a));

SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD
Time: 8595.120 ms (00:08.595)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 12829.800 ms (00:12.830)

If a user-defined function is used in the partition key, there seems
to be a performance degradation after the patch.

I did some analysis on it. For the above test case, ExecPartitionCheck
executes three expressions: 1) key is null, 2) key > low, 3) key < top.
Here, the "key" contains a FuncExpr, and that FuncExpr is executed three
times for each row, so it brings extra overhead which causes the
performance degradation.

IMO, improving ExecPartitionCheck seems a better solution: we can
calculate the key value in advance and use that value to do the bound check.
Thoughts ?
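The saving being proposed can be demonstrated in miniature. This is a stand-in model, not executor code; key_expr() plays the role of the expensive FuncExpr in the partition key, and the counter shows how many times it runs per check.

```c
#include <assert.h>
#include <stdbool.h>

/* counts how often the (stand-in) partition key expression runs */
static int key_evals = 0;

/* stand-in for an expensive key expression such as a PL/pgSQL function */
static int
key_expr(int a)
{
    key_evals++;
    return a;           /* pretend this is costly */
}

/*
 * Naive check, mirroring the generated constraint
 * [key >= lower AND key < upper]: each clause re-runs the key
 * expression (a third run happens for the null test in the real qual).
 */
static bool
check_naive(int a, int lower, int upper)
{
    if (key_expr(a) < lower)        /* clause: key >= lower */
        return false;
    if (key_expr(a) >= upper)       /* clause: key < upper */
        return false;
    return true;
}

/* Proposed shape: evaluate the key expression once, reuse the result. */
static bool
check_precomputed(int a, int lower, int upper)
{
    int k = key_expr(a);
    return k >= lower && k < upper;
}
```

With a cheap key expression the difference is noise; with a PL/pgSQL function it is the multi-second gap seen in the timings above.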

------------

Besides, are we going to add a reloption or GUC to control this caching
behaviour if we move forward with this approach? If most of the rows being
inserted route to a different partition each time, the extra ExecPartitionCheck()
will just be overhead. Maybe it's better to apply both approaches (cache the last
partition and skip tuple routing for a constant partition key),
which could achieve the best performance in both cases.

Best regards,
houzj

#15Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#14)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Thu, May 20, 2021 at 7:35 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Wednesday, May 19, 2021 9:17 PM

I guess it would be nice if we could fit in a solution for the use case that houjz
mentioned as a special case. BTW, houjz, could you please check if a patch like
this one helps the case you mentioned?

Thanks for the patch!
I did some test on it(using the table you provided above):

Thanks a lot for doing that.

1): Test plain column in partition key.
SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD:
Time: 5493.392 ms (00:05.493)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 4484.492 ms (00:04.484)

The test results of your patch in this case looks good.
It can fit many more cases and the performance gain is nice.

Hmm yeah, not too bad.

2) Test expression in partition key

create or replace function partition_func(i int) returns int as $$
begin
return i;
end;
$$ language plpgsql immutable parallel restricted;
create unlogged table foo (a int) partition by range (partition_func(a));

SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD
Time: 8595.120 ms (00:08.595)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 12829.800 ms (00:12.830)

If add a user defined function in the partition key, it seems have
performance degradation after the patch.

Oops.

I did some analysis on it, for the above testcase , ExecPartitionCheck
executed three expression 1) key is null 2) key > low 3) key < top
In this case, the "key" contains a funcexpr and the funcexpr will be executed
three times for each row, so, it bring extra overhead which cause the performance degradation.

IMO, improving the ExecPartitionCheck seems a better solution to it, we can
Calculate the key value in advance and use the value to do the bound check.
Thoughts ?

This one seems a bit tough. ExecPartitionCheck() uses the generic
expression evaluation machinery as a black box, which means
execPartition.c can't really tweak/control the time spent evaluating
partition constraints. Given that, we may have to disable the caching
when key->partexprs != NIL, unless we can reasonably do what you are
suggesting.

Besides, are we going to add a reloption or guc to control this cache behaviour if we more forward with this approach ?
Because, If most of the rows to be inserted are routing to a different partition each time, then I think the extra ExecPartitionCheck
will become the overhead. Maybe it's better to apply both two approaches(cache the last partition and skip constant partition key)
which can achieve the best performance results.

A reloption will have to be a last resort is what I can say about this
at the moment.

--
Amit Langote
EDB: http://www.enterprisedb.com

#16houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#15)
1 attachment(s)
RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 20, 2021 8:23 PM

Hou-san,

On Thu, May 20, 2021 at 7:35 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

2) Test expression in partition key

create or replace function partition_func(i int) returns int as $$
begin
return i;
end;
$$ language plpgsql immutable parallel restricted; create unlogged
table foo (a int) partition by range (partition_func(a));

SQL: insert into foo select 1 from generate_series(1, 10000000);

HEAD
Time: 8595.120 ms (00:08.595)

AFTER PATCH(skip constant partition key)
Time: 4198.421 ms (00:04.198)

AFTER PATCH(cache the last partition)
Time: 12829.800 ms (00:12.830)

If add a user defined function in the partition key, it seems have
performance degradation after the patch.

Oops.

I did some analysis on it, for the above testcase , ExecPartitionCheck
executed three expression 1) key is null 2) key > low 3) key < top In
this case, the "key" contains a funcexpr and the funcexpr will be
executed three times for each row, so, it bring extra overhead which cause

the performance degradation.

IMO, improving the ExecPartitionCheck seems a better solution to it,
we can Calculate the key value in advance and use the value to do the bound

check.

Thoughts ?

This one seems bit tough. ExecPartitionCheck() uses the generic expression
evaluation machinery like a black box, which means execPartition.c can't really
tweal/control the time spent evaluating partition constraints. Given that, we
may have to disable the caching when key->partexprs != NIL, unless we can
reasonably do what you are suggesting.

I did some research on the CHECK expression that ExecPartitionCheck() executes.
Currently, for a normal RANGE partition key, it first generates a CHECK expression
like: [Keyexpression IS NOT NULL AND Keyexpression > lowerbound AND Keyexpression < upperbound].
In this case, the Keyexpression is re-executed for each clause, which brings some overhead.

Instead, I think we can try the following steps:
1) extract the Keyexpression from the CHECK expression
2) evaluate the key expression in advance
3) pass the result of the key expression to the partition CHECK
In this way, we execute the key expression only once, which looks more efficient.
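Concretely, the row layout the POC builds, plain columns first with the key-expression results appended after them, can be modeled like this. It is a toy with int values standing in for Datums; the actual patch builds a TupleDesc for the extended slot and rewrites the qual's key expressions into Vars pointing at the appended attribute numbers.

```c
#include <assert.h>
#include <string.h>

#define NCOLS      3   /* plain columns in the routed row */
#define NKEYEXPRS  1   /* key expressions whose results get appended */

/*
 * Build the extended row: [col1 .. colN, keyexpr1 result .. keyexprN result].
 * The rewritten partition constraint then reads the precomputed result
 * at attribute NCOLS + k instead of re-evaluating the expression.
 */
static void
build_extended_row(const int *cols, const int *keyresults, int *out)
{
    memcpy(out, cols, NCOLS * sizeof(int));
    memcpy(out + NCOLS, keyresults, NKEYEXPRS * sizeof(int));
}

/* rewritten qual: reference the appended attribute, not the expression */
static int
check_with_extended_row(const int *extrow, int lower, int upper)
{
    int key = extrow[NCOLS + 0];   /* precomputed keyexpr1 result */

    return key >= lower && key < upper;
}
```

The key expression is thus evaluated once per row, however many clauses of the constraint refer to it.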

Attaching a POC patch about this approach.
I did some performance test with my laptop for this patch:

------------------------------------test cheap partition key expression

create unlogged table test_partitioned_inner (a int) partition by range ((abs(a) + a/50));
create unlogged table test_partitioned_inner_1 partition of test_partitioned_inner for values from (1) to (50);
create unlogged table test_partitioned_inner_2 partition of test_partitioned_inner for values from ( 50 ) to (100);
insert into test_partitioned_inner_1 select (i%48)+1 from generate_series(1,10000000,1) t(i);

BEFORE patch:
Execution Time: 6120.706 ms

AFTER patch:
Execution Time: 5705.967 ms

------------------------------------test expensive partition key expression
create or replace function partfunc(i int) returns int as
$$
begin
return i;
end;
$$ language plpgsql IMMUTABLE;

create unlogged table test_partitioned_inner (a int) partition by range (partfunc (a));
create unlogged table test_partitioned_inner_1 partition of test_partitioned_inner for values from (1) to (50);
create unlogged table test_partitioned_inner_2 partition of test_partitioned_inner for values from ( 50 ) to (100);

I think this can be an independent improvement for the partition check.

before patch:
Execution Time: 14048.551 ms

after patch:
Execution Time: 8810.518 ms

I think this patch can solve the performance degradation of key expressions
seen after applying the [Save the last partition] patch.
Besides, this could be a separate patch which improves some more cases.
Thoughts ?

Best regards,
houzj

Attachments:

0001-improving-ExecPartitionCheck.patchapplication/octet-stream; name=0001-improving-ExecPartitionCheck.patchDownload
From 1c2fe94ba8d103d76e41dca2dd0a4198e1e3170f Mon Sep 17 00:00:00 2001
From: houzj <houzj.fnst@cn.fujitsu.com>
Date: Mon, 24 May 2021 08:54:15 +0800
Subject: [PATCH] improving-ExecPartitionCheck

Currently for a normal partition key it will first generate a CHECK expression
Like : [Keyexpression IS NOT NULL AND Keyexpression > lowboud AND Keyexpression < lowboud].
In this case, Keyexpression will be re-executed which will bring some overhead.

Instead, I think we can try to do the following step:
1)extract the Keyexpression from the CHECK expression
2)evaluate the key expression in advance
3)pass the result of key expression to do the partition CHECK.
In this way ,we only execute the key expression once which looks more efficient.

---
 src/backend/commands/tablecmds.c      |   6 +-
 src/backend/executor/execMain.c       | 114 +++++++++++++++++++++-
 src/backend/optimizer/util/plancat.c  |   2 +-
 src/backend/partitioning/partbounds.c | 176 +++++++++++++++++++++++++++++++---
 src/backend/utils/cache/partcache.c   |  48 ++++++++--
 src/backend/utils/cache/relcache.c    |   2 +
 src/include/nodes/execnodes.h         |   9 ++
 src/include/partitioning/partbounds.h |   2 +-
 src/include/partitioning/partdefs.h   |   2 +
 src/include/utils/partcache.h         |   8 +-
 src/include/utils/rel.h               |   1 +
 11 files changed, 344 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ebc6203..d0dc337 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -17279,9 +17279,9 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd,
 	 * If the parent itself is a partition, make sure to include its
 	 * constraint as well.
 	 */
-	partBoundConstraint = get_qual_from_partbound(attachrel, rel, cmd->bound);
+	partBoundConstraint = get_qual_from_partbound(attachrel, rel, cmd->bound, NULL);
 	partConstraint = list_concat(partBoundConstraint,
-								 RelationGetPartitionQual(rel));
+								 RelationGetPartitionQual(rel, NULL));
 
 	/* Skip validation if there are no constraints to validate. */
 	if (partConstraint)
@@ -18083,7 +18083,7 @@ DetachAddConstraintIfNeeded(List **wqueue, Relation partRel)
 {
 	List	   *constraintExpr;
 
-	constraintExpr = RelationGetPartitionQual(partRel);
+	constraintExpr = RelationGetPartitionQual(partRel, NULL);
 	constraintExpr = (List *) eval_const_expressions(NULL, (Node *) constraintExpr);
 
 	/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b3ce4ba..f2da243 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -53,6 +53,7 @@
 #include "jit/jit.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -1699,6 +1700,7 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 {
 	ExprContext *econtext;
 	bool		success;
+	ListCell   *lc;
 
 	/*
 	 * If first time through, build expression state tree for the partition
@@ -1709,12 +1711,80 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 */
 	if (resultRelInfo->ri_PartitionCheckExpr == NULL)
 	{
+		int				i;
+		PartKeyContext	partkeycontext;
+		TupleDesc		tupdesc,
+						coltupdesc;
+		List		   *keyexpr_list;
+
 		/*
 		 * Ensure that the qual tree and prepared expression are in the
 		 * query-lifespan context.
 		 */
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
-		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
+		List	   *qual;
+
+		/*
+		 * Extract the key expressions from the partition check expression to
+		 * avoid re-execution.
+		 */
+
+		/* The attno for key expr starts after the plain column */
+		partkeycontext.keycol_no = slot->tts_tupleDescriptor->natts + 1;
+		partkeycontext.keyexpr_list = NIL;
+		partkeycontext.keyexpr_varattno = NULL;
+
+		qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc,
+										&partkeycontext);
+
+		keyexpr_list = partkeycontext.keyexpr_list;
+		resultRelInfo->ri_PartitionKeyExpr = NIL;
+		resultRelInfo->ri_PartitionKeySlot = NULL;
+
+
+		if (keyexpr_list != NIL)
+		{
+			/*
+			 * Build a slot which contains both the partition key and plain
+			 * column
+			 */
+			coltupdesc = slot->tts_tupleDescriptor;
+
+			tupdesc = CreateTemplateTupleDesc(partkeycontext.keycol_no - 1);
+
+			/* Copy the plain column */
+			TupleDescCopy(tupdesc, coltupdesc);
+
+			/* XXX adjust the natts */
+			tupdesc->natts = partkeycontext.keycol_no - 1;
+
+			/* Save the partition key list */
+			i = coltupdesc->natts + 1;
+			foreach(lc, keyexpr_list)
+			{
+				Node *e = lfirst(lc);
+
+				TupleDescInitEntry(tupdesc, i,
+								   NULL,
+								   exprType(e),
+								   exprTypmod(e),
+								   0);
+
+				TupleDescInitEntryCollation(tupdesc,
+											i,
+											exprCollation(e));
+
+				/* initialize each key expression for execution */
+				resultRelInfo->ri_PartitionKeyExpr =
+					lappend(resultRelInfo->ri_PartitionKeyExpr,
+							ExecPrepareExpr((Expr *) e, estate));
+
+				i++;
+			}
+
+			resultRelInfo->ri_PartitionKeySlot =
+				MakeTupleTableSlot(tupdesc, slot->tts_ops);
+		}
 
 		resultRelInfo->ri_PartitionCheckExpr = ExecPrepareCheck(qual, estate);
 		MemoryContextSwitchTo(oldcxt);
@@ -1726,6 +1796,48 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 */
 	econtext = GetPerTupleExprContext(estate);
 
+	if (resultRelInfo->ri_PartitionKeySlot != NULL)
+		ExecClearTuple(resultRelInfo->ri_PartitionKeySlot);
+
+	/*
+	 * Evaluate the partition expression in advance to avoid re-execution,
+	 * and add the result to the slot to do the partition check.
+	 */
+	foreach(lc, resultRelInfo->ri_PartitionKeyExpr)
+	{
+		Datum		datum;
+		bool		isNull;
+		ExprState  *keystate = lfirst_node(ExprState, lc);
+		int			i = foreach_current_index(lc) +
+						slot->tts_tupleDescriptor->natts;
+
+		econtext->ecxt_scantuple = slot;
+		datum = ExecEvalExprSwitchContext(keystate, econtext, &isNull);
+
+		resultRelInfo->ri_PartitionKeySlot->tts_values[i] = datum;
+		resultRelInfo->ri_PartitionKeySlot->tts_isnull[i] = isNull;
+	}
+
+	/*
+	 * Move the values from original slot to the new slot, then the new data
+	 * is like :
+	 * [col1 , ... colN , keyexpr1's result , ... keyexprN's result]
+	 */
+	if (resultRelInfo->ri_PartitionKeyExpr != NIL)
+	{
+		slot_getallattrs(slot);
+		memcpy(resultRelInfo->ri_PartitionKeySlot->tts_values,
+			   slot->tts_values,
+			   slot->tts_tupleDescriptor->natts * sizeof(Datum));
+
+		memcpy(resultRelInfo->ri_PartitionKeySlot->tts_isnull,
+			   slot->tts_isnull,
+			   slot->tts_tupleDescriptor->natts * sizeof(Datum));
+
+		slot = resultRelInfo->ri_PartitionKeySlot;
+		ExecStoreVirtualTuple(slot);
+	}
+
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c5194fd..f274538 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2399,7 +2399,7 @@ set_baserel_partition_constraint(Relation relation, RelOptInfo *rel)
 	 * implicit-AND format, we'd have to explicitly convert it to explicit-AND
 	 * format and back again.
 	 */
-	partconstr = RelationGetPartitionQual(relation);
+	partconstr = RelationGetPartitionQual(relation, NULL);
 	if (partconstr)
 	{
 		partconstr = (List *) expression_planner((Expr *) partconstr);
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 7925fcc..9c551bd 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -231,14 +231,14 @@ static Oid	get_partition_operator(PartitionKey key, int col,
 static List *get_qual_for_hash(Relation parent, PartitionBoundSpec *spec);
 static List *get_qual_for_list(Relation parent, PartitionBoundSpec *spec);
 static List *get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
-								bool for_default);
+								bool for_default, PartKeyContext *context);
 static void get_range_key_properties(PartitionKey key, int keynum,
 									 PartitionRangeDatum *ldatum,
 									 PartitionRangeDatum *udatum,
 									 ListCell **partexprs_item,
 									 Expr **keyCol,
 									 Const **lower_val, Const **upper_val);
-static List *get_range_nulltest(PartitionKey key);
+static List *get_range_nulltest(PartitionKey key, PartKeyContext *context);
 
 /*
  * get_qual_from_partbound
@@ -247,7 +247,7 @@ static List *get_range_nulltest(PartitionKey key);
  */
 List *
 get_qual_from_partbound(Relation rel, Relation parent,
-						PartitionBoundSpec *spec)
+						PartitionBoundSpec *spec, PartKeyContext *context)
 {
 	PartitionKey key = RelationGetPartitionKey(parent);
 	List	   *my_qual = NIL;
@@ -268,7 +268,7 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 		case PARTITION_STRATEGY_RANGE:
 			Assert(spec->strategy == PARTITION_STRATEGY_RANGE);
-			my_qual = get_qual_for_range(parent, spec, false);
+			my_qual = get_qual_for_range(parent, spec, false, context);
 			break;
 
 		default:
@@ -3153,7 +3153,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 
 	new_part_constraints = (new_spec->strategy == PARTITION_STRATEGY_LIST)
 		? get_qual_for_list(parent, new_spec)
-		: get_qual_for_range(parent, new_spec, false);
+		: get_qual_for_range(parent, new_spec, false, NULL);
 	def_part_constraints =
 		get_proposed_default_constraint(new_part_constraints);
 
@@ -4167,7 +4167,7 @@ get_qual_for_list(Relation parent, PartitionBoundSpec *spec)
  */
 static List *
 get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
-				   bool for_default)
+				   bool for_default, PartKeyContext *context)
 {
 	List	   *result = NIL;
 	ListCell   *cell1,
@@ -4190,6 +4190,13 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 			   *upper_or_start_datum;
 	bool		need_next_lower_arm,
 				need_next_upper_arm;
+	AttrNumber *keyexpr_varattno = NULL;
+	int			cur_keyexpr_no,
+				old_keyexpr_no;
+
+	if (context != NULL && context->keyexpr_varattno == NULL)
+		context->keyexpr_varattno =
+			palloc0(sizeof(AttrNumber) * list_length(key->partexprs));
 
 	if (spec->is_default)
 	{
@@ -4226,7 +4233,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 			{
 				List	   *part_qual;
 
-				part_qual = get_qual_for_range(parent, bspec, true);
+				part_qual = get_qual_for_range(parent, bspec, true, context);
 
 				/*
 				 * AND the constraints of the partition and add to
@@ -4251,7 +4258,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 			 */
 			other_parts_constr =
 				makeBoolExpr(AND_EXPR,
-							 lappend(get_range_nulltest(key),
+							 lappend(get_range_nulltest(key, context),
 									 list_length(or_expr_args) > 1
 									 ? makeBoolExpr(OR_EXPR, or_expr_args,
 													-1)
@@ -4274,7 +4281,13 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 	 * to avoid accumulating the NullTest on the same keys for each partition.
 	 */
 	if (!for_default)
-		result = get_range_nulltest(key);
+		result = get_range_nulltest(key, context);
+
+	if (context != NULL)
+		keyexpr_varattno = context->keyexpr_varattno;
+
+	cur_keyexpr_no = 0;
+	old_keyexpr_no = 0;
 
 	/*
 	 * Iterate over the key columns and check if the corresponding lower and
@@ -4295,6 +4308,8 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		ExprState  *test_exprstate;
 		Datum		test_result;
 		bool		isNull;
+		int			key_attno = 0;
+		bool		varattno_saved = false;
 
 		ldatum = castNode(PartitionRangeDatum, lfirst(cell1));
 		udatum = castNode(PartitionRangeDatum, lfirst(cell2));
@@ -4306,12 +4321,33 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		 */
 		partexprs_item_saved = partexprs_item;
 
+		old_keyexpr_no = cur_keyexpr_no;
+
 		get_range_key_properties(key, i, ldatum, udatum,
 								 &partexprs_item,
 								 &keyCol,
 								 &lower_val, &upper_val);
 
 		/*
+		 * Check if we have saved the same key expression, if so , just get
+		 * the attno from keyexpr_varattno
+		 */
+		if (context != NULL && !IsA(keyCol, Var))
+		{
+			if (keyexpr_varattno[cur_keyexpr_no] != 0)
+			{
+				varattno_saved = true;
+				key_attno = keyexpr_varattno[cur_keyexpr_no];
+			}
+			else
+			{
+				varattno_saved = false;
+				key_attno = context->keycol_no;
+			}
+			cur_keyexpr_no++;
+		}
+
+		/*
 		 * If either value is NULL, the corresponding partition bound is
 		 * either MINVALUE or MAXVALUE, and we treat them as unequal, because
 		 * even if they're the same, there is no common value to equate the
@@ -4346,6 +4382,25 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		if (i == key->partnatts - 1)
 			elog(ERROR, "invalid range bound specification");
 
+		/* If key is not a plain column */
+		if (context != NULL && !IsA(keyCol, Var))
+		{
+			/* Save the keyexpr to keyexpr_list if first time meet */
+			if (!varattno_saved)
+			{
+				context->keyexpr_list = lappend(context->keyexpr_list, keyCol);
+				keyexpr_varattno[old_keyexpr_no] = key_attno;
+				context->keycol_no++;
+			}
+
+			keyCol = (Expr *) makeVar(2,
+									key_attno,
+									key->parttypid[i],
+									key->parttypmod[i],
+									key->parttypcoll[i],
+									0);
+		}
+
 		/* Equal, so generate keyCol = lower_val expression */
 		result = lappend(result,
 						 make_partition_op_expr(key, i, BTEqualStrategyNumber,
@@ -4372,11 +4427,16 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		j = i;
 		partexprs_item = partexprs_item_saved;
 
+		cur_keyexpr_no = old_keyexpr_no;
+
 		for_both_cell(cell1, spec->lowerdatums, lower_or_start_datum,
 					  cell2, spec->upperdatums, upper_or_start_datum)
 		{
 			PartitionRangeDatum *ldatum_next = NULL,
 					   *udatum_next = NULL;
+			int			key_attno = 0;
+			bool		varattno_saved = false;
+			Expr		   *temp_keyCol;
 
 			ldatum = castNode(PartitionRangeDatum, lfirst(cell1));
 			if (lnext(spec->lowerdatums, cell1))
@@ -4391,6 +4451,26 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 									 &keyCol,
 									 &lower_val, &upper_val);
 
+			/*
+			 * Check if we have saved the same key expression, if so , just get
+			 * the attno from keyexpr_varattno
+			 */
+			temp_keyCol = keyCol;
+			if (context != NULL && !IsA(keyCol, Var))
+			{
+				if (keyexpr_varattno[cur_keyexpr_no] != 0)
+				{
+					varattno_saved = true;
+					key_attno = keyexpr_varattno[cur_keyexpr_no];
+				}
+				else
+				{
+					varattno_saved = false;
+					key_attno = context->keycol_no;
+				}
+				cur_keyexpr_no++;
+			}
+
 			if (need_next_lower_arm && lower_val)
 			{
 				uint16		strategy;
@@ -4410,10 +4490,30 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 				else
 					strategy = BTGreaterStrategyNumber;
 
+				/* If key is not a plain column */
+				if (context != NULL && !IsA(keyCol, Var))
+				{
+					/* Save the keyexpr to keyexpr_list if first time meet */
+					if (!varattno_saved)
+					{
+						context->keyexpr_list = lappend(context->keyexpr_list,
+														keyCol);
+						keyexpr_varattno[cur_keyexpr_no - 1] = key_attno;
+						context->keycol_no++;
+					}
+
+					temp_keyCol = (Expr *) makeVar(2,
+											key_attno,
+											key->parttypid[j],
+											key->parttypmod[j],
+											key->parttypcoll[j],
+											0);
+				}
+
 				lower_or_arm_args = lappend(lower_or_arm_args,
 											make_partition_op_expr(key, j,
 																   strategy,
-																   keyCol,
+																   temp_keyCol,
 																   (Expr *) lower_val));
 			}
 
@@ -4434,6 +4534,26 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 				else
 					strategy = BTLessStrategyNumber;
 
+				/* If key is not a plain column */
+				if (context != NULL && !IsA(keyCol, Var))
+				{
+					/* Save the keyexpr to keyexpr_list if first time meet */
+					if (keyexpr_varattno[cur_keyexpr_no - 1] == 0)
+					{
+						context->keyexpr_list = lappend(context->keyexpr_list,
+														keyCol);
+						keyexpr_varattno[cur_keyexpr_no - 1] = key_attno;
+						context->keycol_no++;
+					}
+
+					keyCol = (Expr *) makeVar(2,
+											key_attno,
+											key->parttypid[j],
+											key->parttypmod[j],
+											key->parttypcoll[j],
+											0);
+				}
+
 				upper_or_arm_args = lappend(upper_or_arm_args,
 											make_partition_op_expr(key, j,
 																   strategy,
@@ -4506,7 +4626,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 	 */
 	if (result == NIL)
 		result = for_default
-			? get_range_nulltest(key)
+			? get_range_nulltest(key, context)
 			: list_make1(makeBoolConst(true, false));
 
 	return result;
@@ -4572,13 +4692,15 @@ get_range_key_properties(PartitionKey key, int keynum,
  * keys to be null, so emit an IS NOT NULL expression for each key column.
  */
 static List *
-get_range_nulltest(PartitionKey key)
+get_range_nulltest(PartitionKey key, PartKeyContext *context)
 {
 	List	   *result = NIL;
 	NullTest   *nulltest;
 	ListCell   *partexprs_item;
 	int			i;
+	int			cur_keyexpr_no;
 
+	cur_keyexpr_no = 0;
 	partexprs_item = list_head(key->partexprs);
 	for (i = 0; i < key->partnatts; i++)
 	{
@@ -4598,6 +4720,36 @@ get_range_nulltest(PartitionKey key)
 			if (partexprs_item == NULL)
 				elog(ERROR, "wrong number of partition key expressions");
 			keyCol = copyObject(lfirst(partexprs_item));
+
+			if (context != NULL)
+			{
+				int key_attno;
+
+				if (context->keyexpr_varattno[cur_keyexpr_no] != 0)
+				{
+					key_attno = context->keyexpr_varattno[cur_keyexpr_no];
+				}
+				else
+				{
+					key_attno = context->keycol_no;
+
+					context->keyexpr_list = lappend(context->keyexpr_list,
+													keyCol);
+					context->keyexpr_varattno[cur_keyexpr_no] = key_attno;
+
+					context->keycol_no++;
+				}
+
+				cur_keyexpr_no++;
+
+				keyCol = (Expr *) makeVar(2,
+										key_attno,
+										key->parttypid[i],
+										key->parttypmod[i],
+										key->parttypcoll[i],
+										0);
+			}
+
 			partexprs_item = lnext(key->partexprs, partexprs_item);
 		}
 
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 21e60f0..117717b 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -38,7 +38,7 @@
 
 
 static void RelationBuildPartitionKey(Relation relation);
-static List *generate_partition_qual(Relation rel);
+static List *generate_partition_qual(Relation rel, PartKeyContext *context);
 
 /*
  * RelationGetPartitionKey -- get partition key, if relation is partitioned
@@ -273,13 +273,13 @@ RelationBuildPartitionKey(Relation relation)
  * Returns a list of partition quals
  */
 List *
-RelationGetPartitionQual(Relation rel)
+RelationGetPartitionQual(Relation rel, PartKeyContext *context)
 {
 	/* Quick exit */
 	if (!rel->rd_rel->relispartition)
 		return NIL;
 
-	return generate_partition_qual(rel);
+	return generate_partition_qual(rel, context);
 }
 
 /*
@@ -305,7 +305,7 @@ get_partition_qual_relid(Oid relid)
 		Relation	rel = relation_open(relid, AccessShareLock);
 		List	   *and_args;
 
-		and_args = generate_partition_qual(rel);
+		and_args = generate_partition_qual(rel, NULL);
 
 		/* Convert implicit-AND list format to boolean expression */
 		if (and_args == NIL)
@@ -333,7 +333,7 @@ get_partition_qual_relid(Oid relid)
  * into long-lived cache contexts, especially if we fail partway through.
  */
 static List *
-generate_partition_qual(Relation rel)
+generate_partition_qual(Relation rel, PartKeyContext *context)
 {
 	HeapTuple	tuple;
 	MemoryContext oldcxt;
@@ -349,7 +349,18 @@ generate_partition_qual(Relation rel)
 
 	/* If we already cached the result, just return a copy */
 	if (rel->rd_partcheckvalid)
+	{
+		if (context != NULL)
+		{
+			context->keyexpr_list = rel->rd_keyexpr_list;
+			context->keycol_no += list_length(context->keyexpr_list);
+		}
+
 		return copyObject(rel->rd_partcheck);
+	}
+
+	if (context != NULL)
+		context->keyexpr_varattno = NULL;
 
 	/*
 	 * Grab at least an AccessShareLock on the parent table.  Must do this
@@ -376,14 +387,27 @@ generate_partition_qual(Relation rel)
 		bound = castNode(PartitionBoundSpec,
 						 stringToNode(TextDatumGetCString(boundDatum)));
 
-		my_qual = get_qual_from_partbound(rel, parent, bound);
+		my_qual = get_qual_from_partbound(rel, parent, bound, context);
 	}
 
 	ReleaseSysCache(tuple);
 
 	/* Add the parent's quals to the list (if any) */
 	if (parent->rd_rel->relispartition)
-		result = list_concat(generate_partition_qual(parent), my_qual);
+	{
+		List *cur_keyexpr_list;
+		if (context != NULL)
+		{
+			cur_keyexpr_list = context->keyexpr_list;
+			context->keyexpr_list = NIL;
+		}
+
+		result = list_concat(generate_partition_qual(parent, context), my_qual);
+
+		if (context != NULL)
+			context->keyexpr_list = list_concat(context->keyexpr_list,
+					cur_keyexpr_list);
+	}
 	else
 		result = my_qual;
 
@@ -394,10 +418,14 @@ generate_partition_qual(Relation rel)
 	 * here.
 	 */
 	result = map_partition_varattnos(result, 1, rel, parent);
+	if (context != NULL)
+		context->keyexpr_list = map_partition_varattnos(context->keyexpr_list,
+														1, rel, parent);
 
 	/* Assert that we're not leaking any old data during assignments below */
 	Assert(rel->rd_partcheckcxt == NULL);
 	Assert(rel->rd_partcheck == NIL);
+	Assert(rel->rd_keyexpr_list == NIL);
 
 	/*
 	 * Save a copy in the relcache.  The order of these operations is fairly
@@ -416,10 +444,16 @@ generate_partition_qual(Relation rel)
 										  RelationGetRelationName(rel));
 		oldcxt = MemoryContextSwitchTo(rel->rd_partcheckcxt);
 		rel->rd_partcheck = copyObject(result);
+		if (context != NULL)
+			rel->rd_keyexpr_list = copyObject(context->keyexpr_list);
 		MemoryContextSwitchTo(oldcxt);
 	}
 	else
+	{
 		rel->rd_partcheck = NIL;
+		rel->rd_keyexpr_list = NIL;
+	}
+
 	rel->rd_partcheckvalid = true;
 
 	/* Keep the parent locked until commit */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fd05615..f7527da 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1161,6 +1161,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	relation->rd_pdcxt = NULL;
 	relation->rd_pddcxt = NULL;
 	relation->rd_partcheck = NIL;
+	relation->rd_keyexpr_list = NIL;
 	relation->rd_partcheckvalid = false;
 	relation->rd_partcheckcxt = NULL;
 
@@ -6041,6 +6042,7 @@ load_relcache_init_file(bool shared)
 		rel->rd_pdcxt = NULL;
 		rel->rd_pddcxt = NULL;
 		rel->rd_partcheck = NIL;
+		rel->rd_keyexpr_list = NIL;
 		rel->rd_partcheckvalid = false;
 		rel->rd_partcheckcxt = NULL;
 		rel->rd_indexprs = NIL;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69..8dd457c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -497,6 +497,15 @@ typedef struct ResultRelInfo
 	ExprState  *ri_PartitionCheckExpr;
 
 	/*
+	 * Partition Key expressions that used in PartitionCheckExpr
+	 * (NULL if not set up yet)
+	 */
+	List  *ri_PartitionKeyExpr;
+
+	/* Used to evaluate the PartitionCheckExpr (NULL if not set up yet) */
+	TupleTableSlot *ri_PartitionKeySlot;
+
+	/*
 	 * Information needed by tuple routing target relations
 	 *
 	 * RootResultRelInfo gives the target relation mentioned in the query, if
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index ebf3ff1..800d4dc 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -86,7 +86,7 @@ extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
 										   Oid *partcollation,
 										   Datum *values, bool *isnull);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
-									 PartitionBoundSpec *spec);
+									 PartitionBoundSpec *spec, PartKeyContext *context);
 extern PartitionBoundInfo partition_bounds_create(PartitionBoundSpec **boundspecs,
 												  int nparts, PartitionKey key, int **mapping);
 extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index d742b96..be5591c 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -23,4 +23,6 @@ typedef struct PartitionDescData *PartitionDesc;
 
 typedef struct PartitionDirectoryData *PartitionDirectory;
 
+typedef struct PartKeyContext PartKeyContext;
+
 #endif							/* PARTDEFS_H */
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index a451bfb..6f1bcbb 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -46,9 +46,15 @@ typedef struct PartitionKeyData
 	Oid		   *parttypcoll;
 }			PartitionKeyData;
 
+typedef struct PartKeyContext
+{
+	int keycol_no;
+	List *keyexpr_list;
+	AttrNumber *keyexpr_varattno;
+} PartKeyContext;
 
 extern PartitionKey RelationGetPartitionKey(Relation rel);
-extern List *RelationGetPartitionQual(Relation rel);
+extern List *RelationGetPartitionQual(Relation rel, PartKeyContext *context);
 extern Expr *get_partition_qual_relid(Oid relid);
 
 /*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 774ac5b..111287b 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -143,6 +143,7 @@ typedef struct RelationData
 
 	/* data managed by RelationGetPartitionQual: */
 	List	   *rd_partcheck;	/* partition CHECK quals */
+	List	   *rd_keyexpr_list;	/* partition key exprs used in CHECK quals */
 	bool		rd_partcheckvalid;	/* true if list has been computed */
 	MemoryContext rd_partcheckcxt;	/* private cxt for rd_partcheck, if any */
 
-- 
2.7.2.windows.1

#17tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: houzj.fnst@fujitsu.com (#16)
RE: Skip partition tuple routing with constant partition key

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>

I think this patch can solve the performance degradation of key expression after
applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more cases.
Thoughts ?

Thank you for proposing an impressive improvement so quickly! Yes, I'm in the mood for adopting Amit-san's patch as a base, because it's compact and readable, and then adding this patch of yours to complement the partition key function case.

But ...

* Applying your patch alone produced a compilation error. I'm sorry I mistakenly deleted the compile log, but it said something like "There's a redeclaration of PartKeyContext in partcache.h; the original definition is in partdef.h"

* Hmm, this may be too much to expect, but I wonder if we can make the patch more compact...

Regards
Takayuki Tsunakawa

#18houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: tsunakawa.takay@fujitsu.com (#17)
RE: Skip partition tuple routing with constant partition key

From: Tsunakawa, Takayuki <tsunakawa.takay@fujitsu.com>
Sent: Monday, May 24, 2021 3:34 PM

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>

I think this patch can solve the performance degradation of key
expression after applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more cases.
Thoughts ?

Thank you for proposing an impressive improvement so quickly! Yes, I'm in
the mood for adopting Amit-san's patch as a base because it's compact and
readable, and plus add this patch of yours to complement the partition key
function case.

Thanks for looking into this.

But ...

* Applying your patch alone produced a compilation error. I'm sorry I
mistakenly deleted the compile log, but it said something like "There's a
redeclaration of PartKeyContext in partcache.h; the original definition is in
partdef.h"

That seems a little strange; I compiled it alone on two different Linux machines and did
not see such an error. Did you compile it on a Windows machine?

* Hmm, this may be too much to expect, but I wonder if we can make the patch
more compact...

Of course, I will try to simplify the patch.

Best regards,
houzj

#19tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: houzj.fnst@fujitsu.com (#18)
RE: Skip partition tuple routing with constant partition key

From: Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com>

It seems a little strange, I have compiled it alone in two different linux machine
and did
not find such an error. Did you compile it on a windows machine ?

On Linux, it produces:

gcc -std=gnu99 -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-s\
tatement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-securit\
y -fno-strict-aliasing -fwrapv -g -O0 -I../../../src/include -D_GNU_SOURCE -\
c -o heap.o heap.c -MMD -MP -MF .deps/heap.Po
In file included from heap.c:86:
../../../src/include/utils/partcache.h:54: error: redefinition of typedef 'Part\
KeyContext'
../../../src/include/partitioning/partdefs.h:26: note: previous declaration of \
'PartKeyContext' was here

Regards
Takayuki Tsunakawa

#20houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: houzj.fnst@fujitsu.com (#18)
1 attachment(s)
RE: Skip partition tuple routing with constant partition key

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>
Sent: Monday, May 24, 2021 3:58 PM

From: Tsunakawa, Takayuki <tsunakawa.takay@fujitsu.com>
Sent: Monday, May 24, 2021 3:34 PM

From: houzj.fnst@fujitsu.com <houzj.fnst@fujitsu.com>

I think this patch can solve the performance degradation of key
expression after applying the [Save the last partition] patch.
Besides, this could be a separate patch which can improve some more cases.

Thoughts ?

Thank you for proposing an impressive improvement so quickly! Yes,
I'm in the mood for adopting Amit-san's patch as a base because it's
compact and readable, and plus add this patch of yours to complement
the partition key function case.

Thanks for looking into this.

But ...

* Applying your patch alone produced a compilation error. I'm sorry I
mistakenly deleted the compile log, but it said something like
"There's a redeclaration of PartKeyContext in partcache.h; the
original definition is in partdef.h"

It seems a little strange, I have compiled it alone in two different linux machine
and did not find such an error. Did you compile it on a windows machine ?

Ah, I may have found the issue.
I'm attaching a new patch; please give it a try.

Best regards,
houzj

Attachments:

0001-improving-ExecPartitionCheck.patchapplication/octet-stream; name=0001-improving-ExecPartitionCheck.patchDownload
From 1c2fe94ba8d103d76e41dca2dd0a4198e1e3170f Mon Sep 17 00:00:00 2001
From: houzj <houzj.fnst@cn.fujitsu.com>
Date: Mon, 24 May 2021 08:54:15 +0800
Subject: [PATCH] improving-ExecPartitionCheck

Currently, for a partition key, a CHECK expression is generated of the form
[Keyexpression IS NOT NULL AND Keyexpression > lowerbound AND Keyexpression < upperbound].
In this case, Keyexpression is re-evaluated for each clause, which brings some overhead.

Instead, I think we can take the following steps:
1) extract the Keyexpression from the CHECK expression
2) evaluate the key expression in advance
3) pass the result of the key expression to the partition CHECK.
This way, the key expression is executed only once, which looks more efficient.
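The re-evaluation overhead described above can be sketched outside PostgreSQL. This is an illustrative Python model, not part of the patch: `expensive_key_expr` is a hypothetical stand-in for a UDF-based partition key, and the call counter makes the difference between the two check strategies visible.

```python
def expensive_key_expr(col, calls):
    calls[0] += 1          # count evaluations of the key expression
    return col * 10        # stand-in for a UDF-based partition key

def check_naive(col, lower, upper, calls):
    # Re-evaluates the key expression in every clause, as the generated
    # CHECK expression [expr IS NOT NULL AND expr >= lower AND expr < upper] does.
    return (expensive_key_expr(col, calls) is not None
            and expensive_key_expr(col, calls) >= lower
            and expensive_key_expr(col, calls) < upper)

def check_cached(col, lower, upper, calls):
    # Evaluate once in advance, then reuse the result for all clauses.
    v = expensive_key_expr(col, calls)
    return v is not None and lower <= v < upper

naive_calls, cached_calls = [0], [0]
assert check_naive(5, 0, 100, naive_calls)
assert check_cached(5, 0, 100, cached_calls)
print(naive_calls[0], cached_calls[0])  # 3 vs 1 evaluations per row
```

For a cheap key expression the difference is noise, but for a slow UDF evaluated per row it compounds, which matches the >20% gain reported earlier in the thread for UDF-based keys.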

---
 src/backend/commands/tablecmds.c      |   6 +-
 src/backend/executor/execMain.c       | 114 +++++++++++++++++++++-
 src/backend/optimizer/util/plancat.c  |   2 +-
 src/backend/partitioning/partbounds.c | 176 +++++++++++++++++++++++++++++++---
 src/backend/utils/cache/partcache.c   |  48 ++++++++--
 src/backend/utils/cache/relcache.c    |   2 +
 src/include/nodes/execnodes.h         |   9 ++
 src/include/partitioning/partbounds.h |   2 +-
 src/include/partitioning/partdefs.h   |   2 +
 src/include/utils/partcache.h         |   8 +-
 src/include/utils/rel.h               |   1 +
 11 files changed, 344 insertions(+), 26 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index ebc6203..d0dc337 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -17279,9 +17279,9 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd,
 	 * If the parent itself is a partition, make sure to include its
 	 * constraint as well.
 	 */
-	partBoundConstraint = get_qual_from_partbound(attachrel, rel, cmd->bound);
+	partBoundConstraint = get_qual_from_partbound(attachrel, rel, cmd->bound, NULL);
 	partConstraint = list_concat(partBoundConstraint,
-								 RelationGetPartitionQual(rel));
+								 RelationGetPartitionQual(rel, NULL));
 
 	/* Skip validation if there are no constraints to validate. */
 	if (partConstraint)
@@ -18083,7 +18083,7 @@ DetachAddConstraintIfNeeded(List **wqueue, Relation partRel)
 {
 	List	   *constraintExpr;
 
-	constraintExpr = RelationGetPartitionQual(partRel);
+	constraintExpr = RelationGetPartitionQual(partRel, NULL);
 	constraintExpr = (List *) eval_const_expressions(NULL, (Node *) constraintExpr);
 
 	/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b3ce4ba..f2da243 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -53,6 +53,7 @@
 #include "jit/jit.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "nodes/nodeFuncs.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
 #include "storage/lmgr.h"
@@ -1699,6 +1700,7 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 {
 	ExprContext *econtext;
 	bool		success;
+	ListCell   *lc;
 
 	/*
 	 * If first time through, build expression state tree for the partition
@@ -1709,12 +1711,80 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 */
 	if (resultRelInfo->ri_PartitionCheckExpr == NULL)
 	{
+		int				i;
+		PartKeyContext	partkeycontext;
+		TupleDesc		tupdesc,
+						coltupdesc;
+		List		   *keyexpr_list;
+
 		/*
 		 * Ensure that the qual tree and prepared expression are in the
 		 * query-lifespan context.
 		 */
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
-		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
+		List	   *qual;
+
+		/*
+		 * Extract the key expressions from the partition check expression to
+		 * avoid re-execution.
+		 */
+
+		/* The attno for key expr starts after the plain column */
+		partkeycontext.keycol_no = slot->tts_tupleDescriptor->natts + 1;
+		partkeycontext.keyexpr_list = NIL;
+		partkeycontext.keyexpr_varattno = NULL;
+
+		qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc,
+										&partkeycontext);
+
+		keyexpr_list = partkeycontext.keyexpr_list;
+		resultRelInfo->ri_PartitionKeyExpr = NIL;
+		resultRelInfo->ri_PartitionKeySlot = NULL;
+
+
+		if (keyexpr_list != NIL)
+		{
+			/*
+			 * Build a slot which contains both the partition key and plain
+			 * column
+			 */
+			coltupdesc = slot->tts_tupleDescriptor;
+
+			tupdesc = CreateTemplateTupleDesc(partkeycontext.keycol_no - 1);
+
+			/* Copy the plain column */
+			TupleDescCopy(tupdesc, coltupdesc);
+
+			/* XXX adjust the natts */
+			tupdesc->natts = partkeycontext.keycol_no - 1;
+
+			/* Save the partition key list */
+			i = coltupdesc->natts + 1;
+			foreach(lc, keyexpr_list)
+			{
+				Node *e = lfirst(lc);
+
+				TupleDescInitEntry(tupdesc, i,
+								   NULL,
+								   exprType(e),
+								   exprTypmod(e),
+								   0);
+
+				TupleDescInitEntryCollation(tupdesc,
+											i,
+											exprCollation(e));
+
+				/* initialize each key expression for execution */
+				resultRelInfo->ri_PartitionKeyExpr =
+					lappend(resultRelInfo->ri_PartitionKeyExpr,
+							ExecPrepareExpr((Expr *) e, estate));
+
+				i++;
+			}
+
+			resultRelInfo->ri_PartitionKeySlot =
+				MakeTupleTableSlot(tupdesc, slot->tts_ops);
+		}
 
 		resultRelInfo->ri_PartitionCheckExpr = ExecPrepareCheck(qual, estate);
 		MemoryContextSwitchTo(oldcxt);
@@ -1726,6 +1796,48 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 */
 	econtext = GetPerTupleExprContext(estate);
 
+	if (resultRelInfo->ri_PartitionKeySlot != NULL)
+		ExecClearTuple(resultRelInfo->ri_PartitionKeySlot);
+
+	/*
+	 * Evaluate the partition expression in advance to avoid re-execution,
+	 * and add the result to the slot to do the partition check.
+	 */
+	foreach(lc, resultRelInfo->ri_PartitionKeyExpr)
+	{
+		Datum		datum;
+		bool		isNull;
+		ExprState  *keystate = lfirst_node(ExprState, lc);
+		int			i = foreach_current_index(lc) +
+						slot->tts_tupleDescriptor->natts;
+
+		econtext->ecxt_scantuple = slot;
+		datum = ExecEvalExprSwitchContext(keystate, econtext, &isNull);
+
+		resultRelInfo->ri_PartitionKeySlot->tts_values[i] = datum;
+		resultRelInfo->ri_PartitionKeySlot->tts_isnull[i] = isNull;
+	}
+
+	/*
+	 * Move the values from original slot to the new slot, then the new data
+	 * is like :
+	 * [col1 , ... colN , keyexpr1's result , ... keyexprN's result]
+	 */
+	if (resultRelInfo->ri_PartitionKeyExpr != NIL)
+	{
+		slot_getallattrs(slot);
+		memcpy(resultRelInfo->ri_PartitionKeySlot->tts_values,
+			   slot->tts_values,
+			   slot->tts_tupleDescriptor->natts * sizeof(Datum));
+
+		memcpy(resultRelInfo->ri_PartitionKeySlot->tts_isnull,
+			   slot->tts_isnull,
+			   slot->tts_tupleDescriptor->natts * sizeof(Datum));
+
+		slot = resultRelInfo->ri_PartitionKeySlot;
+		ExecStoreVirtualTuple(slot);
+	}
+
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c5194fd..f274538 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2399,7 +2399,7 @@ set_baserel_partition_constraint(Relation relation, RelOptInfo *rel)
 	 * implicit-AND format, we'd have to explicitly convert it to explicit-AND
 	 * format and back again.
 	 */
-	partconstr = RelationGetPartitionQual(relation);
+	partconstr = RelationGetPartitionQual(relation, NULL);
 	if (partconstr)
 	{
 		partconstr = (List *) expression_planner((Expr *) partconstr);
diff --git a/src/backend/partitioning/partbounds.c b/src/backend/partitioning/partbounds.c
index 7925fcc..9c551bd 100644
--- a/src/backend/partitioning/partbounds.c
+++ b/src/backend/partitioning/partbounds.c
@@ -231,14 +231,14 @@ static Oid	get_partition_operator(PartitionKey key, int col,
 static List *get_qual_for_hash(Relation parent, PartitionBoundSpec *spec);
 static List *get_qual_for_list(Relation parent, PartitionBoundSpec *spec);
 static List *get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
-								bool for_default);
+								bool for_default, PartKeyContext *context);
 static void get_range_key_properties(PartitionKey key, int keynum,
 									 PartitionRangeDatum *ldatum,
 									 PartitionRangeDatum *udatum,
 									 ListCell **partexprs_item,
 									 Expr **keyCol,
 									 Const **lower_val, Const **upper_val);
-static List *get_range_nulltest(PartitionKey key);
+static List *get_range_nulltest(PartitionKey key, PartKeyContext *context);
 
 /*
  * get_qual_from_partbound
@@ -247,7 +247,7 @@ static List *get_range_nulltest(PartitionKey key);
  */
 List *
 get_qual_from_partbound(Relation rel, Relation parent,
-						PartitionBoundSpec *spec)
+						PartitionBoundSpec *spec, PartKeyContext *context)
 {
 	PartitionKey key = RelationGetPartitionKey(parent);
 	List	   *my_qual = NIL;
@@ -268,7 +268,7 @@ get_qual_from_partbound(Relation rel, Relation parent,
 
 		case PARTITION_STRATEGY_RANGE:
 			Assert(spec->strategy == PARTITION_STRATEGY_RANGE);
-			my_qual = get_qual_for_range(parent, spec, false);
+			my_qual = get_qual_for_range(parent, spec, false, context);
 			break;
 
 		default:
@@ -3153,7 +3153,7 @@ check_default_partition_contents(Relation parent, Relation default_rel,
 
 	new_part_constraints = (new_spec->strategy == PARTITION_STRATEGY_LIST)
 		? get_qual_for_list(parent, new_spec)
-		: get_qual_for_range(parent, new_spec, false);
+		: get_qual_for_range(parent, new_spec, false, NULL);
 	def_part_constraints =
 		get_proposed_default_constraint(new_part_constraints);
 
@@ -4167,7 +4167,7 @@ get_qual_for_list(Relation parent, PartitionBoundSpec *spec)
  */
 static List *
 get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
-				   bool for_default)
+				   bool for_default, PartKeyContext *context)
 {
 	List	   *result = NIL;
 	ListCell   *cell1,
@@ -4190,6 +4190,13 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 			   *upper_or_start_datum;
 	bool		need_next_lower_arm,
 				need_next_upper_arm;
+	AttrNumber *keyexpr_varattno = NULL;
+	int			cur_keyexpr_no,
+				old_keyexpr_no;
+
+	if (context != NULL && context->keyexpr_varattno == NULL)
+		context->keyexpr_varattno =
+			palloc0(sizeof(AttrNumber) * list_length(key->partexprs));
 
 	if (spec->is_default)
 	{
@@ -4226,7 +4233,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 			{
 				List	   *part_qual;
 
-				part_qual = get_qual_for_range(parent, bspec, true);
+				part_qual = get_qual_for_range(parent, bspec, true, context);
 
 				/*
 				 * AND the constraints of the partition and add to
@@ -4251,7 +4258,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 			 */
 			other_parts_constr =
 				makeBoolExpr(AND_EXPR,
-							 lappend(get_range_nulltest(key),
+							 lappend(get_range_nulltest(key, context),
 									 list_length(or_expr_args) > 1
 									 ? makeBoolExpr(OR_EXPR, or_expr_args,
 													-1)
@@ -4274,7 +4281,13 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 	 * to avoid accumulating the NullTest on the same keys for each partition.
 	 */
 	if (!for_default)
-		result = get_range_nulltest(key);
+		result = get_range_nulltest(key, context);
+
+	if (context != NULL)
+		keyexpr_varattno = context->keyexpr_varattno;
+
+	cur_keyexpr_no = 0;
+	old_keyexpr_no = 0;
 
 	/*
 	 * Iterate over the key columns and check if the corresponding lower and
@@ -4295,6 +4308,8 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		ExprState  *test_exprstate;
 		Datum		test_result;
 		bool		isNull;
+		int			key_attno = 0;
+		bool		varattno_saved = false;
 
 		ldatum = castNode(PartitionRangeDatum, lfirst(cell1));
 		udatum = castNode(PartitionRangeDatum, lfirst(cell2));
@@ -4306,12 +4321,33 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		 */
 		partexprs_item_saved = partexprs_item;
 
+		old_keyexpr_no = cur_keyexpr_no;
+
 		get_range_key_properties(key, i, ldatum, udatum,
 								 &partexprs_item,
 								 &keyCol,
 								 &lower_val, &upper_val);
 
 		/*
+		 * Check if we have saved the same key expression, if so , just get
+		 * the attno from keyexpr_varattno
+		 */
+		if (context != NULL && !IsA(keyCol, Var))
+		{
+			if (keyexpr_varattno[cur_keyexpr_no] != 0)
+			{
+				varattno_saved = true;
+				key_attno = keyexpr_varattno[cur_keyexpr_no];
+			}
+			else
+			{
+				varattno_saved = false;
+				key_attno = context->keycol_no;
+			}
+			cur_keyexpr_no++;
+		}
+
+		/*
 		 * If either value is NULL, the corresponding partition bound is
 		 * either MINVALUE or MAXVALUE, and we treat them as unequal, because
 		 * even if they're the same, there is no common value to equate the
@@ -4346,6 +4382,25 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		if (i == key->partnatts - 1)
 			elog(ERROR, "invalid range bound specification");
 
+		/* If key is not a plain column */
+		if (context != NULL && !IsA(keyCol, Var))
+		{
+			/* Save the keyexpr to keyexpr_list if first time meet */
+			if (!varattno_saved)
+			{
+				context->keyexpr_list = lappend(context->keyexpr_list, keyCol);
+				keyexpr_varattno[old_keyexpr_no] = key_attno;
+				context->keycol_no++;
+			}
+
+			keyCol = (Expr *) makeVar(2,
+									key_attno,
+									key->parttypid[i],
+									key->parttypmod[i],
+									key->parttypcoll[i],
+									0);
+		}
+
 		/* Equal, so generate keyCol = lower_val expression */
 		result = lappend(result,
 						 make_partition_op_expr(key, i, BTEqualStrategyNumber,
@@ -4372,11 +4427,16 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 		j = i;
 		partexprs_item = partexprs_item_saved;
 
+		cur_keyexpr_no = old_keyexpr_no;
+
 		for_both_cell(cell1, spec->lowerdatums, lower_or_start_datum,
 					  cell2, spec->upperdatums, upper_or_start_datum)
 		{
 			PartitionRangeDatum *ldatum_next = NULL,
 					   *udatum_next = NULL;
+			int			key_attno = 0;
+			bool		varattno_saved = false;
+			Expr		   *temp_keyCol;
 
 			ldatum = castNode(PartitionRangeDatum, lfirst(cell1));
 			if (lnext(spec->lowerdatums, cell1))
@@ -4391,6 +4451,26 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 									 &keyCol,
 									 &lower_val, &upper_val);
 
+			/*
+			 * Check if we have saved the same key expression, if so , just get
+			 * the attno from keyexpr_varattno
+			 */
+			temp_keyCol = keyCol;
+			if (context != NULL && !IsA(keyCol, Var))
+			{
+				if (keyexpr_varattno[cur_keyexpr_no] != 0)
+				{
+					varattno_saved = true;
+					key_attno = keyexpr_varattno[cur_keyexpr_no];
+				}
+				else
+				{
+					varattno_saved = false;
+					key_attno = context->keycol_no;
+				}
+				cur_keyexpr_no++;
+			}
+
 			if (need_next_lower_arm && lower_val)
 			{
 				uint16		strategy;
@@ -4410,10 +4490,30 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 				else
 					strategy = BTGreaterStrategyNumber;
 
+				/* If key is not a plain column */
+				if (context != NULL && !IsA(keyCol, Var))
+				{
+					/* Save the keyexpr to keyexpr_list if first time meet */
+					if (!varattno_saved)
+					{
+						context->keyexpr_list = lappend(context->keyexpr_list,
+														keyCol);
+						keyexpr_varattno[cur_keyexpr_no - 1] = key_attno;
+						context->keycol_no++;
+					}
+
+					temp_keyCol = (Expr *) makeVar(2,
+											key_attno,
+											key->parttypid[j],
+											key->parttypmod[j],
+											key->parttypcoll[j],
+											0);
+				}
+
 				lower_or_arm_args = lappend(lower_or_arm_args,
 											make_partition_op_expr(key, j,
 																   strategy,
-																   keyCol,
+																   temp_keyCol,
 																   (Expr *) lower_val));
 			}
 
@@ -4434,6 +4534,26 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 				else
 					strategy = BTLessStrategyNumber;
 
+				/* If key is not a plain column */
+				if (context != NULL && !IsA(keyCol, Var))
+				{
+					/* Save the keyexpr to keyexpr_list if first time meet */
+					if (keyexpr_varattno[cur_keyexpr_no - 1] == 0)
+					{
+						context->keyexpr_list = lappend(context->keyexpr_list,
+														keyCol);
+						keyexpr_varattno[cur_keyexpr_no - 1] = key_attno;
+						context->keycol_no++;
+					}
+
+					keyCol = (Expr *) makeVar(2,
+											key_attno,
+											key->parttypid[j],
+											key->parttypmod[j],
+											key->parttypcoll[j],
+											0);
+				}
+
 				upper_or_arm_args = lappend(upper_or_arm_args,
 											make_partition_op_expr(key, j,
 																   strategy,
@@ -4506,7 +4626,7 @@ get_qual_for_range(Relation parent, PartitionBoundSpec *spec,
 	 */
 	if (result == NIL)
 		result = for_default
-			? get_range_nulltest(key)
+			? get_range_nulltest(key, context)
 			: list_make1(makeBoolConst(true, false));
 
 	return result;
@@ -4572,13 +4692,15 @@ get_range_key_properties(PartitionKey key, int keynum,
  * keys to be null, so emit an IS NOT NULL expression for each key column.
  */
 static List *
-get_range_nulltest(PartitionKey key)
+get_range_nulltest(PartitionKey key, PartKeyContext *context)
 {
 	List	   *result = NIL;
 	NullTest   *nulltest;
 	ListCell   *partexprs_item;
 	int			i;
+	int			cur_keyexpr_no;
 
+	cur_keyexpr_no = 0;
 	partexprs_item = list_head(key->partexprs);
 	for (i = 0; i < key->partnatts; i++)
 	{
@@ -4598,6 +4720,36 @@ get_range_nulltest(PartitionKey key)
 			if (partexprs_item == NULL)
 				elog(ERROR, "wrong number of partition key expressions");
 			keyCol = copyObject(lfirst(partexprs_item));
+
+			if (context != NULL)
+			{
+				int key_attno;
+
+				if (context->keyexpr_varattno[cur_keyexpr_no] != 0)
+				{
+					key_attno = context->keyexpr_varattno[cur_keyexpr_no];
+				}
+				else
+				{
+					key_attno = context->keycol_no;
+
+					context->keyexpr_list = lappend(context->keyexpr_list,
+													keyCol);
+					context->keyexpr_varattno[cur_keyexpr_no] = key_attno;
+
+					context->keycol_no++;
+				}
+
+				cur_keyexpr_no++;
+
+				keyCol = (Expr *) makeVar(2,
+										key_attno,
+										key->parttypid[i],
+										key->parttypmod[i],
+										key->parttypcoll[i],
+										0);
+			}
+
 			partexprs_item = lnext(key->partexprs, partexprs_item);
 		}
 
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 21e60f0..117717b 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -38,7 +38,7 @@
 
 
 static void RelationBuildPartitionKey(Relation relation);
-static List *generate_partition_qual(Relation rel);
+static List *generate_partition_qual(Relation rel, PartKeyContext *context);
 
 /*
  * RelationGetPartitionKey -- get partition key, if relation is partitioned
@@ -273,13 +273,13 @@ RelationBuildPartitionKey(Relation relation)
  * Returns a list of partition quals
  */
 List *
-RelationGetPartitionQual(Relation rel)
+RelationGetPartitionQual(Relation rel, PartKeyContext *context)
 {
 	/* Quick exit */
 	if (!rel->rd_rel->relispartition)
 		return NIL;
 
-	return generate_partition_qual(rel);
+	return generate_partition_qual(rel, context);
 }
 
 /*
@@ -305,7 +305,7 @@ get_partition_qual_relid(Oid relid)
 		Relation	rel = relation_open(relid, AccessShareLock);
 		List	   *and_args;
 
-		and_args = generate_partition_qual(rel);
+		and_args = generate_partition_qual(rel, NULL);
 
 		/* Convert implicit-AND list format to boolean expression */
 		if (and_args == NIL)
@@ -333,7 +333,7 @@ get_partition_qual_relid(Oid relid)
  * into long-lived cache contexts, especially if we fail partway through.
  */
 static List *
-generate_partition_qual(Relation rel)
+generate_partition_qual(Relation rel, PartKeyContext *context)
 {
 	HeapTuple	tuple;
 	MemoryContext oldcxt;
@@ -349,7 +349,18 @@ generate_partition_qual(Relation rel)
 
 	/* If we already cached the result, just return a copy */
 	if (rel->rd_partcheckvalid)
+	{
+		if (context != NULL)
+		{
+			context->keyexpr_list = rel->rd_keyexpr_list;
+			context->keycol_no += list_length(context->keyexpr_list);
+		}
+
 		return copyObject(rel->rd_partcheck);
+	}
+
+	if (context != NULL)
+		context->keyexpr_varattno = NULL;
 
 	/*
 	 * Grab at least an AccessShareLock on the parent table.  Must do this
@@ -376,14 +387,27 @@ generate_partition_qual(Relation rel)
 		bound = castNode(PartitionBoundSpec,
 						 stringToNode(TextDatumGetCString(boundDatum)));
 
-		my_qual = get_qual_from_partbound(rel, parent, bound);
+		my_qual = get_qual_from_partbound(rel, parent, bound, context);
 	}
 
 	ReleaseSysCache(tuple);
 
 	/* Add the parent's quals to the list (if any) */
 	if (parent->rd_rel->relispartition)
-		result = list_concat(generate_partition_qual(parent), my_qual);
+	{
+		List *cur_keyexpr_list;
+		if (context != NULL)
+		{
+			cur_keyexpr_list = context->keyexpr_list;
+			context->keyexpr_list = NIL;
+		}
+
+		result = list_concat(generate_partition_qual(parent, context), my_qual);
+
+		if (context != NULL)
+			context->keyexpr_list = list_concat(context->keyexpr_list,
+					cur_keyexpr_list);
+	}
 	else
 		result = my_qual;
 
@@ -394,10 +418,14 @@ generate_partition_qual(Relation rel)
 	 * here.
 	 */
 	result = map_partition_varattnos(result, 1, rel, parent);
+	if (context != NULL)
+		context->keyexpr_list = map_partition_varattnos(context->keyexpr_list,
+														1, rel, parent);
 
 	/* Assert that we're not leaking any old data during assignments below */
 	Assert(rel->rd_partcheckcxt == NULL);
 	Assert(rel->rd_partcheck == NIL);
+	Assert(rel->rd_keyexpr_list == NIL);
 
 	/*
 	 * Save a copy in the relcache.  The order of these operations is fairly
@@ -416,10 +444,16 @@ generate_partition_qual(Relation rel)
 										  RelationGetRelationName(rel));
 		oldcxt = MemoryContextSwitchTo(rel->rd_partcheckcxt);
 		rel->rd_partcheck = copyObject(result);
+		if (context != NULL)
+			rel->rd_keyexpr_list = copyObject(context->keyexpr_list);
 		MemoryContextSwitchTo(oldcxt);
 	}
 	else
+	{
 		rel->rd_partcheck = NIL;
+		rel->rd_keyexpr_list = NIL;
+	}
+
 	rel->rd_partcheckvalid = true;
 
 	/* Keep the parent locked until commit */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fd05615..f7527da 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1161,6 +1161,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	relation->rd_pdcxt = NULL;
 	relation->rd_pddcxt = NULL;
 	relation->rd_partcheck = NIL;
+	relation->rd_keyexpr_list = NIL;
 	relation->rd_partcheckvalid = false;
 	relation->rd_partcheckcxt = NULL;
 
@@ -6041,6 +6042,7 @@ load_relcache_init_file(bool shared)
 		rel->rd_pdcxt = NULL;
 		rel->rd_pddcxt = NULL;
 		rel->rd_partcheck = NIL;
+		rel->rd_keyexpr_list = NIL;
 		rel->rd_partcheckvalid = false;
 		rel->rd_partcheckcxt = NULL;
 		rel->rd_indexprs = NIL;
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69..8dd457c 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -497,6 +497,15 @@ typedef struct ResultRelInfo
 	ExprState  *ri_PartitionCheckExpr;
 
 	/*
+	 * Partition Key expressions that used in PartitionCheckExpr
+	 * (NULL if not set up yet)
+	 */
+	List  *ri_PartitionKeyExpr;
+
+	/* Used to evaluate the PartitionCheckExpr (NULL if not set up yet) */
+	TupleTableSlot *ri_PartitionKeySlot;
+
+	/*
 	 * Information needed by tuple routing target relations
 	 *
 	 * RootResultRelInfo gives the target relation mentioned in the query, if
diff --git a/src/include/partitioning/partbounds.h b/src/include/partitioning/partbounds.h
index ebf3ff1..800d4dc 100644
--- a/src/include/partitioning/partbounds.h
+++ b/src/include/partitioning/partbounds.h
@@ -86,7 +86,7 @@ extern uint64 compute_partition_hash_value(int partnatts, FmgrInfo *partsupfunc,
 										   Oid *partcollation,
 										   Datum *values, bool *isnull);
 extern List *get_qual_from_partbound(Relation rel, Relation parent,
-									 PartitionBoundSpec *spec);
+									 PartitionBoundSpec *spec, PartKeyContext *context);
 extern PartitionBoundInfo partition_bounds_create(PartitionBoundSpec **boundspecs,
 												  int nparts, PartitionKey key, int **mapping);
 extern bool partition_bounds_equal(int partnatts, int16 *parttyplen,
diff --git a/src/include/partitioning/partdefs.h b/src/include/partitioning/partdefs.h
index d742b96..be5591c 100644
--- a/src/include/partitioning/partdefs.h
+++ b/src/include/partitioning/partdefs.h
@@ -23,4 +23,6 @@ typedef struct PartitionDescData *PartitionDesc;
 
 typedef struct PartitionDirectoryData *PartitionDirectory;
 
+typedef struct PartKeyContext PartKeyContext;
+
 #endif							/* PARTDEFS_H */
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index a451bfb..6f1bcbb 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -46,9 +46,15 @@ typedef struct PartitionKeyData
 	Oid		   *parttypcoll;
 }			PartitionKeyData;
 
+struct PartKeyContext
+{
+	int keycol_no;
+	List *keyexpr_list;
+	AttrNumber *keyexpr_varattno;
+};
 
 extern PartitionKey RelationGetPartitionKey(Relation rel);
-extern List *RelationGetPartitionQual(Relation rel);
+extern List *RelationGetPartitionQual(Relation rel, PartKeyContext *context);
 extern Expr *get_partition_qual_relid(Oid relid);
 
 /*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 774ac5b..111287b 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -143,6 +143,7 @@ typedef struct RelationData
 
 	/* data managed by RelationGetPartitionQual: */
 	List	   *rd_partcheck;	/* partition CHECK quals */
+	List	   *rd_keyexpr_list;	/* partition key exprs used in CHECK quals */
 	bool		rd_partcheckvalid;	/* true if list has been computed */
 	MemoryContext rd_partcheckcxt;	/* private cxt for rd_partcheck, if any */
 
-- 
2.7.2.windows.1

#21Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#16)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Mon, May 24, 2021 at 10:31 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 20, 2021 8:23 PM

This one seems a bit tough. ExecPartitionCheck() uses the generic expression
evaluation machinery like a black box, which means execPartition.c can't really
tweak/control the time spent evaluating partition constraints. Given that, we
may have to disable the caching when key->partexprs != NIL, unless we can
reasonably do what you are suggesting.

I did some research on the CHECK expression that ExecPartitionCheck() executes.

Thanks for looking into this and writing the patch. Your idea does
sound promising.

Currently, for a normal RANGE partition key, it will first generate a CHECK expression
like: [Keyexpression IS NOT NULL AND Keyexpression >= lower_bound AND Keyexpression < upper_bound].
In this case, Keyexpression will be re-evaluated for each comparison, which brings some overhead.

Instead, I think we can try the following steps:
1) extract the key expression from the CHECK expression
2) evaluate the key expression in advance
3) pass the result of the key expression to the partition CHECK
In this way, we only execute the key expression once, which looks more efficient.

I would have preferred this not to touch anything but
ExecPartitionCheck(), at least in the first version. Especially,
seeing that your patch touches partbounds.c makes me a bit nervous,
because the logic there is pretty complicated to begin with.

How about we start with something like the attached? It's the same
idea AFAICS, but implemented with a smaller footprint. We can
consider teaching relcache about this as the next step, if at all. I
haven't measured the performance, but maybe it's not as fast as yours,
so will need some fine-tuning. Can you please give it a read?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

ExecPartitionCheck-eval-partexprs-once-PoC_v1.patchapplication/octet-stream; name=ExecPartitionCheck-eval-partexprs-once-PoC_v1.patchDownload
diff --git a/src/backend/commands/copyfrom.c b/src/backend/commands/copyfrom.c
index 40a54ad0bd..50eb019da5 100644
--- a/src/backend/commands/copyfrom.c
+++ b/src/backend/commands/copyfrom.c
@@ -1042,7 +1042,8 @@ CopyFrom(CopyFromState cstate)
 				 */
 				if (resultRelInfo->ri_RelationDesc->rd_rel->relispartition &&
 					(proute == NULL || has_before_insert_row_trig))
-					ExecPartitionCheck(resultRelInfo, myslot, estate, true);
+					ExecPartitionCheck(resultRelInfo, myslot, estate, NULL,
+									   true);
 
 				/* Store the slot in the multi-insert buffer, when enabled. */
 				if (insertMethod == CIM_MULTI || leafpart_use_multi_insert)
diff --git a/src/backend/commands/trigger.c b/src/backend/commands/trigger.c
index 07c73f39de..8fc647cd0e 100644
--- a/src/backend/commands/trigger.c
+++ b/src/backend/commands/trigger.c
@@ -2288,7 +2288,7 @@ ExecBRInsertTriggers(EState *estate, ResultRelInfo *relinfo,
 			 * longer fits the partition.  Verify that.
 			 */
 			if (trigger->tgisclone &&
-				!ExecPartitionCheck(relinfo, slot, estate, false))
+				!ExecPartitionCheck(relinfo, slot, estate, NULL, false))
 				ereport(ERROR,
 						(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
 						 errmsg("moving row to another partition during a BEFORE FOR EACH ROW trigger is not supported"),
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b3ce4bae53..8ddbc6d0c2 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "commands/matview.h"
 #include "commands/trigger.h"
@@ -52,6 +53,8 @@
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "mb/pg_wchar.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -1686,6 +1689,32 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 	return NULL;
 }
 
+/*
+ * Replaces the occurrence of cxt->matchexpr in the expression tree given by
+ * 'node' by an OUTER var with provided attribute number.
+ */
+typedef struct
+{
+	Expr	   *matchexpr;
+	AttrNumber	varattno;
+} replace_partexpr_with_dummy_var_context;
+
+static Node *
+replace_partexpr_with_dummy_var(Node *node,
+								replace_partexpr_with_dummy_var_context *cxt)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (equal(node, cxt->matchexpr))
+		return (Node *) makeVar(OUTER_VAR, cxt->varattno,
+								exprType(node), exprTypmod(node),
+								exprCollation(node), 0);
+
+	return expression_tree_mutator(node, replace_partexpr_with_dummy_var,
+								   (void *) cxt);
+}
+
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
@@ -1695,7 +1724,7 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
  */
 bool
 ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
-				   EState *estate, bool emitError)
+				   EState *estate, Relation parentrel, bool emitError)
 {
 	ExprContext *econtext;
 	bool		success;
@@ -1716,6 +1745,50 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
 		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
 
+		/*
+		 * If we have been passed the parent relation, optimize the evaluation
+		 * of partition key expressions.  The way we do that is by replacing
+		 * any occurrences of the individual expressions in this relation's
+		 * partition constraint by dummy Vars marked as coming from the
+		 * "OUTER" relation.  Then when actually executing such modified
+		 * partition constraint tree, we feed the actual partition expression
+		 * values via econtext->ecxt_outertuple; see below.
+		 */
+		if (parentrel)
+		{
+			replace_partexpr_with_dummy_var_context cxt;
+			List   *partexprs = RelationGetPartitionKey(parentrel)->partexprs;
+			ListCell *lc;
+			AttrNumber attrno = 1;
+			TupleDesc	partexprs_tupdesc;
+
+			partexprs_tupdesc = CreateTemplateTupleDesc(list_length(partexprs));
+
+			partexprs = map_partition_varattnos(partexprs, 1,
+												resultRelInfo->ri_RelationDesc,
+												parentrel);
+			foreach(lc, partexprs)
+			{
+				Expr   *expr = lfirst(lc);
+
+				cxt.matchexpr = expr;
+				cxt.varattno = attrno;
+				qual = (List *) replace_partexpr_with_dummy_var((Node *) qual,
+																&cxt);
+
+				resultRelInfo->ri_partitionKeyExprs =
+					lappend(resultRelInfo->ri_partitionKeyExprs,
+							ExecPrepareExpr(expr, estate));
+				TupleDescInitEntry(partexprs_tupdesc, attrno, NULL,
+								   exprType((Node *) expr),
+								   exprTypmod((Node *) expr), 0);
+				attrno++;
+			}
+
+			resultRelInfo->ri_partitionKeyExprsSlot =
+				ExecInitExtraTupleSlot(estate, partexprs_tupdesc, &TTSOpsVirtual);
+		}
+
 		resultRelInfo->ri_PartitionCheckExpr = ExecPrepareCheck(qual, estate);
 		MemoryContextSwitchTo(oldcxt);
 	}
@@ -1729,6 +1802,33 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
+	if (resultRelInfo->ri_partitionKeyExprs)
+	{
+		TupleTableSlot *partexprs_slot = resultRelInfo->ri_partitionKeyExprsSlot;
+		Datum	*values;
+		bool	*isnull;
+		ListCell *lc;
+		AttrNumber attrno = 1;
+
+		Assert(partexprs_slot != NULL);
+		ExecClearTuple(partexprs_slot);
+
+		values = partexprs_slot->tts_values;
+		isnull = partexprs_slot->tts_isnull;
+
+		foreach(lc, resultRelInfo->ri_partitionKeyExprs)
+		{
+			ExprState   *partexpr = lfirst(lc);
+
+			values[attrno-1] = ExecEvalExprSwitchContext(partexpr, econtext,
+												  &isnull[attrno-1]);
+			attrno++;
+		}
+		ExecStoreVirtualTuple(partexprs_slot);
+
+		econtext->ecxt_outertuple = partexprs_slot;
+	}
+
 	/*
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 2348eb3154..eaabdc6ee1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -287,7 +287,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 	 * routing the tuple if it doesn't belong in the root table itself.
 	 */
 	if (rootResultRelInfo->ri_RelationDesc->rd_rel->relispartition)
-		ExecPartitionCheck(rootResultRelInfo, slot, estate, true);
+		ExecPartitionCheck(rootResultRelInfo, slot, estate, NULL, true);
 
 	/* start with the root partitioned table */
 	dispatch = pd[0];
@@ -298,6 +298,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 
 		CHECK_FOR_INTERRUPTS();
 
+		rel = dispatch->reldesc;
+
 		/*
 		 * Check if the saved partition accepts this tuple by evaluating its
 		 * partition constraint against the tuple.  If it does, we save a trip
@@ -317,7 +319,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 												rri->ri_PartitionTupleSlot);
 			else
 				tmpslot = rootslot;
-			if (ExecPartitionCheck(rri, tmpslot, estate, false))
+			if (ExecPartitionCheck(rri, tmpslot, estate, rel, false))
 			{
 				/* and restore ecxt's scantuple */
 				ecxt->ecxt_scantuple = ecxt_scantuple_saved;
@@ -327,7 +329,6 @@ ExecFindPartition(ModifyTableState *mtstate,
 			dispatch->lastPartInfo = rri = NULL;
 		}
 
-		rel = dispatch->reldesc;
 		partdesc = dispatch->partdesc;
 
 		/*
@@ -513,7 +514,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 					slot = rootslot;
 			}
 
-			ExecPartitionCheck(rri, slot, estate, true);
+			ExecPartitionCheck(rri, slot, estate, NULL, true);
 		}
 	}
 
diff --git a/src/backend/executor/execReplication.c b/src/backend/executor/execReplication.c
index 1e285e0349..9f37f55090 100644
--- a/src/backend/executor/execReplication.c
+++ b/src/backend/executor/execReplication.c
@@ -437,7 +437,7 @@ ExecSimpleRelationInsert(ResultRelInfo *resultRelInfo,
 		if (rel->rd_att->constr)
 			ExecConstraints(resultRelInfo, slot, estate);
 		if (rel->rd_rel->relispartition)
-			ExecPartitionCheck(resultRelInfo, slot, estate, true);
+			ExecPartitionCheck(resultRelInfo, slot, estate, NULL, true);
 
 		/* OK, store the tuple and create index entries for it */
 		simple_table_tuple_insert(resultRelInfo->ri_RelationDesc, slot);
@@ -505,7 +505,7 @@ ExecSimpleRelationUpdate(ResultRelInfo *resultRelInfo,
 		if (rel->rd_att->constr)
 			ExecConstraints(resultRelInfo, slot, estate);
 		if (rel->rd_rel->relispartition)
-			ExecPartitionCheck(resultRelInfo, slot, estate, true);
+			ExecPartitionCheck(resultRelInfo, slot, estate, NULL, true);
 
 		simple_table_tuple_update(rel, tid, slot, estate->es_snapshot,
 								  &update_indexes);
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 379b056310..c46fff0f7e 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -791,7 +791,7 @@ ExecInsert(ModifyTableState *mtstate,
 			(resultRelInfo->ri_RootResultRelInfo == NULL ||
 			 (resultRelInfo->ri_TrigDesc &&
 			  resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
-			ExecPartitionCheck(resultRelInfo, slot, estate, true);
+			ExecPartitionCheck(resultRelInfo, slot, estate, NULL, true);
 
 		if (onconflict != ONCONFLICT_NONE && resultRelInfo->ri_NumIndices > 0)
 		{
@@ -1710,7 +1710,7 @@ lreplace:;
 		 */
 		partition_constraint_failed =
 			resultRelationDesc->rd_rel->relispartition &&
-			!ExecPartitionCheck(resultRelInfo, slot, estate, false);
+			!ExecPartitionCheck(resultRelInfo, slot, estate, NULL, false);
 
 		if (!partition_constraint_failed &&
 			resultRelInfo->ri_WithCheckOptions != NIL)
diff --git a/src/backend/replication/logical/worker.c b/src/backend/replication/logical/worker.c
index 6ba447ea97..ee4f7bf06f 100644
--- a/src/backend/replication/logical/worker.c
+++ b/src/backend/replication/logical/worker.c
@@ -1747,7 +1747,7 @@ apply_handle_tuple_routing(ApplyExecutionData *edata,
 				 */
 				if (!partrel->rd_rel->relispartition ||
 					ExecPartitionCheck(partrelinfo, remoteslot_part, estate,
-									   false))
+									   NULL, false))
 				{
 					/*
 					 * Yes, so simply UPDATE the partition.  We don't call
diff --git a/src/include/executor/executor.h b/src/include/executor/executor.h
index 3dc03c913e..c44ac1d7f3 100644
--- a/src/include/executor/executor.h
+++ b/src/include/executor/executor.h
@@ -207,7 +207,8 @@ extern ResultRelInfo *ExecGetTriggerResultRel(EState *estate, Oid relid);
 extern void ExecConstraints(ResultRelInfo *resultRelInfo,
 							TupleTableSlot *slot, EState *estate);
 extern bool ExecPartitionCheck(ResultRelInfo *resultRelInfo,
-							   TupleTableSlot *slot, EState *estate, bool emitError);
+							   TupleTableSlot *slot, EState *estate,
+							   Relation parentrel, bool emitError);
 extern void ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
 										TupleTableSlot *slot, EState *estate);
 extern void ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69490..9984171576 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -496,6 +496,13 @@ typedef struct ResultRelInfo
 	/* partition check expression state (NULL if not set up yet) */
 	ExprState  *ri_PartitionCheckExpr;
 
+	/*
+	 * Information used by ExecPartitionCheck() to optimize some cases where
+	 * the parent's partition key contains arbitrary expressions.
+	 */
+	List	   *ri_partitionKeyExprs;
+	TupleTableSlot *ri_partitionKeyExprsSlot;
+
 	/*
 	 * Information needed by tuple routing target relations
 	 *
#22tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: houzj.fnst@fujitsu.com (#20)
RE: Skip partition tuple routing with constant partition key

From: Hou, Zhijie/侯 志杰 <houzj.fnst@fujitsu.com>

Ah, maybe I found the issue.
Attaching a new patch; please give it a try.

Thanks, it compiled cleanly without any warnings.

Regards
Takayuki Tsunakawa

#23houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#21)
1 attachment(s)
RE: Skip partition tuple routing with constant partition key

Hi Amit-san,

From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, May 24, 2021 4:27 PM

Hou-san,

On Mon, May 24, 2021 at 10:31 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 20, 2021 8:23 PM

This one seems a bit tough. ExecPartitionCheck() uses the generic
expression evaluation machinery like a black box, which means
execPartition.c can't really tweak/control the time spent evaluating
partition constraints. Given that, we may have to disable the
caching when key->partexprs != NIL, unless we can reasonably do what
you are suggesting.

I did some research on the CHECK expression that ExecPartitionCheck()
executes.

Thanks for looking into this and writing the patch. Your idea does sound
promising.

Currently, for a normal RANGE partition key, it will first generate a
CHECK expression like: [Keyexpression IS NOT NULL AND
Keyexpression >= lower_bound AND Keyexpression < upper_bound].
In this case, Keyexpression will be re-evaluated for each comparison,
which brings some overhead.

Instead, I think we can try the following steps:
1) extract the key expression from the CHECK expression
2) evaluate the key expression in advance
3) pass the result of the key expression to the partition CHECK
In this way, we only execute the key expression once, which looks more
efficient.

I would have preferred this not to touch anything but ExecPartitionCheck(), at
least in the first version. Especially, seeing that your patch touches
partbounds.c makes me a bit nervous, because the logic there is pretty
complicated to begin with.

Agreed.

How about we start with something like the attached? It's the same idea
AFAICS, but implemented with a smaller footprint. We can consider teaching
relcache about this as the next step, if at all. I haven't measured the
performance, but maybe it's not as fast as yours, so will need some fine-tuning.
Can you please give it a read?

Thanks for the patch; it looks more compact than mine.

After taking a quick look at the patch, I found a possible issue.
Currently, the patch does not search the parent's partition key expressions recursively.
For example, suppose we have a multi-level partition hierarchy where
table A is a partition of table B, and table B is a partition of table C.
It looks like, when inserting into table A, we do not replace the key expressions that come from table C.

To find table C, we might need to scan pg_inherits, but that seems too costly to me.
Instead, maybe we can reuse the existing logic that already scans pg_inherits in
generate_partition_qual(), although that change goes beyond ExecPartitionCheck(). I think we'd better
replace the key expressions of all parents and grandparents. Attaching a demo patch based on the
patch you posted earlier; I hope it helps.

Best regards,
houzj

Attachments:

0001-recursive-search-parent-partkeyexpr.patchapplication/octet-stream; name=0001-recursive-search-parent-partkeyexpr.patchDownload
From 49248c88201a4872d94d890f69ca529d49339a2c Mon Sep 17 00:00:00 2001
From: "houzj.fnst" <houzj.fnst@cn.fujitsu.com>
Date: Mon, 24 May 2021 20:44:31 +0800
Subject: [PATCH] -Recursive-search-parent-partkeyexpr

---
 src/backend/commands/tablecmds.c     |  4 +--
 src/backend/executor/execMain.c      |  7 ++--
 src/backend/optimizer/util/plancat.c |  2 +-
 src/backend/utils/cache/partcache.c  | 48 ++++++++++++++++++++++++----
 src/backend/utils/cache/relcache.c   |  2 ++
 src/include/utils/partcache.h        |  2 +-
 src/include/utils/rel.h              |  1 +
 7 files changed, 51 insertions(+), 15 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 11e91c4ad3..e48e4a5631 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -17280,7 +17280,7 @@ ATExecAttachPartition(List **wqueue, Relation rel, PartitionCmd *cmd,
 	 */
 	partBoundConstraint = get_qual_from_partbound(attachrel, rel, cmd->bound);
 	partConstraint = list_concat(partBoundConstraint,
-								 RelationGetPartitionQual(rel));
+								 RelationGetPartitionQual(rel, NULL));
 
 	/* Skip validation if there are no constraints to validate. */
 	if (partConstraint)
@@ -18082,7 +18082,7 @@ DetachAddConstraintIfNeeded(List **wqueue, Relation partRel)
 {
 	List	   *constraintExpr;
 
-	constraintExpr = RelationGetPartitionQual(partRel);
+	constraintExpr = RelationGetPartitionQual(partRel, NULL);
 	constraintExpr = (List *) eval_const_expressions(NULL, (Node *) constraintExpr);
 
 	/*
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 8ddbc6d0c2..91bbc5bc25 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1738,12 +1738,13 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	 */
 	if (resultRelInfo->ri_PartitionCheckExpr == NULL)
 	{
+		List *partexprs = NIL;
 		/*
 		 * Ensure that the qual tree and prepared expression are in the
 		 * query-lifespan context.
 		 */
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
-		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
+		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc, &partexprs);
 
 		/*
 		 * If we have been passed the parent relation, optimize the evaluation
@@ -1757,16 +1758,12 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 		if (parentrel)
 		{
 			replace_partexpr_with_dummy_var_context cxt;
-			List   *partexprs = RelationGetPartitionKey(parentrel)->partexprs;
 			ListCell *lc;
 			AttrNumber attrno = 1;
 			TupleDesc	partexprs_tupdesc;
 
 			partexprs_tupdesc = CreateTemplateTupleDesc(list_length(partexprs));
 
-			partexprs = map_partition_varattnos(partexprs, 1,
-												resultRelInfo->ri_RelationDesc,
-												parentrel);
 			foreach(lc, partexprs)
 			{
 				Expr   *expr = lfirst(lc);
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/optimizer/util/plancat.c
index c5194fdbbf..f274538ba5 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -2399,7 +2399,7 @@ set_baserel_partition_constraint(Relation relation, RelOptInfo *rel)
 	 * implicit-AND format, we'd have to explicitly convert it to explicit-AND
 	 * format and back again.
 	 */
-	partconstr = RelationGetPartitionQual(relation);
+	partconstr = RelationGetPartitionQual(relation, NULL);
 	if (partconstr)
 	{
 		partconstr = (List *) expression_planner((Expr *) partconstr);
diff --git a/src/backend/utils/cache/partcache.c b/src/backend/utils/cache/partcache.c
index 21e60f0c5e..69bc431a69 100644
--- a/src/backend/utils/cache/partcache.c
+++ b/src/backend/utils/cache/partcache.c
@@ -38,7 +38,7 @@
 
 
 static void RelationBuildPartitionKey(Relation relation);
-static List *generate_partition_qual(Relation rel);
+static List *generate_partition_qual(Relation rel, List **partkeys);
 
 /*
  * RelationGetPartitionKey -- get partition key, if relation is partitioned
@@ -273,13 +273,13 @@ RelationBuildPartitionKey(Relation relation)
  * Returns a list of partition quals
  */
 List *
-RelationGetPartitionQual(Relation rel)
+RelationGetPartitionQual(Relation rel, List **partkeys)
 {
 	/* Quick exit */
 	if (!rel->rd_rel->relispartition)
 		return NIL;
 
-	return generate_partition_qual(rel);
+	return generate_partition_qual(rel, partkeys);
 }
 
 /*
@@ -305,7 +305,7 @@ get_partition_qual_relid(Oid relid)
 		Relation	rel = relation_open(relid, AccessShareLock);
 		List	   *and_args;
 
-		and_args = generate_partition_qual(rel);
+		and_args = generate_partition_qual(rel, NULL);
 
 		/* Convert implicit-AND list format to boolean expression */
 		if (and_args == NIL)
@@ -333,7 +333,7 @@ get_partition_qual_relid(Oid relid)
  * into long-lived cache contexts, especially if we fail partway through.
  */
 static List *
-generate_partition_qual(Relation rel)
+generate_partition_qual(Relation rel, List **partkeys)
 {
 	HeapTuple	tuple;
 	MemoryContext oldcxt;
@@ -349,7 +349,12 @@ generate_partition_qual(Relation rel)
 
 	/* If we already cached the result, just return a copy */
 	if (rel->rd_partcheckvalid)
+	{
+		if (partkeys != NULL && rel->rd_keyexpr_list != NIL)
+			*partkeys = list_concat(*partkeys, copyObject(rel->rd_keyexpr_list));
+
 		return copyObject(rel->rd_partcheck);
+	}
 
 	/*
 	 * Grab at least an AccessShareLock on the parent table.  Must do this
@@ -381,9 +386,31 @@ generate_partition_qual(Relation rel)
 
 	ReleaseSysCache(tuple);
 
+	/* Save partition key expressions */
+	if (partkeys != NULL)
+	{
+		PartitionKey key = RelationGetPartitionKey(parent);
+		if (key != NULL && key->partexprs != NIL)
+			*partkeys = list_concat(*partkeys, copyObject(key->partexprs));
+	}
+
 	/* Add the parent's quals to the list (if any) */
 	if (parent->rd_rel->relispartition)
-		result = list_concat(generate_partition_qual(parent), my_qual);
+	{
+		List *temp_partkeys;
+		if (partkeys != NULL)
+		{
+			temp_partkeys = *partkeys;
+			*partkeys = NIL;
+		}
+
+		result = list_concat(generate_partition_qual(parent, partkeys), my_qual);
+
+		if (partkeys != NULL)
+		{
+			*partkeys = list_concat(*partkeys, temp_partkeys);
+		}
+	}
 	else
 		result = my_qual;
 
@@ -394,10 +421,13 @@ generate_partition_qual(Relation rel)
 	 * here.
 	 */
 	result = map_partition_varattnos(result, 1, rel, parent);
+	if (partkeys != NULL)
+		*partkeys = map_partition_varattnos(*partkeys, 1, rel, parent);
 
 	/* Assert that we're not leaking any old data during assignments below */
 	Assert(rel->rd_partcheckcxt == NULL);
 	Assert(rel->rd_partcheck == NIL);
+	Assert(rel->rd_keyexpr_list == NIL);
 
 	/*
 	 * Save a copy in the relcache.  The order of these operations is fairly
@@ -416,10 +446,16 @@ generate_partition_qual(Relation rel)
 										  RelationGetRelationName(rel));
 		oldcxt = MemoryContextSwitchTo(rel->rd_partcheckcxt);
 		rel->rd_partcheck = copyObject(result);
+		if (partkeys != NULL && *partkeys != NIL)
+			rel->rd_keyexpr_list = copyObject(*partkeys);
 		MemoryContextSwitchTo(oldcxt);
 	}
 	else
+	{
 		rel->rd_partcheck = NIL;
+		rel->rd_keyexpr_list = NIL;
+	}
+
 	rel->rd_partcheckvalid = true;
 
 	/* Keep the parent locked until commit */
diff --git a/src/backend/utils/cache/relcache.c b/src/backend/utils/cache/relcache.c
index fd05615e76..f7527da0d8 100644
--- a/src/backend/utils/cache/relcache.c
+++ b/src/backend/utils/cache/relcache.c
@@ -1161,6 +1161,7 @@ RelationBuildDesc(Oid targetRelId, bool insertIt)
 	relation->rd_pdcxt = NULL;
 	relation->rd_pddcxt = NULL;
 	relation->rd_partcheck = NIL;
+	relation->rd_keyexpr_list = NIL;
 	relation->rd_partcheckvalid = false;
 	relation->rd_partcheckcxt = NULL;
 
@@ -6041,6 +6042,7 @@ load_relcache_init_file(bool shared)
 		rel->rd_pdcxt = NULL;
 		rel->rd_pddcxt = NULL;
 		rel->rd_partcheck = NIL;
+		rel->rd_keyexpr_list = NIL;
 		rel->rd_partcheckvalid = false;
 		rel->rd_partcheckcxt = NULL;
 		rel->rd_indexprs = NIL;
diff --git a/src/include/utils/partcache.h b/src/include/utils/partcache.h
index a451bfb239..4762bc9fed 100644
--- a/src/include/utils/partcache.h
+++ b/src/include/utils/partcache.h
@@ -48,7 +48,7 @@ typedef struct PartitionKeyData
 
 
 extern PartitionKey RelationGetPartitionKey(Relation rel);
-extern List *RelationGetPartitionQual(Relation rel);
+extern List *RelationGetPartitionQual(Relation rel, List **partkeys);
 extern Expr *get_partition_qual_relid(Oid relid);
 
 /*
diff --git a/src/include/utils/rel.h b/src/include/utils/rel.h
index 774ac5b2b1..111287b130 100644
--- a/src/include/utils/rel.h
+++ b/src/include/utils/rel.h
@@ -143,6 +143,7 @@ typedef struct RelationData
 
 	/* data managed by RelationGetPartitionQual: */
 	List	   *rd_partcheck;	/* partition CHECK quals */
+	List	   *rd_keyexpr_list;	/* partition key exprs used in CHECK quals */
 	bool		rd_partcheckvalid;	/* true if list has been computed */
 	MemoryContext rd_partcheckcxt;	/* private cxt for rd_partcheck, if any */
 
-- 
2.18.4

#24Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#23)
2 attachment(s)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Mon, May 24, 2021 at 10:15 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, May 24, 2021 4:27 PM

On Mon, May 24, 2021 at 10:31 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

Currently, for a normal RANGE partition key, it will first generate a
CHECK expression like: [Keyexpression IS NOT NULL AND Keyexpression >=
lowerbound AND Keyexpression < upperbound].

In this case, the Keyexpression will be re-executed, which brings some
overhead.

Instead, I think we can try the following steps:
1) extract the Keyexpression from the CHECK expression
2) evaluate the key expression in advance
3) pass the result of the key expression to the partition CHECK
In this way, we only execute the key expression once, which looks more
efficient.

I would have preferred this not to touch anything but ExecPartitionCheck(), at
least in the first version. Especially, seeing that your patch touches
partbounds.c makes me a bit nervous, because the logic there is pretty
complicated to begin with.

Agreed.

How about we start with something like the attached? It's the same idea
AFAICS, but implemented with a smaller footprint. We can consider teaching
relcache about this as the next step, if at all. I haven't measured the
performance, but maybe it's not as fast as yours, so will need some fine-tuning.
Can you please give it a read?

Thanks for the patch; it looks more compact than mine.

After taking a quick look at the patch, I found a possible issue.
Currently, the patch does not search the parents' partition key expressions recursively.
For example, if we have a multi-level partition hierarchy:
Table A is a partition of Table B, and Table B is a partition of Table C.
It looks like, on an insert into Table A, we do not replace the key expressions that come from Table C.

Good catch! Though I was relieved to realize that it's not *wrong*
per se (it does not produce an incorrect result), only
*slower* than if the patch were careful enough to replace all the
parents' key expressions.

If we want to get to Table C, we might need to look up pg_inherits, but that seems too costly to me.
Instead, maybe we can use the existing logic that already scans pg_inherits in
generate_partition_qual(), although that change falls outside ExecPartitionCheck(). I think we'd better
replace the key expressions of all parents and grandparents. Attaching a demo patch based on the
patch you posted earlier; I hope it helps.
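The recursive collection of ancestors' key expressions described above can be modeled in miniature as a walk up the parent chain (a hypothetical, self-contained stand-in; the actual patch threads a List through generate_partition_qual() and carries real expression trees, not strings):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical miniature model of a partition hierarchy: each level
 * points at its parent and records the key expression that level
 * partitions by (a string here, a Node tree in PostgreSQL). */
typedef struct Rel
{
	struct Rel *parent;
	const char *keyexpr;		/* key expression, or NULL if none */
} Rel;

/*
 * Collect the key expressions of every ancestor of 'rel', nearest
 * parent first, the way generate_partition_qual() recurses toward the
 * root of the hierarchy.  Returns the number collected.
 */
static int
collect_key_exprs(const Rel *rel, const char **out, int max)
{
	int			n = 0;
	const Rel  *p;

	for (p = rel->parent; p != NULL && n < max; p = p->parent)
	{
		if (p->keyexpr != NULL)
			out[n++] = p->keyexpr;
	}
	return n;
}
```

With Table A a partition of B and B a partition of C, collecting from A yields both B's and C's key expressions, which is exactly the set that can appear in A's partition constraint.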

Thanks.

Though again, I think we can do this without changing the relcache
interface, such as RelationGetPartitionQual().

PartitionTupleRouting has all the information that's needed here.
Each partitioned table involved in routing a tuple to the leaf
partition has a PartitionDispatch struct assigned to it. That struct
contains the PartitionKey and we can access partexprs from there. We
can arrange to assemble them into a single list that is saved to a
given partition's ResultRelInfo, that is, after converting the
expressions to have partition attribute numbers. I tried that in the
attached updated patch; see the 0002-* patch.
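The benefit of replacing the key expressions with pre-computed values can be seen in miniature: without the replacement, the same expression is evaluated once per clause of the partition constraint; with it, once per row. Below is a self-contained model of that difference, where key_expr() is a hypothetical stand-in for something like partition_func(a) from the examples in this thread, not actual executor code:

```c
#include <assert.h>

/* Hypothetical stand-in for an expensive partition key expression. */
static int	ncalls = 0;

static int
key_expr(int a)
{
	ncalls++;
	return a / 2;				/* arbitrary computation */
}

/* Without the optimization: the key expression occurs in, and is
 * evaluated by, every clause of the partition constraint. */
static int
check_naive(int a, int lo, int hi)
{
	return key_expr(a) >= lo && key_expr(a) < hi;
}

/* With the optimization: evaluate once, then feed the cached value to
 * every clause (the 0002 patch does this by exposing the values through
 * a virtual slot in econtext->ecxt_outertuple). */
static int
check_precomputed(int a, int lo, int hi)
{
	int			k = key_expr(a);

	return k >= lo && k < hi;
}
```

For a two-clause constraint this halves the number of evaluations; with an IS NOT NULL clause as well, it saves two of three.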

Regarding the first patch to make ExecFindPartition() cache last used
partition, I noticed that it only worked for the bottom-most parent in
a multi-level partition tree, because only leaf partitions were
assigned to dispatch->lastPartitionInfo. I have fixed the earlier
patch to also save non-leaf partitions and their corresponding
PartitionDispatch structs so that parents of all levels can use this
caching feature. The patch has to become somewhat complex as a
result, but hopefully not too unreadable.
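In outline, the caching added to ExecFindPartition() behaves like the following stand-alone model (hypothetical types; the real patch re-validates the saved partition with ExecPartitionCheck() on the converted tuple rather than with a direct bound comparison):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature model of one range-partitioned level: each
 * partition accepts keys in [lower, upper). */
typedef struct Partition
{
	int			lower;			/* inclusive lower bound */
	int			upper;			/* exclusive upper bound */
} Partition;

/* Stand-in for PartitionDispatch with the patch's saved-partition field. */
typedef struct Dispatch
{
	Partition  *parts;
	int			nparts;
	Partition  *saved;			/* partition the previous row went to */
	int			searches;		/* how many full searches were needed */
} Dispatch;

static int
bound_check(const Partition *p, int key)
{
	return key >= p->lower && key < p->upper;
}

/*
 * Route one key.  If the partition chosen for the previous row still
 * accepts this key, reuse it; otherwise fall back to scanning all the
 * bounds (a stand-in for get_partition_for_tuple()).
 */
static Partition *
route(Dispatch *d, int key)
{
	int			i;

	if (d->saved != NULL && bound_check(d->saved, key))
		return d->saved;		/* cache hit: no search at all */

	d->searches++;				/* cache miss: do the full search */
	for (i = 0; i < d->nparts; i++)
	{
		if (bound_check(&d->parts[i], key))
		{
			d->saved = &d->parts[i];
			return d->saved;
		}
	}
	return NULL;				/* no partition accepts this key */
}
```

When consecutive rows fall into the same partition, only the first pays for the search, which is why the caching is worthwhile for range and list strategies where sorted or constant input tends to stay in one partition.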

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

0001-ExecFindPartition-cache-last-used-partition-v3.patchapplication/octet-stream; name=0001-ExecFindPartition-cache-last-used-partition-v3.patchDownload
From cf659e0b221ddc04f5851b91518cc123be547f21 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 25 May 2021 22:48:47 +0900
Subject: [PATCH 1/2] ExecFindPartition: cache last used partition v3

---
 src/backend/executor/execPartition.c | 198 ++++++++++++++++++++++-----
 1 file changed, 162 insertions(+), 36 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..1d0d8e63f6 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,16 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * savedPartInfo
+ *		If non-NULL, ResultRelInfo for the partition that was most recently
+ *		chosen as the routing target; ExecFindPartition() checks if the
+ *		same one can be used for the current row before applying the tuple-
+ *		routing algorithm to it.
+ *
+ * savedDispatchInfo
+ *		If non-NULL, PartititionDispatch for the sub-partitioned partition
+ *		that was most recently chosen as the routing target
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +160,8 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+	ResultRelInfo *savedPartInfo;
+	PartitionDispatch savedDispatchInfo;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -234,6 +246,82 @@ ExecSetupPartitionTupleRouting(EState *estate, Relation rel)
 	return proute;
 }
 
+/*
+ * Remember this partition for the next tuple inserted into this parent; see
+ * CanUseSavedPartitionForTuple() for how it's decided whether a tuple can
+ * indeed reuse this partition.
+ *
+ * Do this only if we have range/list partitions, because only
+ * in that case it's conceivable that consecutively inserted rows
+ * tend to go into the same partition.
+ */
+static inline void
+SavePartitionForNextTuple(PartitionDispatch dispatch,
+						  ResultRelInfo *partInfo,
+						  PartitionDispatch dispatchInfo)
+{
+	if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+		 dispatch->key->strategy == PARTITION_STRATEGY_LIST))
+	{
+		dispatch->savedPartInfo = partInfo;
+		dispatch->savedDispatchInfo = dispatchInfo;
+	}
+}
+
+/*
+ * Check if the saved partition accepts this tuple by evaluating its
+ * partition constraint against the tuple.  If it does, we save a trip
+ * to get_partition_for_tuple(), which can be a slightly more expensive
+ * way to get the same partition, especially if there are many
+ * partitions to search through.
+ */
+static inline bool
+CanUseSavedPartitionForTuple(PartitionDispatch dispatch,
+							 TupleTableSlot *rootslot,
+							 EState *estate)
+{
+	if (dispatch->savedPartInfo)
+	{
+		ResultRelInfo *rri;
+		TupleTableSlot *tmpslot;
+		TupleConversionMap *map;
+
+		rri = dispatch->savedPartInfo;
+		map = rri->ri_RootToPartitionMap;
+		if (map)
+			tmpslot = execute_attr_map_slot(map->attrMap, rootslot,
+											rri->ri_PartitionTupleSlot);
+		else
+			tmpslot = rootslot;
+		return ExecPartitionCheck(rri, tmpslot, estate, false);
+	}
+
+	return false;
+}
+
+/*
+ * Convert the tuple to a sub-partitioned partition's layout, if needed.
+ */
+static inline TupleTableSlot *
+ConvertTupleToPartition(PartitionDispatch dispatch,
+						TupleTableSlot *slot,
+						TupleTableSlot **parent_slot)
+{
+	if (dispatch->tupslot)
+	{
+		AttrMap    *map = dispatch->tupmap;
+		TupleTableSlot *tempslot = *parent_slot;
+
+		*parent_slot = dispatch->tupslot;
+		slot = execute_attr_map_slot(map, slot, *parent_slot);
+
+		if (tempslot != NULL)
+			ExecClearTuple(tempslot);
+	}
+
+	return slot;
+}
+
 /*
  * ExecFindPartition -- Return the ResultRelInfo for the leaf partition that
  * the tuple contained in *slot should belong to.
@@ -292,6 +380,34 @@ ExecFindPartition(ModifyTableState *mtstate,
 		CHECK_FOR_INTERRUPTS();
 
 		rel = dispatch->reldesc;
+
+		if (CanUseSavedPartitionForTuple(dispatch, rootslot, estate))
+		{
+			/* If the saved partition is leaf partition, just return it. */
+			if (dispatch->savedDispatchInfo == NULL)
+			{
+				/* Restore ecxt's scantuple before returning. */
+				ecxt->ecxt_scantuple = ecxt_scantuple_saved;
+				MemoryContextSwitchTo(oldcxt);
+				return dispatch->savedPartInfo;
+			}
+			else
+			{
+				/*
+				 * Saved partition is sub-partitioned, so continue the loop to
+				 * find the next level partition.
+				 */
+				dispatch = dispatch->savedDispatchInfo;
+				slot = ConvertTupleToPartition(dispatch, slot, &myslot);
+				continue;
+			}
+		}
+		else
+		{
+			dispatch->savedPartInfo = rri = NULL;
+			dispatch->savedDispatchInfo = NULL;
+		}
+
 		partdesc = dispatch->partdesc;
 
 		/*
@@ -372,6 +488,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 			}
 			Assert(rri != NULL);
 
+			SavePartitionForNextTuple(dispatch, rri, NULL);
+
 			/* Signal to terminate the loop */
 			dispatch = NULL;
 		}
@@ -382,6 +500,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 			 */
 			if (likely(dispatch->indexes[partidx] >= 0))
 			{
+				PartitionDispatch subdispatch;
+
 				/* Already built. */
 				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
 
@@ -391,7 +511,11 @@ ExecFindPartition(ModifyTableState *mtstate,
 				 * Move down to the next partition level and search again
 				 * until we find a leaf partition that matches this tuple
 				 */
-				dispatch = pd[dispatch->indexes[partidx]];
+				subdispatch = pd[dispatch->indexes[partidx]];
+
+				SavePartitionForNextTuple(dispatch, rri, subdispatch);
+
+				dispatch = subdispatch;
 			}
 			else
 			{
@@ -411,24 +535,13 @@ ExecFindPartition(ModifyTableState *mtstate,
 					   dispatch->indexes[partidx] < proute->num_dispatch);
 
 				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
-				dispatch = subdispatch;
-			}
 
-			/*
-			 * Convert the tuple to the new parent's layout, if different from
-			 * the previous parent.
-			 */
-			if (dispatch->tupslot)
-			{
-				AttrMap    *map = dispatch->tupmap;
-				TupleTableSlot *tempslot = myslot;
-
-				myslot = dispatch->tupslot;
-				slot = execute_attr_map_slot(map, slot, myslot);
+				SavePartitionForNextTuple(dispatch, rri, subdispatch);
 
-				if (tempslot != NULL)
-					ExecClearTuple(tempslot);
+				dispatch = subdispatch;
 			}
+
+			slot = ConvertTupleToPartition(dispatch, slot, &myslot);
 		}
 
 		/*
@@ -858,27 +971,11 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 	return leaf_part_rri;
 }
 
-/*
- * ExecInitRoutingInfo
- *		Set up information needed for translating tuples between root
- *		partitioned table format and partition format, and keep track of it
- *		in PartitionTupleRouting.
- */
-static void
-ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					PartitionDispatch dispatch,
-					ResultRelInfo *partRelInfo,
-					int partidx,
-					bool is_borrowed_rel)
+static inline void
+InitRootToPartitionMap(ResultRelInfo *partRelInfo,
+					   ResultRelInfo *rootRelInfo,
+					   EState *estate)
 {
-	ResultRelInfo *rootRelInfo = partRelInfo->ri_RootResultRelInfo;
-	MemoryContext oldcxt;
-	int			rri_index;
-
-	oldcxt = MemoryContextSwitchTo(proute->memcxt);
-
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
@@ -907,6 +1004,30 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	}
 	else
 		partRelInfo->ri_PartitionTupleSlot = NULL;
+}
+
+/*
+ * ExecInitRoutingInfo
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format, and keep track of it
+ *		in PartitionTupleRouting.
+ */
+static void
+ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					PartitionTupleRouting *proute,
+					PartitionDispatch dispatch,
+					ResultRelInfo *partRelInfo,
+					int partidx,
+					bool is_borrowed_rel)
+{
+	ResultRelInfo *rootRelInfo = partRelInfo->ri_RootResultRelInfo;
+	MemoryContext oldcxt;
+	int			rri_index;
+
+	oldcxt = MemoryContextSwitchTo(proute->memcxt);
+
+	InitRootToPartitionMap(partRelInfo, rootRelInfo, estate);
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -1051,6 +1172,9 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		pd->tupslot = NULL;
 	}
 
+	pd->savedPartInfo = NULL;
+	pd->savedDispatchInfo = NULL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1094,6 +1218,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		ResultRelInfo *rri = makeNode(ResultRelInfo);
 
 		InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+		/* The map is needed in CanUseSavedPartitionForTuple(). */
+		InitRootToPartitionMap(rri, rootResultRelInfo, estate);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
-- 
2.24.1

0002-ExecPartitionCheck-pre-compute-partition-key-express.patchapplication/octet-stream; name=0002-ExecPartitionCheck-pre-compute-partition-key-express.patchDownload
From a7909c2b142ef255c034d7d319d09abdeede23eb Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 25 May 2021 22:55:12 +0900
Subject: [PATCH 2/2] ExecPartitionCheck: pre-compute partition key expression
 v2

---
 src/backend/executor/execMain.c      | 95 ++++++++++++++++++++++++++++
 src/backend/executor/execPartition.c | 52 +++++++++++++++
 src/include/nodes/execnodes.h        |  9 +++
 3 files changed, 156 insertions(+)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b3ce4bae53..1fc2a9fe82 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "commands/matview.h"
 #include "commands/trigger.h"
@@ -52,6 +53,8 @@
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "mb/pg_wchar.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -1686,6 +1689,32 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 	return NULL;
 }
 
+/*
+ * Replaces the occurrence of cxt->matchexpr in the expression tree given by
+ * 'node' by an OUTER var with provided attribute number.
+ */
+typedef struct
+{
+	Expr	   *matchexpr;
+	AttrNumber	varattno;
+} replace_partexpr_with_dummy_var_context;
+
+static Node *
+replace_partexpr_with_dummy_var(Node *node,
+								replace_partexpr_with_dummy_var_context *cxt)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (equal(node, cxt->matchexpr))
+		return (Node *) makeVar(OUTER_VAR, cxt->varattno,
+								exprType(node), exprTypmod(node),
+								exprCollation(node), 0);
+
+	return expression_tree_mutator(node, replace_partexpr_with_dummy_var,
+								   (void *) cxt);
+}
+
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
@@ -1716,6 +1745,45 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
 		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
 
+		/*
+		 * Optimize the evaluation of partition key expressions.  The way we do
+		 * that is by replacing any occurrences of the individual expressions
+		 * in this relation's partition constraint by dummy Vars marked as
+		 * coming from the "OUTER" relation.  Then when actually executing such
+		 * modified partition constraint tree, we feed the actual partition
+		 * expression values via econtext->ecxt_outertuple; see below.
+		 */
+		if (resultRelInfo->ri_partConstrKeyExprs)
+		{
+			List	  *partexprs = resultRelInfo->ri_partConstrKeyExprs;
+			ListCell  *lc;
+			AttrNumber attrno = 1;
+			TupleDesc	partexprs_tupdesc;
+			replace_partexpr_with_dummy_var_context cxt;
+
+			partexprs_tupdesc = CreateTemplateTupleDesc(list_length(partexprs));
+			foreach(lc, partexprs)
+			{
+				Expr   *expr = lfirst(lc);
+
+				cxt.matchexpr = expr;
+				cxt.varattno = attrno;
+				qual = (List *) replace_partexpr_with_dummy_var((Node *) qual,
+																&cxt);
+
+				resultRelInfo->ri_partConstrKeyExprStates =
+					lappend(resultRelInfo->ri_partConstrKeyExprStates,
+							ExecPrepareExpr(expr, estate));
+				TupleDescInitEntry(partexprs_tupdesc, attrno, NULL,
+								   exprType((Node *) expr),
+								   exprTypmod((Node *) expr), 0);
+				attrno++;
+			}
+
+			resultRelInfo->ri_partConstrKeyExprsSlot =
+				ExecInitExtraTupleSlot(estate, partexprs_tupdesc, &TTSOpsVirtual);
+		}
+
 		resultRelInfo->ri_PartitionCheckExpr = ExecPrepareCheck(qual, estate);
 		MemoryContextSwitchTo(oldcxt);
 	}
@@ -1729,6 +1797,33 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
+	if (resultRelInfo->ri_partConstrKeyExprStates)
+	{
+		TupleTableSlot *partexprs_slot = resultRelInfo->ri_partConstrKeyExprsSlot;
+		Datum	*values;
+		bool	*isnull;
+		ListCell *lc;
+		AttrNumber attrno = 1;
+
+		Assert(partexprs_slot != NULL);
+		ExecClearTuple(partexprs_slot);
+
+		values = partexprs_slot->tts_values;
+		isnull = partexprs_slot->tts_isnull;
+
+		foreach(lc, resultRelInfo->ri_partConstrKeyExprStates)
+		{
+			ExprState   *partexpr = lfirst(lc);
+
+			values[attrno-1] = ExecEvalExprSwitchContext(partexpr, econtext,
+												  &isnull[attrno-1]);
+			attrno++;
+		}
+		ExecStoreVirtualTuple(partexprs_slot);
+
+		econtext->ecxt_outertuple = partexprs_slot;
+	}
+
 	/*
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1d0d8e63f6..0e689b9c56 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -143,6 +143,13 @@ struct PartitionTupleRouting
  *		If non-NULL, PartititionDispatch for the sub-partitioned partition
  *		that was most recently chosen as the routing target
  *
+ * partconstr_keyexprs
+ *		List of expressions present in the partition keys of all ancestors
+ *		of this table including itself, mapped to have the attribute
+ *		numbers of this table.  The field is so named because all of these
+ *		expressions appear in the partition constraint of each of this
+ *		table's partitions.
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -162,6 +169,7 @@ typedef struct PartitionDispatchData
 	AttrMap    *tupmap;
 	ResultRelInfo *savedPartInfo;
 	PartitionDispatch savedDispatchInfo;
+	List	   *partconstr_keyexprs;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -1006,6 +1014,21 @@ InitRootToPartitionMap(ResultRelInfo *partRelInfo,
 		partRelInfo->ri_PartitionTupleSlot = NULL;
 }
 
+/*
+ * Save parent's partition key expressions in the partition ResultRelInfo
+ * after mapping them to have the partition's attribute numbers.
+ */
+static inline void
+InitPartitionConstraintKeyExprs(PartitionDispatch dispatch,
+								ResultRelInfo *partRelInfo)
+{
+	if (dispatch->partconstr_keyexprs)
+		partRelInfo->ri_partConstrKeyExprs =
+			map_partition_varattnos(dispatch->partconstr_keyexprs, 1,
+									partRelInfo->ri_RelationDesc,
+									dispatch->reldesc);
+}
+
 /*
  * ExecInitRoutingInfo
  *		Set up information needed for translating tuples between root
@@ -1057,6 +1080,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	partRelInfo->ri_CopyMultiInsertBuffer = NULL;
 
+	InitPartitionConstraintKeyExprs(dispatch, partRelInfo);
+
 	/*
 	 * Keep track of it in the PartitionTupleRouting->partitions array.
 	 */
@@ -1175,6 +1200,32 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->savedPartInfo = NULL;
 	pd->savedDispatchInfo = NULL;
 
+	if (pd->key->partexprs != NIL)
+	{
+		pd->partconstr_keyexprs = copyObject(pd->key->partexprs);
+		if (parent_pd)
+		{
+			List   *parent_keyexprs = parent_pd->partconstr_keyexprs;
+
+			if (parent_keyexprs && pd->tupmap)
+			{
+				bool	found_whole_row;
+
+				parent_keyexprs = (List *)
+					map_variable_attnos((Node *) parent_keyexprs, 1, 0,
+										pd->tupmap,
+										RelationGetForm(rel)->reltype,
+										&found_whole_row);
+			}
+			else if (parent_keyexprs)
+				parent_keyexprs = copyObject(parent_keyexprs);
+			pd->partconstr_keyexprs =
+				list_concat(pd->partconstr_keyexprs, parent_keyexprs);
+		}
+	}
+	else
+		pd->partconstr_keyexprs = NIL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1220,6 +1271,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
 		/* The map is needed in CanUseSavedPartitionForTuple(). */
 		InitRootToPartitionMap(rri, rootResultRelInfo, estate);
+		InitPartitionConstraintKeyExprs(pd, rri);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69490..7f1ce732ea 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -496,6 +496,15 @@ typedef struct ResultRelInfo
 	/* partition check expression state (NULL if not set up yet) */
 	ExprState  *ri_PartitionCheckExpr;
 
+	/*
+	 * Information used by ExecPartitionCheck() to optimize some cases where
+	 * the partition's ancestors' partition keys contain arbitrary
+	 * expressions.
+	 */
+	List	   *ri_partConstrKeyExprs;
+	List	   *ri_partConstrKeyExprStates;
+	TupleTableSlot *ri_partConstrKeyExprsSlot;
+
 	/*
 	 * Information needed by tuple routing target relations
 	 *
-- 
2.24.1

#25houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#24)
RE: Skip partition tuple routing with constant partition key

Hi Amit-san

From: Amit Langote <amitlangote09@gmail.com>
Sent: Tuesday, May 25, 2021 10:06 PM

Hou-san,

Thanks for the patch and It looks more compact than mine.

After taking a quick look at the patch, I found a possible issue.
Currently, the patch does not search the parent's partition key expression recursively.
For example, If we have multi-level partition:
Table A is partition of Table B, Table B is partition of Table C.
It looks like if insert into Table A, then we did not replace the key expression which come from Table C.

Good catch! Although, I was relieved to realize that it's not *wrong* per se, as
in it does not produce an incorrect result, but only
*slower* than if the patch was careful enough to replace all the parents' key
expressions.

If we want to get the Table C, we might need to use pg_inherit, but it costs too much to me.

Instead, maybe we can use the existing logic which already scanned the
pg_inherit in function generate_partition_qual(). Although this change
is out of ExecPartitionCheck(). I think we'd better replace all the
parents and grandparent...'s key expression. Attaching a demo patch based
on the patch you posted earlier. I hope it will help.

Thanks.

Though again, I think we can do this without changing the relcache interface,
such as RelationGetPartitionQual().

PartitionTupleRouting has all the information that's needed here.
Each partitioned table involved in routing a tuple to the leaf partition has a
PartitionDispatch struct assigned to it. That struct contains the PartitionKey
and we can access partexprs from there. We can arrange to assemble them
into a single list that is saved to a given partition's ResultRelInfo, that is, after
converting the expressions to have partition attribute numbers. I tried that in
the attached updated patch; see the 0002-* patch.

Thanks for the explanation!
Yeah, we can get all the parent table info from PartitionTupleRouting when inserting into a partitioned table.

But I have two concerns about using the information from PartitionTupleRouting to get the parent tables' key expressions:
1) It seems we do not initialize the PartitionTupleRouting when inserting directly into a partition (not a partitioned table).
I think it would be better to let the pre-computed key expression feature be usable in all possible cases, because it
could bring a nice performance improvement.

2) When inserting into a partitioned table that is itself a partition, the PartitionTupleRouting is initialized after ExecPartitionCheck.
For example:
create unlogged table parttable1 (a int, b int, c int, d int) partition by range (partition_func(a));
create unlogged table parttable1_a partition of parttable1 for values from (0) to (5000);
create unlogged table parttable1_b partition of parttable1 for values from (5000) to (10000);

create unlogged table parttable2 (a int, b int, c int, d int) partition by range (partition_func1(b));
create unlogged table parttable2_a partition of parttable2 for values from (0) to (5000);
create unlogged table parttable2_b partition of parttable2 for values from (5000) to (10000);

-- When inserting into parttable2, the code does the partition check before initializing the PartitionTupleRouting.
insert into parttable2 select 10001,100,10001,100;

Best regards,
houzj

#26Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#25)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Wed, May 26, 2021 at 10:05 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Tuesday, May 25, 2021 10:06 PM

Though again, I think we can do this without changing the relcache interface,
such as RelationGetPartitionQual().

PartitionTupleRouting has all the information that's needed here.
Each partitioned table involved in routing a tuple to the leaf partition has a
PartitionDispatch struct assigned to it. That struct contains the PartitionKey
and we can access partexprs from there. We can arrange to assemble them
into a single list that is saved to a given partition's ResultRelInfo, that is, after
converting the expressions to have partition attribute numbers. I tried that in
the attached updated patch; see the 0002-* patch.

Thanks for the explanation!
Yeah, we can get all the parent table info from PartitionTupleRouting when INSERTing into a partitioned table.

But I have two issues with using the information from PartitionTupleRouting to get the parent table's key expression:
1) It seems we do not initialize the PartitionTupleRouting when directly INSERTing into a partition (not a partitioned table).
I think it would be better to let the pre-compute-key-expression feature be used in all possible cases, because it
could bring a nice performance improvement.

2) When INSERTing into a partitioned table that is also a partition, the PartitionTupleRouting is initialized after ExecPartitionCheck.

Hmm, do we really need to optimize ExecPartitionCheck() when
partitions are directly inserted into? As also came up earlier in the
thread, we want to discourage users from doing that to begin with, so
it doesn't make much sense to spend our effort on that case.

Optimizing ExecPartitionCheck(), specifically your idea of
pre-computing the partition key expressions, only came up after
finding that the earlier patch to teach ExecFindPartition() to cache
partitions may benefit from it. IOW, optimizing ExecPartitionCheck()
for its own sake does not seem worthwhile, especially not if we'd need
to break module boundaries to make that happen.

Thoughts?

--
Amit Langote
EDB: http://www.enterprisedb.com

#27Zhihong Yu
zyu@yugabyte.com
In reply to: houzj.fnst@fujitsu.com (#25)
Re: Skip partition tuple routing with constant partition key

Hi, Amit:

For ConvertTupleToPartition() in
0001-ExecFindPartition-cache-last-used-partition-v3.patch:

+ if (tempslot != NULL)
+ ExecClearTuple(tempslot);

If tempslot and parent_slot point to the same slot, should ExecClearTuple()
still be called?

Cheers

#28houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#26)
RE: Skip partition tuple routing with constant partition key

Hi amit-san

From: Amit Langote <amitlangote09@gmail.com>
Sent: Wednesday, May 26, 2021 9:38 AM

Hou-san,

On Wed, May 26, 2021 at 10:05 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

Thanks for the explanation!
Yeah, we can get all the parent table info from PartitionTupleRouting when
INSERTing into a partitioned table.

But I have two issues with using the information from PartitionTupleRouting
to get the parent table's key expression:

1) It seems we do not initialize the PartitionTupleRouting when directly
INSERTing into a partition (not a partitioned table).
I think it would be better to let the pre-compute-key-expression
feature be used in all possible cases, because it could bring a nice
performance improvement.

2) When INSERTing into a partitioned table that is also a partition, the
PartitionTupleRouting is initialized after ExecPartitionCheck.

Hmm, do we really need to optimize ExecPartitionCheck() when partitions are
directly inserted into? As also came up earlier in the thread, we want to
discourage users from doing that to begin with, so it doesn't make much sense
to spend our effort on that case.

Optimizing ExecPartitionCheck(), specifically your idea of pre-computing the
partition key expressions, only came up after finding that the earlier patch to
teach ExecFindPartition() to cache partitions may benefit from it. IOW,
optimizing ExecPartitionCheck() for its own sake does not seem worthwhile,
especially not if we'd need to break module boundaries to make that happen.

Thoughts?

OK, I see the point, thanks for the explanation.
Let's try to move forward.

About teaching the relcache to cache the target partition:

David-san suggested caching the partidx in PartitionDesc,
which would need looping and checking the cached value at each level.
I was thinking: could we instead cache a partidx list [1, 2, 3], then follow
the list to get to the last partition and do the partition CHECK only for that
last partition? If anything unexpected happens, we can return to the original
table and redo the tuple routing without using the cached indexes.
What do you think?

Best regards,
houzj

#29Amit Langote
amitlangote09@gmail.com
In reply to: Zhihong Yu (#27)
2 attachment(s)
Re: Skip partition tuple routing with constant partition key

Hi,

On Thu, May 27, 2021 at 2:30 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Hi, Amit:

For ConvertTupleToPartition() in 0001-ExecFindPartition-cache-last-used-partition-v3.patch:

+ if (tempslot != NULL)
+ ExecClearTuple(tempslot);

If tempslot and parent_slot point to the same slot, should ExecClearTuple() still be called ?

Yeah, we decided back in 1c9bb02d8ec that it's necessary to free the
slot if it's the same slot as a parent partition's
PartitionDispatch->tupslot ("freeing parent's copy of the tuple").
Maybe we don't need this parent-slot-clearing anymore due to code
restructuring over the last 3 years, but that will have to be a
separate patch.

I hope the attached updated patch makes it a bit clearer what's
going on.  I refactored more of the code in ExecFindPartition() to
make this patch a bit more readable.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

0002-ExecPartitionCheck-pre-compute-partition-key-express.patchapplication/octet-stream; name=0002-ExecPartitionCheck-pre-compute-partition-key-express.patchDownload
From fbe70854e03a75d9e149d303a09b79d5093aba43 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 25 May 2021 22:55:12 +0900
Subject: [PATCH 2/2] ExecPartitionCheck: pre-compute partition key expression
 v2

---
 src/backend/executor/execMain.c      | 95 ++++++++++++++++++++++++++++
 src/backend/executor/execPartition.c | 46 ++++++++++++++
 src/include/nodes/execnodes.h        |  9 +++
 3 files changed, 150 insertions(+)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b3ce4bae53..1fc2a9fe82 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "commands/matview.h"
 #include "commands/trigger.h"
@@ -52,6 +53,8 @@
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "mb/pg_wchar.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -1686,6 +1689,32 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 	return NULL;
 }
 
+/*
+ * Replaces the occurrence of cxt->matchexpr in the expression tree given by
+ * 'node' by an OUTER var with provided attribute number.
+ */
+typedef struct
+{
+	Expr	   *matchexpr;
+	AttrNumber	varattno;
+} replace_partexpr_with_dummy_var_context;
+
+static Node *
+replace_partexpr_with_dummy_var(Node *node,
+								replace_partexpr_with_dummy_var_context *cxt)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (equal(node, cxt->matchexpr))
+		return (Node *) makeVar(OUTER_VAR, cxt->varattno,
+								exprType(node), exprTypmod(node),
+								exprCollation(node), 0);
+
+	return expression_tree_mutator(node, replace_partexpr_with_dummy_var,
+								   (void *) cxt);
+}
+
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
@@ -1716,6 +1745,45 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
 		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
 
+		/*
+		 * Optimize the evaluation of partition key expressions.  The way we do
+		 * that is by replacing any occurrences of the individual expressions
+		 * in this relation's partition constraint by dummy Vars marked as
+		 * coming from the "OUTER" relation.  Then when actually executing such
+		 * modified partition constraint tree, we feed the actual partition
+		 * expression values via econtext->ecxt_outertuple; see below.
+		 */
+		if (resultRelInfo->ri_partConstrKeyExprs)
+		{
+			List	  *partexprs = resultRelInfo->ri_partConstrKeyExprs;
+			ListCell  *lc;
+			AttrNumber attrno = 1;
+			TupleDesc	partexprs_tupdesc;
+			replace_partexpr_with_dummy_var_context cxt;
+
+			partexprs_tupdesc = CreateTemplateTupleDesc(list_length(partexprs));
+			foreach(lc, partexprs)
+			{
+				Expr   *expr = lfirst(lc);
+
+				cxt.matchexpr = expr;
+				cxt.varattno = attrno;
+				qual = (List *) replace_partexpr_with_dummy_var((Node *) qual,
+																&cxt);
+
+				resultRelInfo->ri_partConstrKeyExprStates =
+					lappend(resultRelInfo->ri_partConstrKeyExprStates,
+							ExecPrepareExpr(expr, estate));
+				TupleDescInitEntry(partexprs_tupdesc, attrno, NULL,
+								   exprType((Node *) expr),
+								   exprTypmod((Node *) expr), 0);
+				attrno++;
+			}
+
+			resultRelInfo->ri_partConstrKeyExprsSlot =
+				ExecInitExtraTupleSlot(estate, partexprs_tupdesc, &TTSOpsVirtual);
+		}
+
 		resultRelInfo->ri_PartitionCheckExpr = ExecPrepareCheck(qual, estate);
 		MemoryContextSwitchTo(oldcxt);
 	}
@@ -1729,6 +1797,33 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
+	if (resultRelInfo->ri_partConstrKeyExprStates)
+	{
+		TupleTableSlot *partexprs_slot = resultRelInfo->ri_partConstrKeyExprsSlot;
+		Datum	*values;
+		bool	*isnull;
+		ListCell *lc;
+		AttrNumber attrno = 1;
+
+		Assert(partexprs_slot != NULL);
+		ExecClearTuple(partexprs_slot);
+
+		values = partexprs_slot->tts_values;
+		isnull = partexprs_slot->tts_isnull;
+
+		foreach(lc, resultRelInfo->ri_partConstrKeyExprStates)
+		{
+			ExprState   *partexpr = lfirst(lc);
+
+			values[attrno-1] = ExecEvalExprSwitchContext(partexpr, econtext,
+												  &isnull[attrno-1]);
+			attrno++;
+		}
+		ExecStoreVirtualTuple(partexprs_slot);
+
+		econtext->ecxt_outertuple = partexprs_slot;
+	}
+
 	/*
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 3e6c8c58c4..cc1dd04b54 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -143,6 +143,13 @@ struct PartitionTupleRouting
  *		If non-NULL, PartititionDispatch for the sub-partitioned partition
  *		that was most recently chosen as the routing target
  *
+ * partconstr_keyexprs
+ *		List of expressions present in the partition keys of all ancestors
+ *		of this table including itself, mapped to have the attribute
+ *		numbers of this table.  The field is so named because all of these
+ *		expressions appear in the partition constraint of each of this
+ *		table's partitions.
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -162,6 +169,7 @@ typedef struct PartitionDispatchData
 	AttrMap    *tupmap;
 	ResultRelInfo *savedPartResultInfo;
 	PartitionDispatch savedPartDispatchInfo;
+	List	   *partconstr_keyexprs;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -995,6 +1003,21 @@ InitRootToPartitionMap(ResultRelInfo *partRelInfo,
 		partRelInfo->ri_PartitionTupleSlot = NULL;
 }
 
+/*
+ * Save parent's partition key expressions in the partition ResultRelInfo
+ * after mapping them to have the partition's attribute numbers.
+ */
+static inline void
+InitPartitionConstraintKeyExprs(PartitionDispatch dispatch,
+								ResultRelInfo *partRelInfo)
+{
+	if (dispatch->partconstr_keyexprs)
+		partRelInfo->ri_partConstrKeyExprs =
+			map_partition_varattnos(dispatch->partconstr_keyexprs, 1,
+									partRelInfo->ri_RelationDesc,
+									dispatch->reldesc);
+}
+
 /*
  * ExecInitRoutingInfo
  *		Set up information needed for translating tuples between root
@@ -1046,6 +1069,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	partRelInfo->ri_CopyMultiInsertBuffer = NULL;
 
+	InitPartitionConstraintKeyExprs(dispatch, partRelInfo);
+
 	/*
 	 * Keep track of it in the PartitionTupleRouting->partitions array.
 	 */
@@ -1164,6 +1189,26 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->savedPartResultInfo = NULL;
 	pd->savedPartDispatchInfo = NULL;
 
+	if (pd->key->partexprs != NIL)
+	{
+		pd->partconstr_keyexprs = copyObject(pd->key->partexprs);
+		if (parent_pd)
+		{
+			List   *parent_keyexprs = parent_pd->partconstr_keyexprs;
+
+			if (parent_keyexprs && pd->tupmap)
+				parent_keyexprs = map_partition_varattnos(parent_keyexprs, 1,
+														  rel,
+														  parent_pd->reldesc);
+			else if (parent_keyexprs)
+				parent_keyexprs = copyObject(parent_keyexprs);
+			pd->partconstr_keyexprs =
+				list_concat(pd->partconstr_keyexprs, parent_keyexprs);
+		}
+	}
+	else
+		pd->partconstr_keyexprs = NIL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1209,6 +1254,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
 		/* The map is needed in CanUseSavedPartitionForTuple(). */
 		InitRootToPartitionMap(rri, rootResultRelInfo, estate);
+		InitPartitionConstraintKeyExprs(pd, rri);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69490..7f1ce732ea 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -496,6 +496,15 @@ typedef struct ResultRelInfo
 	/* partition check expression state (NULL if not set up yet) */
 	ExprState  *ri_PartitionCheckExpr;
 
+	/*
+	 * Information used by ExecPartitionCheck() to optimize some cases where
+	 * the partition's ancestors' partition keys contain arbitrary
+	 * expressions.
+	 */
+	List	   *ri_partConstrKeyExprs;
+	List	   *ri_partConstrKeyExprStates;
+	TupleTableSlot *ri_partConstrKeyExprsSlot;
+
 	/*
 	 * Information needed by tuple routing target relations
 	 *
-- 
2.24.1

0001-ExecFindPartition-cache-last-used-partition-v4.patchapplication/octet-stream; name=0001-ExecFindPartition-cache-last-used-partition-v4.patchDownload
From 5a700674961a808e0f1048cf0300dcca6dc58be1 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 25 May 2021 22:48:47 +0900
Subject: [PATCH 1/2] ExecFindPartition: cache last used partition v4

---
 src/backend/executor/execPartition.c | 253 +++++++++++++++++++--------
 1 file changed, 184 insertions(+), 69 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..3e6c8c58c4 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,16 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * savedPartResultInfo
+ *		If non-NULL, ResultRelInfo for the partition that was most recently
+ *		chosen as the routing target; ExecFindPartition() checks if the
+ *		same one can be used for the current row before applying the tuple-
+ *		routing algorithm to it.
+ *
+ * savedPartDispatchInfo
+ *		If non-NULL, PartititionDispatch for the sub-partitioned partition
+ *		that was most recently chosen as the routing target
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +160,8 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+	ResultRelInfo *savedPartResultInfo;
+	PartitionDispatch savedPartDispatchInfo;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -234,6 +246,88 @@ ExecSetupPartitionTupleRouting(EState *estate, Relation rel)
 	return proute;
 }
 
+/*
+ * Remember this partition for the next tuple inserted into this parent; see
+ * CanUseSavedPartitionForTuple() for how it's decided whether a tuple can
+ * indeed reuse this partition.
+ *
+ * Do this only if we have range/list partitions, because only
+ * in that case it's conceivable that consecutively inserted rows
+ * tend to go into the same partition.
+ */
+static inline void
+SavePartitionForNextTuple(PartitionDispatch dispatch,
+						  ResultRelInfo *partInfo,
+						  PartitionDispatch dispatchInfo)
+{
+	if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+		 dispatch->key->strategy == PARTITION_STRATEGY_LIST))
+	{
+		dispatch->savedPartResultInfo = partInfo;
+		dispatch->savedPartDispatchInfo = dispatchInfo;
+	}
+}
+
+/*
+ * Check if the saved partition accepts this tuple by evaluating its
+ * partition constraint against the tuple.  If it does, we save a trip
+ * to get_partition_for_tuple(), which can be a slightly more expensive
+ * way to get the same partition, especially if there are many
+ * partitions to search through.
+ */
+static inline bool
+CanUseSavedPartitionForTuple(PartitionDispatch dispatch,
+							 TupleTableSlot *rootslot,
+							 EState *estate)
+{
+	if (dispatch->savedPartResultInfo)
+	{
+		ResultRelInfo *rri;
+		TupleTableSlot *tmpslot;
+		TupleConversionMap *map;
+
+		rri = dispatch->savedPartResultInfo;
+
+		/*
+		 * If needed, convert the root-parent layout tuple into the partition's
+		 * layout, because ExecPartitionCheck() expects to be passed the
+		 * latter.
+		 */
+		map = rri->ri_RootToPartitionMap;
+		if (map)
+			tmpslot = execute_attr_map_slot(map->attrMap, rootslot,
+											rri->ri_PartitionTupleSlot);
+		else
+			tmpslot = rootslot;
+		return ExecPartitionCheck(rri, tmpslot, estate, false);
+	}
+
+	return false;
+}
+
+/*
+ * Convert tuple to a given sub-partitioned partition's layout, if
+ * needed.
+ */
+static inline TupleTableSlot *
+ConvertTupleToPartition(PartitionDispatch dispatch,
+						TupleTableSlot *slot,
+						TupleTableSlot *parent_slot)
+{
+	if (dispatch->tupslot)
+	{
+		AttrMap    *map = dispatch->tupmap;
+
+		Assert(map != NULL);
+		slot = execute_attr_map_slot(map, slot, dispatch->tupslot);
+		/* Don't leak the previous parent's copy of the tuple. */
+		if (parent_slot)
+			ExecClearTuple(parent_slot);
+	}
+
+	return slot;
+}
+
 /*
  * ExecFindPartition -- Return the ResultRelInfo for the leaf partition that
  * the tuple contained in *slot should belong to.
@@ -292,6 +386,35 @@ ExecFindPartition(ModifyTableState *mtstate,
 		CHECK_FOR_INTERRUPTS();
 
 		rel = dispatch->reldesc;
+
+		if (CanUseSavedPartitionForTuple(dispatch, rootslot, estate))
+		{
+			/* If the saved partition is leaf partition, just return it. */
+			if (dispatch->savedPartDispatchInfo == NULL)
+			{
+				/* Restore ecxt's scantuple before returning. */
+				ecxt->ecxt_scantuple = ecxt_scantuple_saved;
+				MemoryContextSwitchTo(oldcxt);
+				return dispatch->savedPartResultInfo;
+			}
+			else
+			{
+				/*
+				 * Saved partition is sub-partitioned, so continue the loop to
+				 * find the next level partition.
+				 */
+				myslot = dispatch->tupslot;
+				dispatch = dispatch->savedPartDispatchInfo;
+				slot = ConvertTupleToPartition(dispatch, slot, myslot);
+				continue;
+			}
+		}
+		else
+		{
+			dispatch->savedPartResultInfo = rri = NULL;
+			dispatch->savedPartDispatchInfo = NULL;
+		}
+
 		partdesc = dispatch->partdesc;
 
 		/*
@@ -331,16 +454,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 		if (is_leaf)
 		{
 			/*
-			 * We've reached the leaf -- hurray, we're done.  Look to see if
-			 * we've already got a ResultRelInfo for this partition.
+			 * We've reached the leaf -- hurray, we're done.  Build the
+			 * ResultRelInfo for this partition if not already done.
 			 */
-			if (likely(dispatch->indexes[partidx] >= 0))
-			{
-				/* ResultRelInfo already built */
-				Assert(dispatch->indexes[partidx] < proute->num_partitions);
-				rri = proute->partitions[dispatch->indexes[partidx]];
-			}
-			else
+			if (unlikely(dispatch->indexes[partidx] < 0))
 			{
 				/*
 				 * If the partition is known in the owning ModifyTableState
@@ -370,65 +487,50 @@ ExecFindPartition(ModifyTableState *mtstate,
 												rootResultRelInfo, partidx);
 				}
 			}
+
+			Assert(dispatch->indexes[partidx] < proute->num_partitions);
+			rri = proute->partitions[dispatch->indexes[partidx]];
 			Assert(rri != NULL);
 
+			SavePartitionForNextTuple(dispatch, rri, NULL);
+
 			/* Signal to terminate the loop */
 			dispatch = NULL;
 		}
 		else
 		{
+			PartitionDispatch subdispatch;
+
 			/*
-			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 * Partition is a sub-partitioned table; get the PartitionDispatch.
+			 * Build it if not already done, passing the current one in as the
+			 * parent PartitionDispatch.
 			 */
-			if (likely(dispatch->indexes[partidx] >= 0))
-			{
-				/* Already built. */
-				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
-
-				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
-
-				/*
-				 * Move down to the next partition level and search again
-				 * until we find a leaf partition that matches this tuple
-				 */
-				dispatch = pd[dispatch->indexes[partidx]];
-			}
-			else
-			{
-				/* Not yet built. Do that now. */
-				PartitionDispatch subdispatch;
-
-				/*
-				 * Create the new PartitionDispatch.  We pass the current one
-				 * in as the parent PartitionDispatch
-				 */
+			if (unlikely(dispatch->indexes[partidx] < 0))
 				subdispatch = ExecInitPartitionDispatchInfo(estate,
 															proute,
 															partdesc->oids[partidx],
 															dispatch, partidx,
 															mtstate->rootResultRelInfo);
-				Assert(dispatch->indexes[partidx] >= 0 &&
-					   dispatch->indexes[partidx] < proute->num_dispatch);
-
-				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
-				dispatch = subdispatch;
-			}
+			Assert(dispatch->indexes[partidx] >= 0 &&
+				   dispatch->indexes[partidx] < proute->num_dispatch);
 
 			/*
-			 * Convert the tuple to the new parent's layout, if different from
-			 * the previous parent.
+			 * Move down to the next partition level and search again
+			 * until we find a leaf partition that matches this tuple
 			 */
-			if (dispatch->tupslot)
-			{
-				AttrMap    *map = dispatch->tupmap;
-				TupleTableSlot *tempslot = myslot;
+			subdispatch = pd[dispatch->indexes[partidx]];
+			rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
 
-				myslot = dispatch->tupslot;
-				slot = execute_attr_map_slot(map, slot, myslot);
+			/*
+			 * Save both the PartitionDispatch and the ResultRelInfo of
+			 * this partition to consider reusing for the next tuple.
+			 */
+			SavePartitionForNextTuple(dispatch, rri, subdispatch);
 
-				if (tempslot != NULL)
-					ExecClearTuple(tempslot);
-			}
+			myslot = dispatch->tupslot;
+			dispatch = subdispatch;
+			slot = ConvertTupleToPartition(dispatch, slot, myslot);
 		}
 
 		/*
@@ -858,27 +960,11 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 	return leaf_part_rri;
 }
 
-/*
- * ExecInitRoutingInfo
- *		Set up information needed for translating tuples between root
- *		partitioned table format and partition format, and keep track of it
- *		in PartitionTupleRouting.
- */
-static void
-ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					PartitionDispatch dispatch,
-					ResultRelInfo *partRelInfo,
-					int partidx,
-					bool is_borrowed_rel)
+static inline void
+InitRootToPartitionMap(ResultRelInfo *partRelInfo,
+					   ResultRelInfo *rootRelInfo,
+					   EState *estate)
 {
-	ResultRelInfo *rootRelInfo = partRelInfo->ri_RootResultRelInfo;
-	MemoryContext oldcxt;
-	int			rri_index;
-
-	oldcxt = MemoryContextSwitchTo(proute->memcxt);
-
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
@@ -907,6 +993,30 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	}
 	else
 		partRelInfo->ri_PartitionTupleSlot = NULL;
+}
+
+/*
+ * ExecInitRoutingInfo
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format, and keep track of it
+ *		in PartitionTupleRouting.
+ */
+static void
+ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					PartitionTupleRouting *proute,
+					PartitionDispatch dispatch,
+					ResultRelInfo *partRelInfo,
+					int partidx,
+					bool is_borrowed_rel)
+{
+	ResultRelInfo *rootRelInfo = partRelInfo->ri_RootResultRelInfo;
+	MemoryContext oldcxt;
+	int			rri_index;
+
+	oldcxt = MemoryContextSwitchTo(proute->memcxt);
+
+	InitRootToPartitionMap(partRelInfo, rootRelInfo, estate);
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -1051,6 +1161,9 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		pd->tupslot = NULL;
 	}
 
+	pd->savedPartResultInfo = NULL;
+	pd->savedPartDispatchInfo = NULL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1094,6 +1207,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		ResultRelInfo *rri = makeNode(ResultRelInfo);
 
 		InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+		/* The map is needed in CanUseSavedPartitionForTuple(). */
+		InitRootToPartitionMap(rri, rootResultRelInfo, estate);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
-- 
2.24.1

#30Zhihong Yu
zyu@yugabyte.com
In reply to: Amit Langote (#29)
Re: Skip partition tuple routing with constant partition key

On Wed, May 26, 2021 at 9:22 PM Amit Langote <amitlangote09@gmail.com>
wrote:

Hi,

On Thu, May 27, 2021 at 2:30 AM Zhihong Yu <zyu@yugabyte.com> wrote:

Hi, Amit:

For ConvertTupleToPartition() in
0001-ExecFindPartition-cache-last-used-partition-v3.patch:

+ if (tempslot != NULL)
+ ExecClearTuple(tempslot);

If tempslot and parent_slot point to the same slot, should
ExecClearTuple() still be called?

Yeah, we decided back in 1c9bb02d8ec that it's necessary to free the
slot if it's the same slot as a parent partition's
PartitionDispatch->tupslot ("freeing parent's copy of the tuple").
Maybe we don't need this parent-slot-clearing anymore due to code
restructuring over the last 3 years, but that will have to be a
separate patch.

I hope the attached updated patch makes it a bit clearer what's
going on.  I refactored more of the code in ExecFindPartition() to
make this patch a bit more readable.

--
Amit Langote
EDB: http://www.enterprisedb.com

Hi, Amit:
Thanks for the explanation.

For CanUseSavedPartitionForTuple, nit: you can check
!dispatch->savedPartResultInfo at the beginning and return early.
This would save some indentation.

Cheers

#31Amit Langote
amitlangote09@gmail.com
In reply to: Zhihong Yu (#30)
2 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Thu, May 27, 2021 at 1:55 PM Zhihong Yu <zyu@yugabyte.com> wrote:

For CanUseSavedPartitionForTuple, nit: you can check !dispatch->savedPartResultInfo at the beginning and return early.
This would save some indentation.

Sure, see the attached.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

v5-0002-ExecPartitionCheck-pre-compute-partition-key-expr.patchapplication/octet-stream; name=v5-0002-ExecPartitionCheck-pre-compute-partition-key-expr.patchDownload
From c3c7c9646a5223faae0977c8b5c91e178d3a3c18 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 25 May 2021 22:55:12 +0900
Subject: [PATCH v5 2/2] ExecPartitionCheck: pre-compute partition key
 expression v2

---
 src/backend/executor/execMain.c      | 95 ++++++++++++++++++++++++++++
 src/backend/executor/execPartition.c | 46 ++++++++++++++
 src/include/nodes/execnodes.h        |  9 +++
 3 files changed, 150 insertions(+)

diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index b3ce4bae53..1fc2a9fe82 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -44,6 +44,7 @@
 #include "access/transam.h"
 #include "access/xact.h"
 #include "catalog/namespace.h"
+#include "catalog/partition.h"
 #include "catalog/pg_publication.h"
 #include "commands/matview.h"
 #include "commands/trigger.h"
@@ -52,6 +53,8 @@
 #include "foreign/fdwapi.h"
 #include "jit/jit.h"
 #include "mb/pg_wchar.h"
+#include "nodes/makefuncs.h"
+#include "nodes/nodeFuncs.h"
 #include "miscadmin.h"
 #include "parser/parsetree.h"
 #include "storage/bufmgr.h"
@@ -1686,6 +1689,32 @@ ExecRelCheck(ResultRelInfo *resultRelInfo,
 	return NULL;
 }
 
+/*
+ * Replaces the occurrence of cxt->matchexpr in the expression tree given by
+ * 'node' by an OUTER var with provided attribute number.
+ */
+typedef struct
+{
+	Expr	   *matchexpr;
+	AttrNumber	varattno;
+} replace_partexpr_with_dummy_var_context;
+
+static Node *
+replace_partexpr_with_dummy_var(Node *node,
+								replace_partexpr_with_dummy_var_context *cxt)
+{
+	if (node == NULL)
+		return NULL;
+
+	if (equal(node, cxt->matchexpr))
+		return (Node *) makeVar(OUTER_VAR, cxt->varattno,
+								exprType(node), exprTypmod(node),
+								exprCollation(node), 0);
+
+	return expression_tree_mutator(node, replace_partexpr_with_dummy_var,
+								   (void *) cxt);
+}
+
 /*
  * ExecPartitionCheck --- check that tuple meets the partition constraint.
  *
@@ -1716,6 +1745,45 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 		MemoryContext oldcxt = MemoryContextSwitchTo(estate->es_query_cxt);
 		List	   *qual = RelationGetPartitionQual(resultRelInfo->ri_RelationDesc);
 
+		/*
+		 * Optimize the evaluation of partition key expressions.  The way we do
+		 * that is by replacing any occurrences of the individual expressions
+		 * in this relation's partition constraint by dummy Vars marked as
+		 * coming from the "OUTER" relation.  Then when actually executing such
+		 * modified partition constraint tree, we feed the actual partition
+		 * expression values via econtext->ecxt_outertuple; see below.
+		 */
+		if (resultRelInfo->ri_partConstrKeyExprs)
+		{
+			List	  *partexprs = resultRelInfo->ri_partConstrKeyExprs;
+			ListCell  *lc;
+			AttrNumber attrno = 1;
+			TupleDesc	partexprs_tupdesc;
+			replace_partexpr_with_dummy_var_context cxt;
+
+			partexprs_tupdesc = CreateTemplateTupleDesc(list_length(partexprs));
+			foreach(lc, partexprs)
+			{
+				Expr   *expr = lfirst(lc);
+
+				cxt.matchexpr = expr;
+				cxt.varattno = attrno;
+				qual = (List *) replace_partexpr_with_dummy_var((Node *) qual,
+																&cxt);
+
+				resultRelInfo->ri_partConstrKeyExprStates =
+					lappend(resultRelInfo->ri_partConstrKeyExprStates,
+							ExecPrepareExpr(expr, estate));
+				TupleDescInitEntry(partexprs_tupdesc, attrno, NULL,
+								   exprType((Node *) expr),
+								   exprTypmod((Node *) expr), 0);
+				attrno++;
+			}
+
+			resultRelInfo->ri_partConstrKeyExprsSlot =
+				ExecInitExtraTupleSlot(estate, partexprs_tupdesc, &TTSOpsVirtual);
+		}
+
 		resultRelInfo->ri_PartitionCheckExpr = ExecPrepareCheck(qual, estate);
 		MemoryContextSwitchTo(oldcxt);
 	}
@@ -1729,6 +1797,33 @@ ExecPartitionCheck(ResultRelInfo *resultRelInfo, TupleTableSlot *slot,
 	/* Arrange for econtext's scan tuple to be the tuple under test */
 	econtext->ecxt_scantuple = slot;
 
+	if (resultRelInfo->ri_partConstrKeyExprStates)
+	{
+		TupleTableSlot *partexprs_slot = resultRelInfo->ri_partConstrKeyExprsSlot;
+		Datum	*values;
+		bool	*isnull;
+		ListCell *lc;
+		AttrNumber attrno = 1;
+
+		Assert(partexprs_slot != NULL);
+		ExecClearTuple(partexprs_slot);
+
+		values = partexprs_slot->tts_values;
+		isnull = partexprs_slot->tts_isnull;
+
+		foreach(lc, resultRelInfo->ri_partConstrKeyExprStates)
+		{
+			ExprState   *partexpr = lfirst(lc);
+
+			values[attrno-1] = ExecEvalExprSwitchContext(partexpr, econtext,
+												  &isnull[attrno-1]);
+			attrno++;
+		}
+		ExecStoreVirtualTuple(partexprs_slot);
+
+		econtext->ecxt_outertuple = partexprs_slot;
+	}
+
 	/*
 	 * As in case of the catalogued constraints, we treat a NULL result as
 	 * success here, not a failure.
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index dd812ae3fc..75dc151535 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -143,6 +143,13 @@ struct PartitionTupleRouting
 *		If non-NULL, PartitionDispatch for the sub-partitioned partition
  *		that was most recently chosen as the routing target
  *
+ * partconstr_keyexprs
+ *		List of expressions present in the partition keys of all ancestors
+ *		of this table including itself, mapped to have the attribute
+ *		numbers of this table.  The field is so named because all of these
+ *		expressions appear in the partition constraint of each of this
+ *		table's partitions.
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -162,6 +169,7 @@ typedef struct PartitionDispatchData
 	AttrMap    *tupmap;
 	ResultRelInfo *savedPartResultInfo;
 	PartitionDispatch savedPartDispatchInfo;
+	List	   *partconstr_keyexprs;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -994,6 +1002,21 @@ InitRootToPartitionMap(ResultRelInfo *partRelInfo,
 		partRelInfo->ri_PartitionTupleSlot = NULL;
 }
 
+/*
+ * Save parent's partition key expressions in the partition ResultRelInfo
+ * after mapping them to have the partition's attribute numbers.
+ */
+static inline void
+InitPartitionConstraintKeyExprs(PartitionDispatch dispatch,
+								ResultRelInfo *partRelInfo)
+{
+	if (dispatch->partconstr_keyexprs)
+		partRelInfo->ri_partConstrKeyExprs =
+			map_partition_varattnos(dispatch->partconstr_keyexprs, 1,
+									partRelInfo->ri_RelationDesc,
+									dispatch->reldesc);
+}
+
 /*
  * ExecInitRoutingInfo
  *		Set up information needed for translating tuples between root
@@ -1045,6 +1068,8 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 
 	partRelInfo->ri_CopyMultiInsertBuffer = NULL;
 
+	InitPartitionConstraintKeyExprs(dispatch, partRelInfo);
+
 	/*
 	 * Keep track of it in the PartitionTupleRouting->partitions array.
 	 */
@@ -1163,6 +1188,26 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->savedPartResultInfo = NULL;
 	pd->savedPartDispatchInfo = NULL;
 
+	if (pd->key->partexprs != NIL)
+	{
+		pd->partconstr_keyexprs = copyObject(pd->key->partexprs);
+		if (parent_pd)
+		{
+			List   *parent_keyexprs = parent_pd->partconstr_keyexprs;
+
+			if (parent_keyexprs && pd->tupmap)
+				parent_keyexprs = map_partition_varattnos(parent_keyexprs, 1,
+														  rel,
+														  parent_pd->reldesc);
+			else if (parent_keyexprs)
+				parent_keyexprs = copyObject(parent_keyexprs);
+			pd->partconstr_keyexprs =
+				list_concat(pd->partconstr_keyexprs, parent_keyexprs);
+		}
+	}
+	else
+		pd->partconstr_keyexprs = NIL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1208,6 +1253,7 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
 		/* The map is needed in CanUseSavedPartitionForTuple(). */
 		InitRootToPartitionMap(rri, rootResultRelInfo, estate);
+		InitPartitionConstraintKeyExprs(pd, rri);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
diff --git a/src/include/nodes/execnodes.h b/src/include/nodes/execnodes.h
index 7795a69490..7f1ce732ea 100644
--- a/src/include/nodes/execnodes.h
+++ b/src/include/nodes/execnodes.h
@@ -496,6 +496,15 @@ typedef struct ResultRelInfo
 	/* partition check expression state (NULL if not set up yet) */
 	ExprState  *ri_PartitionCheckExpr;
 
+	/*
+	 * Information used by ExecPartitionCheck() to optimize some cases where
+	 * the partition's ancestors' partition keys contain arbitrary
+	 * expressions.
+	 */
+	List	   *ri_partConstrKeyExprs;
+	List	   *ri_partConstrKeyExprStates;
+	TupleTableSlot *ri_partConstrKeyExprsSlot;
+
 	/*
 	 * Information needed by tuple routing target relations
 	 *
-- 
2.24.1

v5-0001-ExecFindPartition-cache-last-used-partition.patchapplication/octet-stream; name=v5-0001-ExecFindPartition-cache-last-used-partition.patchDownload
From 47e87e416dc67b1a22b7189388b897ceda76e1fa Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 25 May 2021 22:48:47 +0900
Subject: [PATCH v5 1/2] ExecFindPartition: cache last used partition

---
 src/backend/executor/execPartition.c | 252 +++++++++++++++++++--------
 1 file changed, 183 insertions(+), 69 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..dd812ae3fc 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,16 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * savedPartResultInfo
+ *		If non-NULL, ResultRelInfo for the partition that was most recently
+ *		chosen as the routing target; ExecFindPartition() checks if the
+ *		same one can be used for the current row before applying the tuple-
+ *		routing algorithm to it.
+ *
+ * savedPartDispatchInfo
+ *		If non-NULL, PartitionDispatch for the sub-partitioned partition
+ *		that was most recently chosen as the routing target
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +160,8 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+	ResultRelInfo *savedPartResultInfo;
+	PartitionDispatch savedPartDispatchInfo;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -234,6 +246,87 @@ ExecSetupPartitionTupleRouting(EState *estate, Relation rel)
 	return proute;
 }
 
+/*
+ * Remember this partition for the next tuple inserted into this parent; see
+ * CanUseSavedPartitionForTuple() for how it's decided whether a tuple can
+ * indeed reuse this partition.
+ *
+ * Do this only if we have range/list partitions, because only
+ * in that case it's conceivable that consecutively inserted rows
+ * tend to go into the same partition.
+ */
+static inline void
+SavePartitionForNextTuple(PartitionDispatch dispatch,
+						  ResultRelInfo *partInfo,
+						  PartitionDispatch dispatchInfo)
+{
+	if ((dispatch->key->strategy == PARTITION_STRATEGY_RANGE ||
+		 dispatch->key->strategy == PARTITION_STRATEGY_LIST))
+	{
+		dispatch->savedPartResultInfo = partInfo;
+		dispatch->savedPartDispatchInfo = dispatchInfo;
+	}
+}
+
+/*
+ * Check if the saved partition accepts this tuple by evaluating its
+ * partition constraint against the tuple.  If it does, we save a trip
+ * to get_partition_for_tuple(), which can be a slightly more expensive
+ * way to get the same partition, especially if there are many
+ * partitions to search through.
+ */
+static inline bool
+CanUseSavedPartitionForTuple(PartitionDispatch dispatch,
+							 TupleTableSlot *rootslot,
+							 EState *estate)
+{
+	ResultRelInfo *rri;
+	TupleTableSlot *slot;
+	TupleConversionMap *map;
+
+	if (dispatch->savedPartResultInfo == NULL)
+		return false;
+
+	rri = dispatch->savedPartResultInfo;
+
+	/*
+	 * If needed, convert the root-parent layout tuple into the partition's
+	 * layout, because ExecPartitionCheck() expects to be passed the
+	 * latter.
+	 */
+	map = rri->ri_RootToPartitionMap;
+	if (map)
+		slot = execute_attr_map_slot(map->attrMap, rootslot,
+									 rri->ri_PartitionTupleSlot);
+	else
+		slot = rootslot;
+
+	return ExecPartitionCheck(rri, slot, estate, false);
+}
+
+/*
+ * Convert tuple to a given sub-partitioned partition's layout, if
+ * needed.
+ */
+static inline TupleTableSlot *
+ConvertTupleToPartition(PartitionDispatch dispatch,
+						TupleTableSlot *slot,
+						TupleTableSlot *parent_slot)
+{
+	if (dispatch->tupslot)
+	{
+		AttrMap    *map = dispatch->tupmap;
+
+		Assert(map != NULL);
+		slot = execute_attr_map_slot(map, slot, dispatch->tupslot);
+		/* Don't leak the previous parent's copy of the tuple. */
+		if (parent_slot)
+			ExecClearTuple(parent_slot);
+	}
+
+	return slot;
+}
+
 /*
  * ExecFindPartition -- Return the ResultRelInfo for the leaf partition that
  * the tuple contained in *slot should belong to.
@@ -292,6 +385,35 @@ ExecFindPartition(ModifyTableState *mtstate,
 		CHECK_FOR_INTERRUPTS();
 
 		rel = dispatch->reldesc;
+
+		if (CanUseSavedPartitionForTuple(dispatch, rootslot, estate))
+		{
+			/* If the saved partition is leaf partition, just return it. */
+			if (dispatch->savedPartDispatchInfo == NULL)
+			{
+				/* Restore ecxt's scantuple before returning. */
+				ecxt->ecxt_scantuple = ecxt_scantuple_saved;
+				MemoryContextSwitchTo(oldcxt);
+				return dispatch->savedPartResultInfo;
+			}
+			else
+			{
+				/*
+				 * Saved partition is sub-partitioned, so continue the loop to
+				 * find the next level partition.
+				 */
+				myslot = dispatch->tupslot;
+				dispatch = dispatch->savedPartDispatchInfo;
+				slot = ConvertTupleToPartition(dispatch, slot, myslot);
+				continue;
+			}
+		}
+		else
+		{
+			dispatch->savedPartResultInfo = rri = NULL;
+			dispatch->savedPartDispatchInfo = NULL;
+		}
+
 		partdesc = dispatch->partdesc;
 
 		/*
@@ -331,16 +453,10 @@ ExecFindPartition(ModifyTableState *mtstate,
 		if (is_leaf)
 		{
 			/*
-			 * We've reached the leaf -- hurray, we're done.  Look to see if
-			 * we've already got a ResultRelInfo for this partition.
+			 * We've reached the leaf -- hurray, we're done.  Build the
+			 * ResultRelInfo for this partition if not already done.
 			 */
-			if (likely(dispatch->indexes[partidx] >= 0))
-			{
-				/* ResultRelInfo already built */
-				Assert(dispatch->indexes[partidx] < proute->num_partitions);
-				rri = proute->partitions[dispatch->indexes[partidx]];
-			}
-			else
+			if (unlikely(dispatch->indexes[partidx] < 0))
 			{
 				/*
 				 * If the partition is known in the owning ModifyTableState
@@ -370,65 +486,50 @@ ExecFindPartition(ModifyTableState *mtstate,
 												rootResultRelInfo, partidx);
 				}
 			}
+
+			Assert(dispatch->indexes[partidx] < proute->num_partitions);
+			rri = proute->partitions[dispatch->indexes[partidx]];
 			Assert(rri != NULL);
 
+			SavePartitionForNextTuple(dispatch, rri, NULL);
+
 			/* Signal to terminate the loop */
 			dispatch = NULL;
 		}
 		else
 		{
+			PartitionDispatch subdispatch;
+
 			/*
-			 * Partition is a sub-partitioned table; get the PartitionDispatch
+			 * Partition is a sub-partitioned table; get the PartitionDispatch.
+			 * Build it if not already done, passing the current one in as the
+			 * parent PartitionDispatch.
 			 */
-			if (likely(dispatch->indexes[partidx] >= 0))
-			{
-				/* Already built. */
-				Assert(dispatch->indexes[partidx] < proute->num_dispatch);
-
-				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
-
-				/*
-				 * Move down to the next partition level and search again
-				 * until we find a leaf partition that matches this tuple
-				 */
-				dispatch = pd[dispatch->indexes[partidx]];
-			}
-			else
-			{
-				/* Not yet built. Do that now. */
-				PartitionDispatch subdispatch;
-
-				/*
-				 * Create the new PartitionDispatch.  We pass the current one
-				 * in as the parent PartitionDispatch
-				 */
+			if (unlikely(dispatch->indexes[partidx] < 0))
 				subdispatch = ExecInitPartitionDispatchInfo(estate,
 															proute,
 															partdesc->oids[partidx],
 															dispatch, partidx,
 															mtstate->rootResultRelInfo);
-				Assert(dispatch->indexes[partidx] >= 0 &&
-					   dispatch->indexes[partidx] < proute->num_dispatch);
-
-				rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
-				dispatch = subdispatch;
-			}
+			Assert(dispatch->indexes[partidx] >= 0 &&
+				   dispatch->indexes[partidx] < proute->num_dispatch);
 
 			/*
-			 * Convert the tuple to the new parent's layout, if different from
-			 * the previous parent.
+			 * Move down to the next partition level and search again
+			 * until we find a leaf partition that matches this tuple
 			 */
-			if (dispatch->tupslot)
-			{
-				AttrMap    *map = dispatch->tupmap;
-				TupleTableSlot *tempslot = myslot;
+			subdispatch = pd[dispatch->indexes[partidx]];
+			rri = proute->nonleaf_partitions[dispatch->indexes[partidx]];
 
-				myslot = dispatch->tupslot;
-				slot = execute_attr_map_slot(map, slot, myslot);
+			/*
+			 * Save both the PartitionDispatch and the ResultRelInfo of
+			 * this partition to consider reusing for the next tuple.
+			 */
+			SavePartitionForNextTuple(dispatch, rri, subdispatch);
 
-				if (tempslot != NULL)
-					ExecClearTuple(tempslot);
-			}
+			myslot = dispatch->tupslot;
+			dispatch = subdispatch;
+			slot = ConvertTupleToPartition(dispatch, slot, myslot);
 		}
 
 		/*
@@ -858,27 +959,11 @@ ExecInitPartitionInfo(ModifyTableState *mtstate, EState *estate,
 	return leaf_part_rri;
 }
 
-/*
- * ExecInitRoutingInfo
- *		Set up information needed for translating tuples between root
- *		partitioned table format and partition format, and keep track of it
- *		in PartitionTupleRouting.
- */
-static void
-ExecInitRoutingInfo(ModifyTableState *mtstate,
-					EState *estate,
-					PartitionTupleRouting *proute,
-					PartitionDispatch dispatch,
-					ResultRelInfo *partRelInfo,
-					int partidx,
-					bool is_borrowed_rel)
+static inline void
+InitRootToPartitionMap(ResultRelInfo *partRelInfo,
+					   ResultRelInfo *rootRelInfo,
+					   EState *estate)
 {
-	ResultRelInfo *rootRelInfo = partRelInfo->ri_RootResultRelInfo;
-	MemoryContext oldcxt;
-	int			rri_index;
-
-	oldcxt = MemoryContextSwitchTo(proute->memcxt);
-
 	/*
 	 * Set up a tuple conversion map to convert a tuple routed to the
 	 * partition from the parent's type to the partition's.
@@ -907,6 +992,30 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
 	}
 	else
 		partRelInfo->ri_PartitionTupleSlot = NULL;
+}
+
+/*
+ * ExecInitRoutingInfo
+ *		Set up information needed for translating tuples between root
+ *		partitioned table format and partition format, and keep track of it
+ *		in PartitionTupleRouting.
+ */
+static void
+ExecInitRoutingInfo(ModifyTableState *mtstate,
+					EState *estate,
+					PartitionTupleRouting *proute,
+					PartitionDispatch dispatch,
+					ResultRelInfo *partRelInfo,
+					int partidx,
+					bool is_borrowed_rel)
+{
+	ResultRelInfo *rootRelInfo = partRelInfo->ri_RootResultRelInfo;
+	MemoryContext oldcxt;
+	int			rri_index;
+
+	oldcxt = MemoryContextSwitchTo(proute->memcxt);
+
+	InitRootToPartitionMap(partRelInfo, rootRelInfo, estate);
 
 	/*
 	 * If the partition is a foreign table, let the FDW init itself for
@@ -1051,6 +1160,9 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		pd->tupslot = NULL;
 	}
 
+	pd->savedPartResultInfo = NULL;
+	pd->savedPartDispatchInfo = NULL;
+
 	/*
 	 * Initialize with -1 to signify that the corresponding partition's
 	 * ResultRelInfo or PartitionDispatch has not been created yet.
@@ -1094,6 +1206,8 @@ ExecInitPartitionDispatchInfo(EState *estate,
 		ResultRelInfo *rri = makeNode(ResultRelInfo);
 
 		InitResultRelInfo(rri, rel, 0, rootResultRelInfo, 0);
+		/* The map is needed in CanUseSavedPartitionForTuple(). */
+		InitRootToPartitionMap(rri, rootResultRelInfo, estate);
 		proute->nonleaf_partitions[dispatchidx] = rri;
 	}
 	else
-- 
2.24.1

#32Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#28)
Re: Skip partition tuple routing with constant partition key

On Thu, May 27, 2021 at 11:47 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

About teaching relcache about caching the target partition.

David-san suggested cache the partidx in PartitionDesc.
And it will need looping and checking the cached value at each level.
I was thinking can we cache a partidx list[1, 2 ,3], and then we can follow
the list to get the last partition and do the partition CHECK only for the last
partition. If any unexpected thing happen, we can return to the original table
and redo the tuple routing without using the cached index.
What do you think ?

Where are you thinking to cache the partidx list? Inside
PartitionDesc or some executor struct?

--
Amit Langote
EDB: http://www.enterprisedb.com

#33houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#32)
RE: Skip partition tuple routing with constant partition key

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 27, 2021 1:54 PM

On Thu, May 27, 2021 at 11:47 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

About teaching relcache about caching the target partition.

David-san suggested cache the partidx in PartitionDesc.
And it will need looping and checking the cached value at each level.
I was thinking can we cache a partidx list[1, 2 ,3], and then we can
follow the list to get the last partition and do the partition CHECK
only for the last partition. If any unexpected thing happen, we can
return to the original table and redo the tuple routing without using the

cached index.

What do you think ?

Where are you thinking to cache the partidx list? Inside PartitionDesc or some
executor struct?

I was thinking of caching the partidx list in PartitionDescData, which is in the relcache; if possible, we can
reuse the cached partition between statements.

Best regards,
houzj

#34Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#33)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Thu, May 27, 2021 at 3:56 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 27, 2021 1:54 PM

On Thu, May 27, 2021 at 11:47 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

About teaching relcache about caching the target partition.

David-san suggested cache the partidx in PartitionDesc.
And it will need looping and checking the cached value at each level.
I was thinking can we cache a partidx list[1, 2 ,3], and then we can
follow the list to get the last partition and do the partition CHECK
only for the last partition. If any unexpected thing happen, we can
return to the original table and redo the tuple routing without using the

cached index.

What do you think ?

Where are you thinking to cache the partidx list? Inside PartitionDesc or some
executor struct?

I was thinking of caching the partidx list in PartitionDescData, which is in the relcache; if possible, we can
reuse the cached partition between statements.

Ah, okay. I thought you were talking about a different idea. How and
where would you determine that a cached partidx value is indeed the
correct one for a given row?

Anyway, do you want to try writing a patch to see how it might work?

--
Amit Langote
EDB: http://www.enterprisedb.com

#35houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#34)
1 attachment(s)
RE: Skip partition tuple routing with constant partition key

Hi Amit-san

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 27, 2021 4:46 PM

Hou-san,

On Thu, May 27, 2021 at 3:56 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>
Sent: Thursday, May 27, 2021 1:54 PM

On Thu, May 27, 2021 at 11:47 AM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

About teaching relcache about caching the target partition.

David-san suggested cache the partidx in PartitionDesc.
And it will need looping and checking the cached value at each level.
I was thinking can we cache a partidx list[1, 2 ,3], and then we
can follow the list to get the last partition and do the partition
CHECK only for the last partition. If any unexpected thing happen,
we can return to the original table and redo the tuple routing
without using the

cached index.

What do you think ?

Where are you thinking to cache the partidx list? Inside
PartitionDesc or some executor struct?

I was thinking cache the partidx list in PartitionDescData which is in
relcache, if possible, we can use the cached partition between statements.

Ah, okay. I thought you were talking about a different idea.
How and where would you determine that a cached partidx value is indeed the correct one for
a given row?
Anyway, do you want to try writing a patch to see how it might work?

Yeah, the different idea here is to see if it is possible to share the cached
partition info between statements efficiently.

But after some research, I found something unexpected:
Currently, we try to use ExecPartitionCheck to check whether the cached
partition is the correct one. If we want to share the cached partition
between statements, we need to invoke ExecPartitionCheck for single-row INSERT,
but the first ExecPartitionCheck call has to build the expression state
tree for the partition. From some simple performance tests, the cost of building
the state tree could exceed what the cached partition saves, which could cause
a performance degradation.

So, if we want to share the cached partition between statements, it seems we cannot
use ExecPartitionCheck. Instead, I tried directly invoking the partition support
function (partsupfunc) to check whether the cached info is correct. In this approach, I
cache the *bound offset* in PartitionDescData, and we can use the bound offset
to get the bound datum from PartitionBoundInfoData and invoke the partsupfunc
to do the check.
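
In essence: remember the offset of the bound that matched for the previous tuple, and for the next tuple re-check only that bound before falling back to the full binary search. Below is a minimal standalone sketch of that fast path, using plain integer bounds instead of PostgreSQL's datums and partsupfunc comparisons; all names in it are hypothetical, not the patch's actual API.

```c
#include <assert.h>

#define NBOUNDS 5               /* 4 range partitions: [0,10) [10,20) [20,30) [30,40) */

typedef struct
{
	int bounds[NBOUNDS];        /* sorted partition bound values */
	int cached_offset;          /* most recently matched partition, or -1 */
} PartCache;

/* Fast path: does the cached partition's bound pair still accept this value? */
static int
check_cached_partition(const PartCache *pc, int value)
{
	int off = pc->cached_offset;

	if (off >= 0 && off < NBOUNDS - 1 &&
		value >= pc->bounds[off] && value < pc->bounds[off + 1])
		return off;
	return -1;                  /* cache miss: caller must do the full search */
}

/* Slow path: binary-search the bounds, then remember the chosen offset. */
static int
find_partition(PartCache *pc, int value)
{
	int part = check_cached_partition(pc, value);

	if (part < 0)
	{
		int lo = 0, hi = NBOUNDS - 1;

		while (hi - lo > 1)
		{
			int mid = (lo + hi) / 2;

			if (value >= pc->bounds[mid])
				lo = mid;
			else
				hi = mid;
		}
		part = (value >= pc->bounds[0] && value < pc->bounds[NBOUNDS - 1]) ? lo : -1;
		pc->cached_offset = part;
	}
	return part;
}
```

In the real patch, the per-bound comparison has to go through partsupfunc with the partition's collation, and the cached offset must be treated as a hint that is re-validated on every use, since the PartitionDesc can be rebuilt between statements.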

Attached is a POC patch for it, just to share an idea about sharing the cached partition info
between statements.

Best regards,
houzj

Attachments:

0001-cache-bound-offset.patchapplication/octet-stream; name=0001-cache-bound-offset.patchDownload
From b86db0abe56ee6126c3dbf5e3f59da9baaff594f Mon Sep 17 00:00:00 2001
From: houzj <houzj.fnst@fujitsu.com>
Date: Tue, 1 Jun 2021 14:25:21 +0800
Subject: [PATCH] cache-bound-offset

Cached the bound offset in PartitionDescData.

Every time we try to find a partition, we first use the cached offset to get
the target bound datums (lower bound value and upper bound value) from
PartitionBoundInfoData and check whether the partition key value matches them.
If it matches, we skip get_partition_for_tuple.

Currently, the bound offset is cached only for the LIST and RANGE partition strategies.

---
 src/backend/executor/execPartition.c | 170 +++++++++++++++++++++++++++++------
 src/backend/partitioning/partdesc.c  |   2 +
 src/include/partitioning/partdesc.h  |   2 +
 3 files changed, 147 insertions(+), 27 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920..7885f5d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -176,7 +176,9 @@ static void FormPartitionKeyDatum(PartitionDispatch pd,
 								  Datum *values,
 								  bool *isnull);
 static int	get_partition_for_tuple(PartitionDispatch pd, Datum *values,
-									bool *isnull);
+									bool *isnull, int *bound_offset);
+static int check_partition_for_tuple(PartitionDispatch pd, Datum *values,
+									  bool *isnull, int partidx);
 static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 												  Datum *values,
 												  bool *isnull,
@@ -287,12 +289,14 @@ ExecFindPartition(ModifyTableState *mtstate,
 	while (dispatch != NULL)
 	{
 		int			partidx = -1;
+		int			cached_bound_offset = -1;
 		bool		is_leaf;
 
 		CHECK_FOR_INTERRUPTS();
 
 		rel = dispatch->reldesc;
 		partdesc = dispatch->partdesc;
+		cached_bound_offset = partdesc->bound_offset;
 
 		/*
 		 * Extract partition key from tuple. Expression evaluation machinery
@@ -305,26 +309,39 @@ ExecFindPartition(ModifyTableState *mtstate,
 		ecxt->ecxt_scantuple = slot;
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
-		/*
-		 * If this partitioned table has no partitions or no partition for
-		 * these values, error out.
-		 */
-		if (partdesc->nparts == 0 ||
-			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
+		/* First check the most recently chosen bound offset, if cached */
+		if (partdesc->nparts > 0 &&
+			cached_bound_offset >= 0 &&
+			cached_bound_offset < partdesc->boundinfo->ndatums)
 		{
-			char	   *val_desc;
-
-			val_desc = ExecBuildSlotPartitionKeyDescription(rel,
-															values, isnull, 64);
-			Assert(OidIsValid(RelationGetRelid(rel)));
-			ereport(ERROR,
-					(errcode(ERRCODE_CHECK_VIOLATION),
-					 errmsg("no partition of relation \"%s\" found for row",
-							RelationGetRelationName(rel)),
-					 val_desc ?
-					 errdetail("Partition key of the failing row contains %s.",
-							   val_desc) : 0,
-					 errtable(rel)));
+			partidx = check_partition_for_tuple(dispatch, values, isnull,
+												cached_bound_offset);
+		}
+
+		if (partidx < 0)
+		{
+			/*
+			 * If this partitioned table has no partitions or no partition for
+			 * these values, error out.
+			 */
+			if (partdesc->nparts == 0 ||
+				(partidx = get_partition_for_tuple(dispatch, values, isnull,
+												&partdesc->bound_offset)) < 0)
+			{
+				char	   *val_desc;
+
+				val_desc = ExecBuildSlotPartitionKeyDescription(rel,
+																values, isnull, 64);
+				Assert(OidIsValid(RelationGetRelid(rel)));
+				ereport(ERROR,
+						(errcode(ERRCODE_CHECK_VIOLATION),
+						 errmsg("no partition of relation \"%s\" found for row",
+								RelationGetRelationName(rel)),
+						 val_desc ?
+						 errdetail("Partition key of the failing row contains %s.",
+								   val_desc) : 0,
+						 errtable(rel)));
+			}
 		}
 
 		is_leaf = partdesc->is_leaf[partidx];
@@ -475,6 +492,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 	ecxt->ecxt_scantuple = ecxt_scantuple_saved;
 	MemoryContextSwitchTo(oldcxt);
 
+
 	return rri;
 }
 
@@ -1232,6 +1250,104 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 }
 
 /*
+ * check_partition_for_tuple
+ *		Check if the tuple value matches the target partition bounds.
+ *
+ * Return value is index of the partition (>= 0 and < partdesc->nparts) if
+ * match or -1 if not.
+ */
+static int
+check_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull, int bound_offset)
+{
+	PartitionKey	key = pd->key;
+	PartitionDesc	partdesc = pd->partdesc;
+	PartitionBoundInfo boundinfo = partdesc->boundinfo;
+	FmgrInfo	   *partsupfunc = key->partsupfunc;
+	Oid			   *partcollation = key->partcollation;
+	Datum		  **datums = boundinfo->datums;
+	int				part_index = -1;
+	int				offset = bound_offset;
+	int32			cmpval;
+
+	/*
+	 * Compare the tuple value with the target partition bounds based on
+	 * partitioning strategy.
+	 *
+	 * Only the LIST and RANGE strategies are checked.
+	 */
+	switch (key->strategy)
+	{
+		case PARTITION_STRATEGY_LIST:
+			if (!isnull[0])
+			{
+				cmpval = DatumGetInt32(FunctionCall2Coll(&partsupfunc[0],
+														 partcollation[0],
+														 datums[offset][0],
+														 values[0]));
+				if (cmpval == 0)
+					part_index = boundinfo->indexes[offset];
+			}
+			break;
+		case PARTITION_STRATEGY_RANGE:
+			{
+				int			i;
+				bool		is_default = false;
+				int16		partnatts = key->partnatts;
+				PartitionRangeDatumKind **kind = boundinfo->kind;
+
+				/*
+				 * No range includes NULL, so this will be accepted by the
+				 * default partition if there is one, and otherwise rejected.
+				 */
+				for (i = 0; i < partnatts; i++)
+				{
+					if (isnull[i])
+					{
+						is_default = true;
+						break;
+					}
+				}
+
+				if (!is_default)
+				{
+					/* Check if the value is above the low bound */
+					cmpval = partition_rbound_datum_cmp(partsupfunc,
+														partcollation,
+														datums[offset],
+														kind[offset],
+														values,
+														partnatts);
+					if (cmpval == 0)
+						part_index = boundinfo->indexes[offset + 1];
+
+					else if (cmpval < 0 && offset + 1 < boundinfo->ndatums)
+					{
+						/* Check if the value is below the high bound */
+						offset ++;
+						cmpval = partition_rbound_datum_cmp(partsupfunc,
+															partcollation,
+															datums[offset],
+															kind[offset],
+															values,
+															partnatts);
+
+						if (cmpval > 0)
+							part_index = boundinfo->indexes[offset];
+					}
+				}
+				else if (boundinfo->indexes[offset + 1] == boundinfo->default_index)
+					part_index = boundinfo->indexes[offset + 1];
+			}
+			break;
+		default:
+			break;
+	}
+
+	return part_index;
+
+}
+
+/*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
  *		in values and isnull
@@ -1240,9 +1356,8 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  * found or -1 if none found.
  */
 static int
-get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
+get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull, int *bound_offset)
 {
-	int			bound_offset;
 	int			part_index = -1;
 	PartitionKey key = pd->key;
 	PartitionDesc partdesc = pd->partdesc;
@@ -1261,6 +1376,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 													   values, isnull);
 
 				part_index = boundinfo->indexes[rowHash % boundinfo->nindexes];
+				*bound_offset = -1;
 			}
 			break;
 
@@ -1274,12 +1390,12 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			{
 				bool		equal = false;
 
-				bound_offset = partition_list_bsearch(key->partsupfunc,
+				*bound_offset = partition_list_bsearch(key->partsupfunc,
 													  key->partcollation,
 													  boundinfo,
 													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				if (*bound_offset >= 0 && equal)
+					part_index = boundinfo->indexes[*bound_offset];
 			}
 			break;
 
@@ -1304,7 +1420,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+					*bound_offset = partition_range_datum_bsearch(key->partsupfunc,
 																 key->partcollation,
 																 boundinfo,
 																 key->partnatts,
@@ -1317,7 +1433,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 					 * bound of the partition we're looking for, if there
 					 * actually exists one.
 					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = boundinfo->indexes[*bound_offset + 1];
 				}
 			}
 			break;
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a9d6a9..b072da2 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -285,6 +285,8 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 		MemoryContextAllocZero(new_pdcxt, sizeof(PartitionDescData));
 	partdesc->nparts = nparts;
 	partdesc->detached_exist = detached_exist;
+	partdesc->bound_offset = -1;
+
 	/* If there are no partitions, the rest of the partdesc can stay zero */
 	if (nparts > 0)
 	{
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index 0792f48..70ab0a2 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,8 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+	int			bound_offset;	/* offset of bound info that was most recently
+								 * chosen */
 } PartitionDescData;
 
 
-- 
2.7.2.windows.1

#36Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#35)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Tue, Jun 1, 2021 at 5:43 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

From: Amit Langote <amitlangote09@gmail.com>

Where are you thinking to cache the partidx list? Inside
PartitionDesc or some executor struct?

I was thinking cache the partidx list in PartitionDescData which is in
relcache, if possible, we can use the cached partition between statements.

Ah, okay. I thought you were talking about a different idea.
How and where would you determine that a cached partidx value is indeed the correct one for
a given row?
Anyway, do you want to try writing a patch to see how it might work?

Yeah, the different idea here is to see if it is possible to share the cached
partition info between statements efficiently.

But, after some research, I found something not as expected:

Thanks for investigating this.

Currently, we tried using ExecPartitionCheck to check whether the cached
partition is the correct one. And if we want to share the cached partition
between statements, we need to invoke ExecPartitionCheck for single-row INSERTs,
but the first ExecPartitionCheck call has to build the expression state
tree for the partition. From some simple performance tests, the cost of building
the state tree could exceed what the cached partition saves, which could bring a
performance degradation.

Yeah, using the executor in the lower layer will defeat the whole
point of caching in that layer.

So, if we want to share the cached partition between statements, it seems we cannot
use ExecPartitionCheck. Instead, I tried directly invoking the partition support
function (partsupfunc) to check whether the cached info is correct. In this approach I
cache the *bound offset* in PartitionDescData, so we can use the bound offset
to get the bound datum from PartitionBoundInfoData and invoke the partsupfunc
to do the check.

Attach a POC patch about it. Just to share an idea about sharing cached partition info
between statements.
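To make the shape of that recheck concrete, here is a minimal, self-contained C sketch of the bound-offset cache idea. A sorted integer array and a plain integer compare stand in for the real PartitionBoundInfoData and partsupfunc machinery; all names here are illustrative, not PostgreSQL's:

```c
#include <assert.h>

/*
 * Minimal model of the bound-offset cache for list partitioning:
 * "bounds" stands in for the sorted bound datums, and an integer
 * compare replaces the partsupfunc call.
 */
static int	cached_off = -1;

static int
find_partition(const int *bounds, int nbounds, int key)
{
	int			lo = 0;
	int			hi = nbounds - 1;

	/* Fast path: recheck the previously matched bound with one compare. */
	if (cached_off >= 0 && bounds[cached_off] == key)
		return cached_off;

	/* Slow path: the usual binary search over the bound datums. */
	while (lo <= hi)
	{
		int			mid = (lo + hi) / 2;

		if (bounds[mid] == key)
		{
			cached_off = mid;	/* remember for the next tuple */
			return mid;
		}
		else if (bounds[mid] < key)
			lo = mid + 1;
		else
			hi = mid - 1;
	}
	return -1;					/* no partition accepts this key */
}
```

When consecutive tuples carry the same key, only the single fast-path comparison runs, which is the saving the patch is after.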

I have not looked at your patch yet, but yeah, that's how I would
imagine doing it.

--
Amit Langote
EDB: http://www.enterprisedb.com

#37Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#36)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Thu, Jun 3, 2021 at 8:48 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Tue, Jun 1, 2021 at 5:43 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

So, If we want to share the cached partition between statements, we seems cannot
use ExecPartitionCheck. Instead, I tried directly invoke the partition support
function(partsupfunc) to check If the cached info is correct. In this approach I
tried cache the *bound offset* in PartitionDescData, and we can use the bound offset
to get the bound datum from PartitionBoundInfoData and invoke the partsupfunc
to do the CHECK.

Attach a POC patch about it. Just to share an idea about sharing cached partition info
between statements.

I have not looked at your patch yet, but yeah that's what I would
imagine doing it.

Just read it and think it looks promising.

On code, I wonder why not add the rechecking-cached-offset code
directly in get_partition_for_tuple(), instead of adding a whole new
function for that. Can you please check the attached revised version?

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

0001-cache-bound-offset_v2.patchapplication/octet-stream; name=0001-cache-bound-offset_v2.patchDownload
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..a775bda553 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1246,12 +1246,14 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	int			part_index = -1;
 	PartitionKey key = pd->key;
 	PartitionDesc partdesc = pd->partdesc;
+	int			cached_off = partdesc->cached_bound_offset;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
 		case PARTITION_STRATEGY_HASH:
+			Assert(cached_off < 0);
 			{
 				uint64		rowHash;
 
@@ -1272,14 +1274,33 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				if (cached_off >= 0)
+				{
+					Datum	bound_datum = boundinfo->datums[cached_off][0];
+					int32	cmpval;
+
+					cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+															 key->partcollation[0],
+															 bound_datum,
+															 values[0]));
+					if (cmpval == 0)
+						part_index = boundinfo->indexes[cached_off];
+				}
+
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						partdesc->cached_bound_offset = bound_offset;
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1325,56 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
+					if (cached_off >= 0)
+					{
+						Datum   *bound_datums = boundinfo->datums[cached_off];
+						PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+						int32	cmpval;
+
+						/* Check if the value is above the low bound */
+						cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+															key->partcollation,
+															bound_datums,
+															bound_kind,
+															values,
+															key->partnatts);
+						if (cmpval == 0)
+							part_index = boundinfo->indexes[cached_off + 1];
+						else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+						{
+							/* Check if the value is below the high bound */
+							bound_datums = boundinfo->datums[cached_off + 1];
+							bound_kind = boundinfo->kind[cached_off + 1];
+							cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+																key->partcollation,
+																bound_datums,
+																bound_kind,
+																values,
+																key->partnatts);
+
+							if (cmpval > 0)
+								part_index = boundinfo->indexes[cached_off + 1];
+						}
+					}
 
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						partdesc->cached_bound_offset = bound_offset;
+					}
 				}
 			}
 			break;
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a9d6a9643..e6ec01de71 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -285,6 +285,8 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 		MemoryContextAllocZero(new_pdcxt, sizeof(PartitionDescData));
 	partdesc->nparts = nparts;
 	partdesc->detached_exist = detached_exist;
+	partdesc->cached_bound_offset = -1;
+
 	/* If there are no partitions, the rest of the partdesc can stay zero */
 	if (nparts > 0)
 	{
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index 0792f48507..9df28ae14f 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,9 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+	int			cached_bound_offset; /* offset of the bound datum most
+									  * recently chosen by
+									  * get_partition_for_tuple() */
 } PartitionDescData;
 
 
#38Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#37)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Fri, Jun 4, 2021 at 4:38 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Jun 3, 2021 at 8:48 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Tue, Jun 1, 2021 at 5:43 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

So, If we want to share the cached partition between statements, we seems cannot
use ExecPartitionCheck. Instead, I tried directly invoke the partition support
function(partsupfunc) to check If the cached info is correct. In this approach I
tried cache the *bound offset* in PartitionDescData, and we can use the bound offset
to get the bound datum from PartitionBoundInfoData and invoke the partsupfunc
to do the CHECK.

Attach a POC patch about it. Just to share an idea about sharing cached partition info
between statements.

I have not looked at your patch yet, but yeah that's what I would
imagine doing it.

Just read it and think it looks promising.

On code, I wonder why not add the rechecking-cached-offset code
directly in get_partition_for_tuple(), instead of adding a whole new
function for that. Can you please check the attached revised version?

Here's another, slightly more polished version of that. Also, I added
a check_cached parameter to get_partition_for_tuple() to allow the
caller to disable checking the cached version.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

0001-cache-bound-offset_v3.patchapplication/octet-stream; name=0001-cache-bound-offset_v3.patchDownload
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..610f98aab1 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -176,7 +176,7 @@ static void FormPartitionKeyDatum(PartitionDispatch pd,
 								  Datum *values,
 								  bool *isnull);
 static int	get_partition_for_tuple(PartitionDispatch pd, Datum *values,
-									bool *isnull);
+									bool *isnull, bool check_cached);
 static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 												  Datum *values,
 												  bool *isnull,
@@ -271,6 +271,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 	TupleTableSlot *myslot = NULL;
 	MemoryContext oldcxt;
 	ResultRelInfo *rri = NULL;
+	bool		check_cached = true;
 
 	/* use per-tuple context here to avoid leaking memory */
 	oldcxt = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -310,7 +311,8 @@ ExecFindPartition(ModifyTableState *mtstate,
 		 * these values, error out.
 		 */
 		if (partdesc->nparts == 0 ||
-			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
+			(partidx = get_partition_for_tuple(dispatch, values, isnull,
+											   check_cached)) < 0)
 		{
 			char	   *val_desc;
 
@@ -1236,11 +1238,16 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  *		Finds partition of relation which accepts the partition key specified
  *		in values and isnull
  *
+ * If check_cached is true, this short-circuits a full-blown search across all
+ * the bounds, instead checking the tuple against the bound whose offset is
+ * cached in pd->partdesc.
+ *
  * Return value is index of the partition (>= 0 and < partdesc->nparts) if one
  * found or -1 if none found.
  */
 static int
-get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
+get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull,
+						bool check_cached)
 {
 	int			bound_offset;
 	int			part_index = -1;
@@ -1248,10 +1255,18 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	/*
+	 * If the cached bound offset is valid, we check below whether that bound
+	 * is satisfied by the new tuple.  If it is, there's no need to perform a
+	 * search across all bounds.
+	 */
+	int			cached_off = partdesc->cached_bound_offset;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
 		case PARTITION_STRATEGY_HASH:
+			Assert(cached_off < 0);
 			{
 				uint64		rowHash;
 
@@ -1272,14 +1287,33 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				if (cached_off >= 0 && check_cached)
+				{
+					Datum	bound_datum = boundinfo->datums[cached_off][0];
+					int32	cmpval;
+
+					cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+															 key->partcollation[0],
+															 bound_datum,
+															 values[0]));
+					if (cmpval == 0)
+						part_index = boundinfo->indexes[cached_off];
+				}
+
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						partdesc->cached_bound_offset = bound_offset;
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1338,56 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
+					if (cached_off >= 0 && check_cached)
+					{
+						Datum   *bound_datums = boundinfo->datums[cached_off];
+						PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+						int32	cmpval;
+
+						/* Check if the value is above the low bound */
+						cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+															key->partcollation,
+															bound_datums,
+															bound_kind,
+															values,
+															key->partnatts);
+						if (cmpval == 0)
+							part_index = boundinfo->indexes[cached_off + 1];
+						else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+						{
+							/* Check if the value is below the high bound */
+							bound_datums = boundinfo->datums[cached_off + 1];
+							bound_kind = boundinfo->kind[cached_off + 1];
+							cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+																key->partcollation,
+																bound_datums,
+																bound_kind,
+																values,
+																key->partnatts);
+
+							if (cmpval > 0)
+								part_index = boundinfo->indexes[cached_off + 1];
+						}
+					}
 
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						partdesc->cached_bound_offset = bound_offset;
+					}
 				}
 			}
 			break;
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a9d6a9643..e6ec01de71 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -285,6 +285,8 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 		MemoryContextAllocZero(new_pdcxt, sizeof(PartitionDescData));
 	partdesc->nparts = nparts;
 	partdesc->detached_exist = detached_exist;
+	partdesc->cached_bound_offset = -1;
+
 	/* If there are no partitions, the rest of the partdesc can stay zero */
 	if (nparts > 0)
 	{
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index 0792f48507..9df28ae14f 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,9 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+	int			cached_bound_offset; /* offset of the bound datum most
+									  * recently chosen by
+									  * get_partition_for_tuple() */
 } PartitionDescData;
 
 
#39Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#38)
Re: Skip partition tuple routing with constant partition key

On Fri, Jun 4, 2021 at 6:05 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Fri, Jun 4, 2021 at 4:38 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Jun 3, 2021 at 8:48 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Tue, Jun 1, 2021 at 5:43 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

So, If we want to share the cached partition between statements, we seems cannot
use ExecPartitionCheck. Instead, I tried directly invoke the partition support
function(partsupfunc) to check If the cached info is correct. In this approach I
tried cache the *bound offset* in PartitionDescData, and we can use the bound offset
to get the bound datum from PartitionBoundInfoData and invoke the partsupfunc
to do the CHECK.

Attach a POC patch about it. Just to share an idea about sharing cached partition info
between statements.

I have not looked at your patch yet, but yeah that's what I would
imagine doing it.

Just read it and think it looks promising.

On code, I wonder why not add the rechecking-cached-offset code
directly in get_partition_for_tuple(), instead of adding a whole new
function for that. Can you please check the attached revised version?

I should have clarified a bit more on why I think a new function
looked unnecessary to me. The thing about that function that bothered
me was that it appeared to duplicate a lot of code fragments of
get_partition_for_tuple(). That kind of duplication often leads to
bugs of omission later if something from either function needs to
change.

--
Amit Langote
EDB: http://www.enterprisedb.com

#40houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: Amit Langote (#39)
1 attachment(s)
RE: Skip partition tuple routing with constant partition key

Hi Amit-san

From: Amit Langote <amitlangote09@gmail.com>

On Fri, Jun 4, 2021 at 6:05 PM Amit Langote <amitlangote09@gmail.com>
wrote:

On Fri, Jun 4, 2021 at 4:38 PM Amit Langote <amitlangote09@gmail.com>

wrote:

On Thu, Jun 3, 2021 at 8:48 PM Amit Langote <amitlangote09@gmail.com>

wrote:

On Tue, Jun 1, 2021 at 5:43 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

So, If we want to share the cached partition between statements,
we seems cannot use ExecPartitionCheck. Instead, I tried
directly invoke the partition support
function(partsupfunc) to check If the cached info is correct. In
this approach I tried cache the *bound offset* in
PartitionDescData, and we can use the bound offset to get the
bound datum from PartitionBoundInfoData and invoke the

partsupfunc to do the CHECK.

Attach a POC patch about it. Just to share an idea about sharing
cached partition info between statements.

I have not looked at your patch yet, but yeah that's what I would
imagine doing it.

Just read it and think it looks promising.

On code, I wonder why not add the rechecking-cached-offset code
directly in get_partition_for_tuple(), instead of adding a whole new
function for that. Can you please check the attached revised version?

I should have clarified a bit more on why I think a new function looked
unnecessary to me. The thing about that function that bothered me was that
it appeared to duplicate a lot of code fragments of get_partition_for_tuple().
That kind of duplication often leads to bugs of omission later if something from
either function needs to change.

Thanks for the patch and explanation, I think you are right that it’s better to add
the rechecking-cached-offset code directly in get_partition_for_tuple().

And now, I think maybe it's time to try to optimize the performance.
Currently, if every row to be inserted in a statement belongs to a different
partition, the cache check code brings a slight performance
degradation (AFAICS: 2% ~ 4%).

So, if we want to solve this, we may need either 1) a reloption to let the user control whether to use the cache,
or 2) some simple strategy that decides whether to use the cache automatically.

I have not written a patch for 1) the reloption, because I think it would be nice if we can
enable this cache feature by default. So, I attached a WIP patch for approach 2).

The rough idea is to check, every 1000 rows, the average batch size (the number of
consecutive rows routed to the same partition). If the average batch size is
greater than 1, we enable the cache check; if not, we disable it. This is similar to what 0d5f05cde0 did.
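A minimal sketch of that windowed heuristic, as a simplified model that mirrors the shape of the WIP patch rather than its exact code (the struct, function name, and threshold are illustrative):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Every-1000-rows heuristic: enable the cached-bound check only when, over
 * the last window, tuples outnumbered bound-offset changes, i.e. the
 * average run of same-partition tuples exceeded 1.
 */
#define RECHECK_BOUND_CACHE_THRESHOLD 1000

typedef struct RouteStats
{
	int			ntupinserts;	/* tuples routed in the current window */
	int			npartchanges;	/* times the chosen bound offset changed */
	bool		check_cached;	/* is the fast path currently enabled? */
} RouteStats;

static void
maybe_toggle_cache(RouteStats *st)
{
	if (st->ntupinserts <= RECHECK_BOUND_CACHE_THRESHOLD)
		return;					/* window not full yet */

	if (st->npartchanges == 0)
		st->check_cached = true;
	else
		st->check_cached = (st->ntupinserts / st->npartchanges) > 1;

	/* Start a fresh window. */
	st->ntupinserts = st->npartchanges = 0;
}
```

With adversarial input (every row in a different partition) the check stays disabled, so the 2% ~ 4% regression from useless rechecks is avoided, while clustered input re-enables it at the next window boundary.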

Thoughts ?

Best regards,
houzj

Attachments:

0001-WIP-cache-bound-offset-adaptively_v4.patchapplication/octet-stream; name=0001-WIP-cache-bound-offset-adaptively_v4.patchDownload
From 7d0bd28f75cf2b26706fe5cc8473aa01950286cb Mon Sep 17 00:00:00 2001
From: "houzj.fnst" <houzj.fnst@cn.fujitsu.com>
Date: Mon, 7 Jun 2021 18:57:14 +0800
Subject: [PATCH] cache-bound-offset_v3_modify

---
 src/backend/executor/execPartition.c | 149 ++++++++++++++++++++++-----
 src/backend/partitioning/partdesc.c  |   2 +
 src/include/partitioning/partdesc.h  |   3 +
 3 files changed, 130 insertions(+), 24 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..122d330a1d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -150,9 +150,15 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	short		ntupinserts;
+	short		npartchanges;
+	bool		force;
+	bool		check_cached;
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
+#define RECHECK_BOUND_CACHE_THRESHOLD 1000
 
 static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
 											EState *estate, PartitionTupleRouting *proute,
@@ -176,7 +182,7 @@ static void FormPartitionKeyDatum(PartitionDispatch pd,
 								  Datum *values,
 								  bool *isnull);
 static int	get_partition_for_tuple(PartitionDispatch pd, Datum *values,
-									bool *isnull);
+									bool *isnull, bool check_cached);
 static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
 												  Datum *values,
 												  bool *isnull,
@@ -305,12 +311,23 @@ ExecFindPartition(ModifyTableState *mtstate,
 		ecxt->ecxt_scantuple = slot;
 		FormPartitionKeyDatum(dispatch, slot, estate, values, isnull);
 
+		if (dispatch->ntupinserts > RECHECK_BOUND_CACHE_THRESHOLD &&
+				!dispatch->force)
+		{
+			if (dispatch->npartchanges == 0)
+				dispatch->check_cached = true;
+			else
+				dispatch->check_cached = ((dispatch->ntupinserts/dispatch->npartchanges) > 1);
+			dispatch->ntupinserts = dispatch->npartchanges = 0;
+		}
+
 		/*
 		 * If this partitioned table has no partitions or no partition for
 		 * these values, error out.
 		 */
 		if (partdesc->nparts == 0 ||
-			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
+			(partidx = get_partition_for_tuple(dispatch, values, isnull,
+							dispatch->check_cached)) < 0)
 		{
 			char	   *val_desc;
 
@@ -1026,6 +1043,12 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+	pd->ntupinserts = 0;
+	pd->npartchanges = 0;
+	partdesc->cached_bound_offset = -1;
+
+	pd->check_cached = false;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1236,11 +1259,16 @@ FormPartitionKeyDatum(PartitionDispatch pd,
  *		Finds partition of relation which accepts the partition key specified
  *		in values and isnull
  *
+ * If check_cached is true, this short-circuits a full-blown search across all
+ * the bounds, instead checking the tuple against the bound whose offset is
+ * cached in pd->partdesc.
+ *
  * Return value is index of the partition (>= 0 and < partdesc->nparts) if one
  * found or -1 if none found.
  */
 static int
-get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
+get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull,
+						bool check_cached)
 {
 	int			bound_offset;
 	int			part_index = -1;
@@ -1248,10 +1276,18 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	/*
+	 * If the cached bound offset is valid, we check below whether that bound
+	 * is satisfied by the new tuple.  If it is, there's no need to perform a
+	 * search across all bounds.
+	 */
+	int			cached_off = partdesc->cached_bound_offset;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
 		case PARTITION_STRATEGY_HASH:
+			Assert(cached_off < 0);
 			{
 				uint64		rowHash;
 
@@ -1265,6 +1301,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			break;
 
 		case PARTITION_STRATEGY_LIST:
+			pd->ntupinserts++;
 			if (isnull[0])
 			{
 				if (partition_bound_accepts_nulls(boundinfo))
@@ -1272,14 +1309,37 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				if (cached_off >= 0 && check_cached)
+				{
+					Datum	bound_datum = boundinfo->datums[cached_off][0];
+					int32	cmpval;
+
+					cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+															 key->partcollation[0],
+															 bound_datum,
+															 values[0]));
+					if (cmpval == 0)
+						part_index = boundinfo->indexes[cached_off];
+				}
+
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (cached_off != bound_offset)
+						{
+							pd->npartchanges++;
+							partdesc->cached_bound_offset = bound_offset;
+						}
+					}
+				}
 			}
 			break;
 
@@ -1289,6 +1349,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 							range_partkey_has_null = false;
 				int			i;
 
+				pd->ntupinserts++;
 				/*
 				 * No range includes NULL, so this will be accepted by the
 				 * default partition if there is one, and otherwise rejected.
@@ -1304,20 +1365,60 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
+					if (cached_off >= 0 && check_cached)
+					{
+						Datum   *bound_datums = boundinfo->datums[cached_off];
+						PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+						int32	cmpval;
+
+						/* Check if the value is above the low bound */
+						cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+															key->partcollation,
+															bound_datums,
+															bound_kind,
+															values,
+															key->partnatts);
+						if (cmpval == 0)
+							part_index = boundinfo->indexes[cached_off + 1];
+						else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+						{
+							/* Check if the value is below the high bound */
+							bound_datums = boundinfo->datums[cached_off + 1];
+							bound_kind = boundinfo->kind[cached_off + 1];
+							cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+																key->partcollation,
+																bound_datums,
+																bound_kind,
+																values,
+																key->partnatts);
+
+							if (cmpval > 0)
+								part_index = boundinfo->indexes[cached_off + 1];
+						}
+					}
 
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (cached_off != bound_offset)
+						{
+							pd->npartchanges++;
+							partdesc->cached_bound_offset = bound_offset;
+						}
+					}
 				}
 			}
 			break;
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 9a9d6a9643..e6ec01de71 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -285,6 +285,8 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 		MemoryContextAllocZero(new_pdcxt, sizeof(PartitionDescData));
 	partdesc->nparts = nparts;
 	partdesc->detached_exist = detached_exist;
+	partdesc->cached_bound_offset = -1;
+
 	/* If there are no partitions, the rest of the partdesc can stay zero */
 	if (nparts > 0)
 	{
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index 0792f48507..9df28ae14f 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,9 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+	int			cached_bound_offset; /* offset of the bound datum most
+									  * recently chosen by
+									  * get_partition_for_tuple() */
 } PartitionDescData;
 
 
-- 
2.18.4

#41Amit Langote
amitlangote09@gmail.com
In reply to: houzj.fnst@fujitsu.com (#40)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

Hou-san,

On Mon, Jun 7, 2021 at 8:38 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

Thanks for the patch and explanation. I think you are right that it’s better to add
the rechecking-cached-offset code directly in get_partition_for_tuple().

And now, I think maybe it's time to try to optimize the performance.
Currently, if every row to be inserted in a statement belongs to a different
partition, the cache check code brings a slight performance
degradation (AFAICS: 2% ~ 4%).

So, if we want to solve this, we may need either 1) a reloption to let the user control whether to use the cache,
or 2) some simple strategy to control whether to use the cache automatically.

I have not written a patch for 1), the reloption, because I think it would be nice if we could
enable this cache feature by default. So, I attached a WIP patch for approach 2).

The rough idea is to check the average batch number every 1000 rows.
If the average batch number is greater than 1, we enable the cache check;
if not, we disable it. This is similar to what commit 0d5f05cde0 did.

Thanks for sharing the idea and writing a patch for it.

I considered a simpler heuristic where we enable/disable caching of a
given offset if it is found by the binary search algorithm at least N
consecutive times. But your idea to check the ratio of the number of
tuples inserted over partition/bound offset changes every N tuples
inserted may be more adaptive.
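To make that ratio heuristic concrete, here is a simplified, hypothetical Python model of the counters in the attached patch. The field names mirror the patch (cached_bound_offset, n_tups_inserted, n_offset_changed), but this is only an illustration of the enable/disable decision, not the C implementation:

```python
CACHE_BOUND_OFFSET_THRESHOLD_TUPS = 1000

class DispatchState:
    """Simplified stand-in for the caching fields of PartitionDispatchData."""
    def __init__(self):
        self.cached_bound_offset = -1   # -1 means caching is disabled
        self.last_seen_offset = -1
        self.n_tups_inserted = 0
        self.n_offset_changed = 0

def maybe_cache(pd, offset):
    """Called only on a cache miss, after the binary search found 'offset'."""
    if offset != pd.last_seen_offset:
        pd.last_seen_offset = offset
        pd.n_offset_changed += 1
        pd.cached_bound_offset = -1
    # Re-assess only after a threshold number of tuples.
    if pd.n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS:
        return
    assert offset != pd.cached_bound_offset  # wouldn't be called on a hit
    # Enable caching iff, on average, each offset was reused at least once.
    if (pd.n_offset_changed == 0 or
            pd.n_tups_inserted / pd.n_offset_changed > 1):
        pd.cached_bound_offset = offset
    else:
        pd.cached_bound_offset = -1
    pd.n_tups_inserted = pd.n_offset_changed = 0

def route(pd, offsets):
    """Feed a stream of looked-up bound offsets; count the cache hits,
    i.e. the lookups that would have skipped the binary search."""
    hits = 0
    for off in offsets:
        pd.n_tups_inserted += 1
        if pd.cached_bound_offset == off:
            hits += 1
            continue
        maybe_cache(pd, off)
    return hits

# Long run of one offset: caching kicks in once the threshold is crossed.
assert route(DispatchState(), [3] * 5000) == 5000 - CACHE_BOUND_OFFSET_THRESHOLD_TUPS

# Every tuple maps to a new offset: the ratio stays at 1, never cached.
assert route(DispatchState(), list(range(5000))) == 0
```

In the first stream every lookup after the initial 1000-tuple assessment window is served from the cache; in the second the ratio of tuples to offset changes never exceeds 1, so the check stays disabled and steady-state overhead is just the counter updates.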

Please find attached a revised version of your patch, where I tried to
make it a bit easier to follow, hopefully. While doing so, I realized
that caching the bound offset across queries makes little sense now,
so I decided to keep the changes confined to execPartition.c. Do you
have a counter-argument to that?
--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

v5-0001-adpative-bound-offset-caching-v5.patch (application/octet-stream)
From 191abc179b3e4e62d2d2720924678d9c1d7271f9 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 15 Jun 2021 16:21:48 +0900
Subject: [PATCH v5] adpative bound offset caching v5

---
 src/backend/executor/execPartition.c | 210 ++++++++++++++++++++++++---
 1 file changed, 187 insertions(+), 23 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..acbf71cb75 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,13 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * cached_bound_offset
+ * last_seen_offset
+ * n_offset_changed
+ * n_tups_inserted
+ *		Fields to manage the state for bound offset caching; see
+ *		maybe_cache_partition_bound_offset()
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,10 +157,15 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	int			cached_bound_offset;
+	int			last_seen_offset;
+	int			n_offset_changed;
+	int			n_tups_inserted;
+
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
-
 static ResultRelInfo *ExecInitPartitionInfo(ModifyTableState *mtstate,
 											EState *estate, PartitionTupleRouting *proute,
 											PartitionDispatch dispatch,
@@ -1026,6 +1038,10 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+
+	pd->cached_bound_offset = pd->last_seen_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1231,6 +1247,134 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * Threshold of the number of tuples to have been processed before
+ * maybe_cache_partition_bound_offset() re-assesses whether caching must be
+ * enabled or disabled for subsequent tuples.
+ */
+#define	CACHE_BOUND_OFFSET_THRESHOLD_TUPS	1000
+
+/*
+ * maybe_cache_partition_bound_offset
+ *		Conditionally sets pd->cached_bound_offset so that
+ *		get_cached_{list|range}_partition can be used for subsequent
+ *		tuples
+ *
+ * It is set if it appears that some offsets observed over the last
+ * pd->n_tups_inserted tuples would have been reused, which can be inferred
+ * from seeing that the ratio of tuples inserted and the number of times the
+ * offset needed to be changed during the insertion of those tuples is greater
+ * than 1.  Conversely, we disable the caching if it the ratio is 1, because
+ * it suggests that every consecutive tuple mapped to a different partition.
+ */
+static inline void
+maybe_cache_partition_bound_offset(PartitionDispatch pd, int offset)
+{
+	if (offset != pd->last_seen_offset)
+	{
+		pd->last_seen_offset = offset;
+		pd->n_offset_changed += 1;
+		/* Only set to the new value after calculating the ratio formula. */
+		pd->cached_bound_offset = -1;
+	}
+
+	/*
+	 * Only consider (re-)enabling/disabling caching if we've seen at least
+	 * a threshold number of tuples since the last time we enabled/disabled
+	 * it.
+	 */
+	if (pd->n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS)
+		return;
+
+	/* Wouldn't get called if the cached bound offset worked. */
+	Assert(offset != pd->cached_bound_offset);
+
+	/* If the offset didn't change at all, caching it might be a good idea. */
+	if (pd->n_offset_changed == 0 ||
+		(double) pd->n_tups_inserted / pd->n_offset_changed > 1)
+		pd->cached_bound_offset = offset;
+	else
+		pd->cached_bound_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+}
+
+/*
+ * get_cached_{list|range}_partition
+ *		Computes if the cached bound offset value, if any, is satisfied by
+ *		the tuple specified in 'values' and it is, returns the index of
+ *		the partition corresponding to that bound
+ *
+ * Callers must ensure that none of the elements of 'values' is NULL.
+ */
+static inline int
+get_cached_list_partition(PartitionDispatch pd,
+						  PartitionBoundInfo boundinfo,
+						  PartitionKey key,
+						  Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum	bound_datum = boundinfo->datums[cached_off][0];
+		int32	cmpval;
+
+		cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+												 key->partcollation[0],
+												 bound_datum,
+												 values[0]));
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off];
+	}
+
+	return part_index;
+}
+
+static inline int
+get_cached_range_partition(PartitionDispatch pd,
+						   PartitionBoundInfo boundinfo,
+						   PartitionKey key,
+						   Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum   *bound_datums = boundinfo->datums[cached_off];
+		PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+		int32	cmpval;
+
+		/* Check if the value is above the low bound */
+		cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+											key->partcollation,
+											bound_datums,
+											bound_kind,
+											values,
+											key->partnatts);
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off + 1];
+		else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+		{
+			/* Check if the value is below the high bound */
+			bound_datums = boundinfo->datums[cached_off + 1];
+			bound_kind = boundinfo->kind[cached_off + 1];
+			cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+												key->partcollation,
+												bound_datums,
+												bound_kind,
+												values,
+												key->partnatts);
+
+			if (cmpval > 0)
+				part_index = boundinfo->indexes[cached_off + 1];
+		}
+	}
+
+	return part_index;
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1248,6 +1392,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	pd->n_tups_inserted += 1;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1272,14 +1418,24 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				part_index = get_cached_list_partition(pd, boundinfo, key,
+													   values);
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1460,28 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
-
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = get_cached_range_partition(pd, boundinfo,
+															key, values);
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
 				}
 			}
 			break;
-- 
2.24.1

#42Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#41)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Wed, Jun 16, 2021 at 4:27 PM Amit Langote <amitlangote09@gmail.com> wrote:

On Mon, Jun 7, 2021 at 8:38 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

The rough idea is to check the average batch number every 1000 rows.
If the average batch num is greater than 1, then we enable the cache check,
if not, disable cache check. This is similar to what 0d5f05cde0 did.

Thanks for sharing the idea and writing a patch for it.

I considered a simpler heuristic where we enable/disable caching of a
given offset if it is found by the binary search algorithm at least N
consecutive times. But your idea to check the ratio of the number of
tuples inserted over partition/bound offset changes every N tuples
inserted may be more adaptive.

Please find attached a revised version of your patch, where I tried to
make it a bit easier to follow, hopefully. While doing so, I realized
that caching the bound offset across queries makes little sense now,
so I decided to keep the changes confined to execPartition.c. Do you
have a counter-argument to that?

Attached a slightly revised version of that patch, with a commit
message this time.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

v6-0001-Teach-get_partition_for_tuple-to-cache-bound-offs.patch (application/octet-stream)
From e74a22ea5debe9d549864069dfb95bb20cfef30a Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 15 Jun 2021 16:21:48 +0900
Subject: [PATCH v6] Teach get_partition_for_tuple to cache bound offset

For bulk loads into list and range partitioned tables, it can be
very likely that long runs of consecutive tuples route to the same
partition.  In such cases, we can avoid the overhead of performing
a binary search for each such tuple by caching the offset of the
bound for that partition and checking that the bound indeed satisfies
any subsequent tuples, which can be implemented with fewer comparisons
than the binary search.

To avoid impacting the cases where such caching can be unproductive,
an adaptive algorithm is used to determine whether to actually
enable caching or to disable it if checking the cached offset seems
to add pure overhead per tuple.

Author: Hou Zhijie
Author: Amit Langote
---
 src/backend/executor/execPartition.c | 209 ++++++++++++++++++++++++---
 1 file changed, 187 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..d213ee1a6b 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,13 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * cached_bound_offset
+ * last_seen_offset
+ * n_tups_inserted
+ * n_offset_changed
+ *		Fields to manage the state for bound offset caching; see
+ *		maybe_cache_partition_bound_offset()
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +157,12 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	int			cached_bound_offset;
+	int			last_seen_offset;
+	int			n_tups_inserted;
+	int			n_offset_changed;
+
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -1026,6 +1039,10 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+
+	pd->cached_bound_offset = pd->last_seen_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1231,6 +1248,134 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must be
+ * enabled or disabled for subsequent tuples.
+ */
+#define	CACHE_BOUND_OFFSET_THRESHOLD_TUPS	1000
+
+/*
+ * maybe_cache_partition_bound_offset
+ *		Conditionally sets pd->cached_bound_offset so that
+ *		get_cached_{list|range}_partition can be used for subsequent
+ *		tuples
+ *
+ * It is set if it appears that some offsets observed over the last
+ * pd->n_tups_inserted tuples would have been reused, which can be inferred
+ * from seeing that the ratio of tuples inserted and the number of times the
+ * offset needed to be changed during the insertion of those tuples is greater
+ * than 1.  Conversely, we disable the caching if it the ratio is 1, because
+ * it suggests that every consecutive tuple mapped to a different partition.
+ */
+static inline void
+maybe_cache_partition_bound_offset(PartitionDispatch pd, int offset)
+{
+	if (offset != pd->last_seen_offset)
+	{
+		pd->last_seen_offset = offset;
+		pd->n_offset_changed += 1;
+		/* Only set to the new value after calculating the ratio formula. */
+		pd->cached_bound_offset = -1;
+	}
+
+	/*
+	 * Only consider (re-)enabling/disabling caching if we've seen at least
+	 * a threshold number of tuples since the last time we enabled/disabled
+	 * it.
+	 */
+	if (pd->n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS)
+		return;
+
+	/* Wouldn't get called if the cached bound offset worked. */
+	Assert(offset != pd->cached_bound_offset);
+
+	/* If the offset didn't change at all, caching it might be a good idea. */
+	if (pd->n_offset_changed == 0 ||
+		(double) pd->n_tups_inserted / pd->n_offset_changed > 1)
+		pd->cached_bound_offset = offset;
+	else
+		pd->cached_bound_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+}
+
+/*
+ * get_cached_{list|range}_partition
+ *		Computes if the cached bound offset value, if any, is satisfied by
+ *		the tuple specified in 'values' and it is, returns the index of
+ *		the partition corresponding to that bound
+ *
+ * Callers must ensure that none of the elements of 'values' is NULL.
+ */
+static inline int
+get_cached_list_partition(PartitionDispatch pd,
+						  PartitionBoundInfo boundinfo,
+						  PartitionKey key,
+						  Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum	bound_datum = boundinfo->datums[cached_off][0];
+		int32	cmpval;
+
+		cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+												 key->partcollation[0],
+												 bound_datum,
+												 values[0]));
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off];
+	}
+
+	return part_index;
+}
+
+static inline int
+get_cached_range_partition(PartitionDispatch pd,
+						   PartitionBoundInfo boundinfo,
+						   PartitionKey key,
+						   Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum   *bound_datums = boundinfo->datums[cached_off];
+		PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+		int32	cmpval;
+
+		/* Check if the value is above the low bound */
+		cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+											key->partcollation,
+											bound_datums,
+											bound_kind,
+											values,
+											key->partnatts);
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off + 1];
+		else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+		{
+			/* Check if the value is below the high bound */
+			bound_datums = boundinfo->datums[cached_off + 1];
+			bound_kind = boundinfo->kind[cached_off + 1];
+			cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+												key->partcollation,
+												bound_datums,
+												bound_kind,
+												values,
+												key->partnatts);
+
+			if (cmpval > 0)
+				part_index = boundinfo->indexes[cached_off + 1];
+		}
+	}
+
+	return part_index;
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1248,6 +1393,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	pd->n_tups_inserted += 1;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1272,14 +1419,24 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				part_index = get_cached_list_partition(pd, boundinfo, key,
+													   values);
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1461,28 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
-
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = get_cached_range_partition(pd, boundinfo,
+															key, values);
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
 				}
 			}
 			break;
-- 
2.24.1

#43Zhihong Yu
zyu@yugabyte.com
In reply to: Amit Langote (#42)
Re: Skip partition tuple routing with constant partition key

On Wed, Jun 16, 2021 at 9:29 PM Amit Langote <amitlangote09@gmail.com>
wrote:

On Wed, Jun 16, 2021 at 4:27 PM Amit Langote <amitlangote09@gmail.com>
wrote:

On Mon, Jun 7, 2021 at 8:38 PM houzj.fnst@fujitsu.com
<houzj.fnst@fujitsu.com> wrote:

The rough idea is to check the average batch number every 1000 rows.
If the average batch num is greater than 1, then we enable the cache

check,

if not, disable cache check. This is similar to what 0d5f05cde0 did.

Thanks for sharing the idea and writing a patch for it.

I considered a simpler heuristic where we enable/disable caching of a
given offset if it is found by the binary search algorithm at least N
consecutive times. But your idea to check the ratio of the number of
tuples inserted over partition/bound offset changes every N tuples
inserted may be more adaptive.

Please find attached a revised version of your patch, where I tried to
make it a bit easier to follow, hopefully. While doing so, I realized
that caching the bound offset across queries makes little sense now,
so I decided to keep the changes confined to execPartition.c. Do you
have a counter-argument to that?

Attached a slightly revised version of that patch, with a commit
message this time.

--
Amit Langote
EDB: http://www.enterprisedb.com

Hi,

+ int n_tups_inserted;
+ int n_offset_changed;

Since 'tups' is plural, maybe 'offset' should be as well: offsets.

+               part_index = get_cached_list_partition(pd, boundinfo, key,
+                                                      values);

nit: either put values on the same line, or align the 4 parameters on different lines.

+                   if (part_index < 0)
+                   {
+                       bound_offset = partition_range_datum_bsearch(key->partsupfunc,

Do we need to check the value of equal before computing part_index ?

Cheers

#44Amit Langote
amitlangote09@gmail.com
In reply to: Zhihong Yu (#43)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

Hi,

Thanks for reading the patch.

On Thu, Jun 17, 2021 at 1:46 PM Zhihong Yu <zyu@yugabyte.com> wrote:

On Wed, Jun 16, 2021 at 9:29 PM Amit Langote <amitlangote09@gmail.com> wrote:

Attached a slightly revised version of that patch, with a commit
message this time.

+ int n_tups_inserted;
+ int n_offset_changed;

Since tuples appear in plural, maybe offset should be as well: offsets.

I was hoping one would read that as "the number of times the offset
changed" while inserting "that many tuples", so the singular form
makes more sense to me.

Actually, I even considered naming the variable n_offsets_seen, in
which case the plural form makes sense, but I chose not to go with
that name.

+               part_index = get_cached_list_partition(pd, boundinfo, key,
+                                                      values);

nit:either put values on the same line, or align the 4 parameters on different lines.

Not sure pgindent requires us to follow that style, but I too prefer
the way you suggest. It does make the patch a bit longer though.

+                   if (part_index < 0)
+                   {
+                       bound_offset = partition_range_datum_bsearch(key->partsupfunc,

Do we need to check the value of equal before computing part_index ?

Just in case you didn't notice, this is not new code, but appears as a
diff hunk due to indenting.

As for whether the code should be checking 'equal', I don't think the
logic at this particular site should do that. Requiring 'equal' to be
true would mean that this code would only accept tuples that exactly
match the bound that partition_range_datum_bsearch() returned.
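The distinction matters because the cached-offset fast path accepts any tuple that falls between the cached lower bound and the next bound, not only exact matches. A simplified, hypothetical Python model of the check that get_cached_range_partition performs, assuming a single-column key with plain integer bounds and eliding the real code's indexes[] mapping from bound slots to partitions:

```python
def cached_range_lookup(bounds, cached_off, value):
    """Return the cached bound slot if 'value' still falls in the range
    that starts at bounds[cached_off]; return -1 to signal that the
    caller must fall back to the full binary search."""
    if cached_off < 0:
        return -1                       # caching currently disabled
    low = bounds[cached_off]
    if value == low:
        # Exactly on the cached lower bound: same range as before.
        return cached_off
    if low < value and cached_off + 1 < len(bounds):
        # Above the cached lower bound: still the same range only if
        # the value is below the next bound.
        if value < bounds[cached_off + 1]:
            return cached_off
    return -1

bounds = [0, 10, 20, 30]                 # lower bounds of four ranges
assert cached_range_lookup(bounds, 1, 15) == 1    # 10 <= 15 < 20: hit
assert cached_range_lookup(bounds, 1, 10) == 1    # equal to the low bound: hit
assert cached_range_lookup(bounds, 1, 25) == -1   # past the high bound: re-search
assert cached_range_lookup(bounds, -1, 15) == -1  # nothing cached
```

Note that, like the C code's `cached_off + 1 < boundinfo->ndatums` guard, this sketch punts to the binary search whenever the cached slot is the last bound, so it never has to reason about an open-ended upper range.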

Updated patch attached. Aside from addressing your 2nd point, I fixed
a typo in a comment.

--
Amit Langote
EDB: http://www.enterprisedb.com

Attachments:

v7-0001-Teach-get_partition_for_tuple-to-cache-bound-offs.patch (application/octet-stream)
From 4631acb275d09de01f8b297ad4d7708ddeed9e8f Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 15 Jun 2021 16:21:48 +0900
Subject: [PATCH v7] Teach get_partition_for_tuple to cache bound offset

For bulk loads into list and range partitioned tables, it can be
very likely that long runs of consecutive tuples route to the same
partition.  In such cases, we can avoid the overhead of performing
a binary search for each such tuple by caching the offset of the
bound for that partition and checking that the bound indeed satisfies
any subsequent tuples, which can be implemented with fewer comparisons
than the binary search.

To avoid impacting the cases where such caching can be unproductive,
an adaptive algorithm is used to determine whether to actually
enable caching or to disable it if checking the cached offset seems
to add pure overhead per tuple.

Author: Hou Zhijie
Author: Amit Langote
---
 src/backend/executor/execPartition.c | 214 ++++++++++++++++++++++++---
 1 file changed, 192 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 606c920b06..d52f8c4931 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,13 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * cached_bound_offset
+ * last_seen_offset
+ * n_tups_inserted
+ * n_offset_changed
+ *		Fields to manage the state for bound offset caching; see
+ *		maybe_cache_partition_bound_offset()
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +157,12 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	int			cached_bound_offset;
+	int			last_seen_offset;
+	int			n_tups_inserted;
+	int			n_offset_changed;
+
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -1026,6 +1039,10 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+
+	pd->cached_bound_offset = pd->last_seen_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1231,6 +1248,135 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must be
+ * enabled or disabled for subsequent tuples.
+ */
+#define	CACHE_BOUND_OFFSET_THRESHOLD_TUPS	1000
+
+/*
+ * maybe_cache_partition_bound_offset
+ *		Conditionally sets pd->cached_bound_offset so that
+ *		get_cached_{list|range}_partition can be used for subsequent
+ *		tuples
+ *
+ * It is set if it appears that some offsets observed over the last
+ * pd->n_tups_inserted tuples would have been reused, which can be inferred
+ * from seeing that the ratio of tuples inserted and the number of times the
+ * offset needed to be changed during the insertion of those tuples is greater
+ * than 1.  Conversely, we disable the caching if the ratio is 1, because that
+ * suggests that every consecutively inserted tuple mapped to a different
+ * partition.
+ */
+static inline void
+maybe_cache_partition_bound_offset(PartitionDispatch pd, int offset)
+{
+	if (offset != pd->last_seen_offset)
+	{
+		pd->last_seen_offset = offset;
+		pd->n_offset_changed += 1;
+		/* Only set to the new value after calculating the ratio formula. */
+		pd->cached_bound_offset = -1;
+	}
+
+	/*
+	 * Only consider (re-)enabling/disabling caching if we've seen at least
+	 * a threshold number of tuples since the last time we enabled/disabled
+	 * it.
+	 */
+	if (pd->n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS)
+		return;
+
+	/* Wouldn't get called if the cached bound offset worked. */
+	Assert(offset != pd->cached_bound_offset);
+
+	/* If the offset didn't change at all, caching it might be a good idea. */
+	if (pd->n_offset_changed == 0 ||
+		(double) pd->n_tups_inserted / pd->n_offset_changed > 1)
+		pd->cached_bound_offset = offset;
+	else
+		pd->cached_bound_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+}
+
+/*
+ * get_cached_{list|range}_partition
+ *		Computes if the cached bound offset value, if any, is satisfied by
+ *		the tuple specified in 'values' and it is, returns the index of
+ *		the partition corresponding to that bound
+ *
+ * Callers must ensure that none of the elements of 'values' is NULL.
+ */
+static inline int
+get_cached_list_partition(PartitionDispatch pd,
+						  PartitionBoundInfo boundinfo,
+						  PartitionKey key,
+						  Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum	bound_datum = boundinfo->datums[cached_off][0];
+		int32	cmpval;
+
+		cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+												 key->partcollation[0],
+												 bound_datum,
+												 values[0]));
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off];
+	}
+
+	return part_index;
+}
+
+static inline int
+get_cached_range_partition(PartitionDispatch pd,
+						   PartitionBoundInfo boundinfo,
+						   PartitionKey key,
+						   Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum   *bound_datums = boundinfo->datums[cached_off];
+		PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+		int32	cmpval;
+
+		/* Check if the value is above the low bound */
+		cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+											key->partcollation,
+											bound_datums,
+											bound_kind,
+											values,
+											key->partnatts);
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off + 1];
+		else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+		{
+			/* Check if the value is below the high bound */
+			bound_datums = boundinfo->datums[cached_off + 1];
+			bound_kind = boundinfo->kind[cached_off + 1];
+			cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+												key->partcollation,
+												bound_datums,
+												bound_kind,
+												values,
+												key->partnatts);
+
+			if (cmpval > 0)
+				part_index = boundinfo->indexes[cached_off + 1];
+		}
+	}
+
+	return part_index;
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1248,6 +1394,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	pd->n_tups_inserted += 1;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1272,14 +1420,26 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				part_index = get_cached_list_partition(pd,
+													   boundinfo,
+													   key,
+													   values);
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1464,30 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
-
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = get_cached_range_partition(pd,
+															boundinfo,
+															key,
+															values);
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
 				}
 			}
 			break;
-- 
2.24.1

#45Zhihong Yu
zyu@yugabyte.com
In reply to: Amit Langote (#44)
Re: Skip partition tuple routing with constant partition key

On Wed, Jun 16, 2021 at 10:37 PM Amit Langote <amitlangote09@gmail.com>
wrote:

Hi,

Thanks for reading the patch.

On Thu, Jun 17, 2021 at 1:46 PM Zhihong Yu <zyu@yugabyte.com> wrote:

On Wed, Jun 16, 2021 at 9:29 PM Amit Langote <amitlangote09@gmail.com>

wrote:

Attached a slightly revised version of that patch, with a commit
message this time.

+ int n_tups_inserted;
+ int n_offset_changed;

Since tuples appear in plural, maybe offset should be as well: offsets.

I was hoping one would read that as "the number of times the offset
changed" while inserting "that many tuples", so the singular form
makes more sense to me.

Actually, I even considered naming the variable n_offsets_seen, in
which case the plural form makes sense, but I chose not to go with
that name.

+ part_index = get_cached_list_partition(pd, boundinfo, key,
+                                        values);

nit: either put values on the same line, or align the 4 parameters on
different lines.

Not sure pgindent requires us to follow that style, but I too prefer
the way you suggest. It does make the patch a bit longer though.

+                   if (part_index < 0)
+                   {
+                       bound_offset = partition_range_datum_bsearch(key->partsupfunc,

Do we need to check the value of equal before computing part_index ?

Just in case you didn't notice, this is not new code, but appears as a
diff hunk due to indenting.

As for whether the code should be checking 'equal', I don't think the
logic at this particular site should do that. Requiring 'equal' to be
true would mean that this code would only accept tuples that exactly
match the bound that partition_range_datum_bsearch() returned.

Updated patch attached. Aside from addressing your 2nd point, I fixed
a typo in a comment.

--
Amit Langote
EDB: http://www.enterprisedb.com

Hi, Amit:
Thanks for the quick response.
w.r.t. the last point, since variable equal is defined within the case of
PARTITION_STRATEGY_RANGE,
I wonder if it can be named don_t_care or something like that.
That way, it would be clearer to the reader that its value is purposefully
not checked.

It is fine to leave the variable as is since this was existing code.

Cheers

#46Amit Langote
amitlangote09@gmail.com
In reply to: Zhihong Yu (#45)
Re: Skip partition tuple routing with constant partition key

On Thu, Jun 17, 2021 at 4:18 PM Zhihong Yu <zyu@yugabyte.com> wrote:

On Wed, Jun 16, 2021 at 10:37 PM Amit Langote <amitlangote09@gmail.com> wrote:

+                   if (part_index < 0)
+                   {
+                       bound_offset = partition_range_datum_bsearch(key->partsupfunc,

Do we need to check the value of equal before computing part_index ?

Just in case you didn't notice, this is not new code, but appears as a
diff hunk due to indenting.

As for whether the code should be checking 'equal', I don't think the
logic at this particular site should do that. Requiring 'equal' to be
true would mean that this code would only accept tuples that exactly
match the bound that partition_range_datum_bsearch() returned.

Hi, Amit:
Thanks for the quick response.
w.r.t. the last point, since variable equal is defined within the case of PARTITION_STRATEGY_RANGE,
I wonder if it can be named don_t_care or something like that.
That way, it would be clearer to the reader that its value is purposefully not checked.

Normally, we write a comment in such cases, like

/* The value returned in 'equal' is ignored! */

Though I forgot to do that when I first wrote this code. :(

It is fine to leave the variable as is since this was existing code.

Yeah, maybe there's not much to be gained by doing something about
that now, unless of course a committer insists that we do.

--
Amit Langote
EDB: http://www.enterprisedb.com

#47Amit Langote
amitlangote09@gmail.com
In reply to: Amit Langote (#46)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

I noticed that there is no CF entry for this, so created one in the next CF:

https://commitfest.postgresql.org/34/3270/

Rebased patch attached.

Attachments:

v8-0001-Teach-get_partition_for_tuple-to-cache-bound-offs.patch
From 1c6c5432dc6821164402d7d3c003f8ff9fc03edb Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 15 Jun 2021 16:21:48 +0900
Subject: [PATCH v8] Teach get_partition_for_tuple to cache bound offset

For bulk loads into list and range partitioned tables, it can be
very likely that long runs of consecutive tuples route to the same
partition.  In such cases, we can avoid the overhead of performing
a binary search for each such tuple by caching the offset of the
bound for that partition and checking that the bound indeed satisfies
any subsequent tuples, which can be implemented with fewer comparisons
than the binary search.

To avoid impacting the cases where such caching can be unproductive,
an adaptive algorithm is used to determine whether to actually
enable caching or to disable it if checking the cached offset seems
to add pure overhead per tuple.

Author: Hou Zhijie
Author: Amit Langote
---
 src/backend/executor/execPartition.c | 214 ++++++++++++++++++++++++---
 1 file changed, 192 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 5c723bc54e..9047fcbcd5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,13 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * cached_bound_offset
+ * last_seen_offset
+ * n_tups_inserted
+ * n_offset_changed
+ *		Fields to manage the state for bound offset caching; see
+ *		maybe_cache_partition_bound_offset()
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +157,12 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	int			cached_bound_offset;
+	int			last_seen_offset;
+	int			n_tups_inserted;
+	int			n_offset_changed;
+
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -1026,6 +1039,10 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+
+	pd->cached_bound_offset = pd->last_seen_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1231,6 +1248,135 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must be
+ * enabled or disabled for subsequent tuples.
+ */
+#define	CACHE_BOUND_OFFSET_THRESHOLD_TUPS	1000
+
+/*
+ * maybe_cache_partition_bound_offset
+ *		Conditionally sets pd->cached_bound_offset so that
+ *		get_cached_{list|range}_partition can be used for subsequent
+ *		tuples
+ *
+ * It is set if it appears that some offsets observed over the last
+ * pd->n_tups_inserted tuples would have been reused, which can be inferred
+ * from seeing that the ratio of tuples inserted and the number of times the
+ * offset needed to be changed during the insertion of those tuples is greater
+ * than 1.  Conversely, we disable the caching if the ratio is 1, because that
+ * suggests that every consecutively inserted tuple mapped to a different
+ * partition.
+ */
+static inline void
+maybe_cache_partition_bound_offset(PartitionDispatch pd, int offset)
+{
+	if (offset != pd->last_seen_offset)
+	{
+		pd->last_seen_offset = offset;
+		pd->n_offset_changed += 1;
+		/* Only set to the new value after calculating the ratio formula. */
+		pd->cached_bound_offset = -1;
+	}
+
+	/*
+	 * Only consider (re-)enabling/disabling caching if we've seen at least
+	 * a threshold number of tuples since the last time we enabled/disabled
+	 * it.
+	 */
+	if (pd->n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS)
+		return;
+
+	/* Wouldn't get called if the cached bound offset worked. */
+	Assert(offset != pd->cached_bound_offset);
+
+	/* If the offset didn't change at all, caching it might be a good idea. */
+	if (pd->n_offset_changed == 0 ||
+		(double) pd->n_tups_inserted / pd->n_offset_changed > 1)
+		pd->cached_bound_offset = offset;
+	else
+		pd->cached_bound_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+}
+
+/*
+ * get_cached_{list|range}_partition
+ *		Computes if the cached bound offset value, if any, is satisfied by
+ *		the tuple specified in 'values' and it is, returns the index of
+ *		the partition corresponding to that bound
+ *
+ * Callers must ensure that none of the elements of 'values' is NULL.
+ */
+static inline int
+get_cached_list_partition(PartitionDispatch pd,
+						  PartitionBoundInfo boundinfo,
+						  PartitionKey key,
+						  Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum	bound_datum = boundinfo->datums[cached_off][0];
+		int32	cmpval;
+
+		cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+												 key->partcollation[0],
+												 bound_datum,
+												 values[0]));
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off];
+	}
+
+	return part_index;
+}
+
+static inline int
+get_cached_range_partition(PartitionDispatch pd,
+						   PartitionBoundInfo boundinfo,
+						   PartitionKey key,
+						   Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum   *bound_datums = boundinfo->datums[cached_off];
+		PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+		int32	cmpval;
+
+		/* Check if the value is above the low bound */
+		cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+											key->partcollation,
+											bound_datums,
+											bound_kind,
+											values,
+											key->partnatts);
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off + 1];
+		else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+		{
+			/* Check if the value is below the high bound */
+			bound_datums = boundinfo->datums[cached_off + 1];
+			bound_kind = boundinfo->kind[cached_off + 1];
+			cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+												key->partcollation,
+												bound_datums,
+												bound_kind,
+												values,
+												key->partnatts);
+
+			if (cmpval > 0)
+				part_index = boundinfo->indexes[cached_off + 1];
+		}
+	}
+
+	return part_index;
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1248,6 +1394,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	pd->n_tups_inserted += 1;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1272,14 +1420,26 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				part_index = get_cached_list_partition(pd,
+													   boundinfo,
+													   key,
+													   values);
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1464,30 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
-
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = get_cached_range_partition(pd,
+															boundinfo,
+															key,
+															values);
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
 				}
 			}
 			break;
-- 
2.24.1

#48Greg Stark
stark@mit.edu
In reply to: Amit Langote (#47)
Re: Skip partition tuple routing with constant partition key

There are a whole lot of different patches in this thread.

However this last one https://commitfest.postgresql.org/37/3270/
created by Amit seems like a fairly straightforward optimization that
can be evaluated on its own separately from the others and seems quite
mature. I'm actually inclined to set it to "Ready for Committer".

Incidentally a quick read-through of the patch myself and the only
question I have is how the parameters of the adaptive algorithm were
chosen. They seem ludicrously conservative to me and a bit of simple
arguments about how expensive an extra check is versus the time saved
in the binary search should be easy enough to come up with to justify
whatever values make sense.

#49Amit Langote
amitlangote09@gmail.com
In reply to: Greg Stark (#48)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

Hi Greg,

On Wed, Mar 16, 2022 at 6:54 AM Greg Stark <stark@mit.edu> wrote:

There are a whole lot of different patches in this thread.

However this last one https://commitfest.postgresql.org/37/3270/
created by Amit seems like a fairly straightforward optimization that
can be evaluated on its own separately from the others and seems quite
mature. I'm actually inclined to set it to "Ready for Committer".

Thanks for taking a look at it.

Incidentally a quick read-through of the patch myself and the only
question I have is how the parameters of the adaptive algorithm were
chosen. They seem ludicrously conservative to me

Do you think CACHE_BOUND_OFFSET_THRESHOLD_TUPS (1000) is too high? I
suspect maybe you do.

Basically, the way this works is that once set, cached_bound_offset is
not reset until encountering a tuple for which cached_bound_offset
doesn't give the correct partition, so the threshold doesn't matter
when the caching is active. However, once reset, it is not again set
till the threshold number of tuples have been processed and that too
only if the binary searches done during that interval appear to have
returned the same bound offset in succession a number of times. Maybe
waiting a 1000 tuples to re-assess that is a bit too conservative,
yes. I guess even as small a number as 10 is fine here?

I've attached an updated version of the patch, though I haven't
changed the threshold constant.

--
Amit Langote
EDB: http://www.enterprisedb.com


Attachments:

v9-0001-Optimze-get_partition_for_tuple-by-caching-bound-.patch
From fe28c2a39a709ca5413ca94596776efa5d4e1914 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 15 Jun 2021 16:21:48 +0900
Subject: [PATCH v9] Optimze get_partition_for_tuple by caching bound offset

For bulk loads into list and range partitioned tables, it can be very
likely that long runs of consecutive tuples route to the same
partition.  In such cases, we can perform a binary search only once
to find the partition's bound and cache thus found offset.  And then
for the subsequent tuples, only check if they satisfy the bound at the
cached offset, something that's done with up to 2 comparisons, compared
to O(log num_parts) comparisons needed for the binary search.

To avoid impacting the cases where such caching can be unproductive,
it is disabled on the first tuple that no longer satisfies the cached
bound and only re-enabled if the individual bound offsets are found
to re-occur in succession over the span of a threshold number of
tuples, that is, after that many tuples have been processed.

Author: Hou Zhijie
Author: Amit Langote
---
 src/backend/executor/execPartition.c | 212 ++++++++++++++++++++++++---
 1 file changed, 190 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 90ed1485d1..742fd00066 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,13 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * cached_bound_offset
+ * last_seen_offset
+ * n_tups_inserted
+ * n_offset_changed
+ *		Fields to manage the state for bound offset caching; see
+ *		maybe_cache_partition_bound_offset()
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +157,12 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	int			cached_bound_offset;
+	int			last_seen_offset;
+	int			n_tups_inserted;
+	int			n_offset_changed;
+
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -1026,6 +1039,10 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+
+	pd->cached_bound_offset = pd->last_seen_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1231,6 +1248,133 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must be
+ * enabled for subsequent tuples.
+ */
+#define	CACHE_BOUND_OFFSET_THRESHOLD_TUPS	1000
+
+/*
+ * maybe_cache_partition_bound_offset
+ *		Conditionally sets pd->cached_bound_offset so that
+ *		get_cached_{list|range}_partition can be used for subsequent
+ *		tuples
+ *
+ * It is set if it appears that some offsets observed over the last
+ * pd->n_tups_inserted tuples would have been reused, which can be inferred
+ * from seeing that the ratio of tuples inserted and the number of times the
+ * offset needed to be changed during the insertion of those tuples is greater
+ * than 1.
+ */
+static inline void
+maybe_cache_partition_bound_offset(PartitionDispatch pd, int offset)
+{
+	if (offset != pd->last_seen_offset)
+	{
+		pd->last_seen_offset = offset;
+		pd->n_offset_changed += 1;
+		/* Only set to the new value after calculating the ratio formula. */
+		pd->cached_bound_offset = -1;
+	}
+
+	/*
+	 * Only consider (re-)enabling/disabling caching if we've seen at least
+	 * a threshold number of tuples since the last time we enabled/disabled
+	 * it.
+	 */
+	if (pd->n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS)
+		return;
+
+	/* Wouldn't get called if the cached bound offset worked. */
+	Assert(offset != pd->cached_bound_offset);
+
+	/* If the offset didn't change at all, caching it might be a good idea. */
+	if (pd->n_offset_changed == 0 ||
+		(double) pd->n_tups_inserted / pd->n_offset_changed > 1)
+		pd->cached_bound_offset = offset;
+
+	/* Reset the counters for the next run of tuples. */
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+}
+
+/*
+ * get_cached_{list|range}_partition
+ *		Computes if the cached bound offset value, if any, is satisfied by
+ *		the tuple specified in 'values' and if it is, returns the index of
+ *		the partition corresponding to that bound
+ *
+ * Callers must ensure that none of the elements of 'values' is NULL.
+ */
+static inline int
+get_cached_list_partition(PartitionDispatch pd,
+						  PartitionBoundInfo boundinfo,
+						  PartitionKey key,
+						  Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum	bound_datum = boundinfo->datums[cached_off][0];
+		int32	cmpval;
+
+		cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+												 key->partcollation[0],
+												 bound_datum,
+												 values[0]));
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off];
+	}
+
+	return part_index;
+}
+
+static inline int
+get_cached_range_partition(PartitionDispatch pd,
+						   PartitionBoundInfo boundinfo,
+						   PartitionKey key,
+						   Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum   *bound_datums = boundinfo->datums[cached_off];
+		PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+		int32	cmpval;
+
+		/* Check if the value is above the low bound */
+		cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+											key->partcollation,
+											bound_datums,
+											bound_kind,
+											values,
+											key->partnatts);
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off + 1];
+		else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+		{
+			/* Check if the value is below the high bound */
+			bound_datums = boundinfo->datums[cached_off + 1];
+			bound_kind = boundinfo->kind[cached_off + 1];
+			cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+												key->partcollation,
+												bound_datums,
+												bound_kind,
+												values,
+												key->partnatts);
+
+			if (cmpval > 0)
+				part_index = boundinfo->indexes[cached_off + 1];
+		}
+	}
+
+	return part_index;
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1248,6 +1392,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	pd->n_tups_inserted += 1;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1272,14 +1418,26 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				part_index = get_cached_list_partition(pd,
+													   boundinfo,
+													   key,
+													   values);
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1462,30 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
-
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = get_cached_range_partition(pd,
+															boundinfo,
+															key,
+															values);
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
 				}
 			}
 			break;
-- 
2.24.1

#50Zhihong Yu
zyu@yugabyte.com
In reply to: Amit Langote (#49)
Re: Skip partition tuple routing with constant partition key

On Wed, Mar 23, 2022 at 5:52 AM Amit Langote <amitlangote09@gmail.com>
wrote:

Hi Greg,

On Wed, Mar 16, 2022 at 6:54 AM Greg Stark <stark@mit.edu> wrote:

There are a whole lot of different patches in this thread.

However this last one https://commitfest.postgresql.org/37/3270/
created by Amit seems like a fairly straightforward optimization that
can be evaluated on its own separately from the others and seems quite
mature. I'm actually inclined to set it to "Ready for Committer".

Thanks for taking a look at it.

Incidentally a quick read-through of the patch myself and the only
question I have is how the parameters of the adaptive algorithm were
chosen. They seem ludicrously conservative to me

Do you think CACHE_BOUND_OFFSET_THRESHOLD_TUPS (1000) is too high? I
suspect maybe you do.

Basically, the way this works is that once set, cached_bound_offset is
not reset until encountering a tuple for which cached_bound_offset
doesn't give the correct partition, so the threshold doesn't matter
when the caching is active. However, once reset, it is not again set
till the threshold number of tuples have been processed and that too
only if the binary searches done during that interval appear to have
returned the same bound offset in succession a number of times. Maybe
waiting 1000 tuples to re-assess that is a bit too conservative,
yes. I guess even as small a number as 10 is fine here?
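The adaptive heuristic described above can be modeled with a short Python sketch. This is an editorial simplification with illustrative names and counting; the actual patch implements the logic in C in maybe_cache_partition_bound_offset().

```python
# Toy model of the adaptive bound-offset caching heuristic described
# above.  Names and the exact counting are illustrative, not the
# patch's C code.

THRESHOLD = 10  # tuples to observe before (re-)assessing caching


class BoundOffsetCache:
    def __init__(self):
        self.cached_offset = -1       # -1 means caching is disabled
        self.last_seen_offset = -1
        self.n_tuples = 0
        self.n_offset_changes = 0

    def observe(self, offset):
        """Record the bound offset the binary search returned for a tuple."""
        self.n_tuples += 1
        if offset != self.last_seen_offset:
            # Offset changed: drop the cached value immediately.
            self.last_seen_offset = offset
            self.n_offset_changes += 1
            self.cached_offset = -1
        if self.n_tuples < THRESHOLD:
            return
        # Re-enable caching only if offsets repeated within the window,
        # i.e. more tuples were seen than offset changes.
        if self.n_tuples > self.n_offset_changes:
            self.cached_offset = offset
        # Reset the counters for the next run of tuples.
        self.n_tuples = self.n_offset_changes = 0
```

With a run of identical offsets the cache switches on after the threshold, while alternating offsets keep it disabled.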

I've attached an updated version of the patch, though I haven't
changed the threshold constant.

--
Amit Langote
EDB: http://www.enterprisedb.com

On Wed, Mar 16, 2022 at 6:54 AM Greg Stark <stark@mit.edu> wrote:

There are a whole lot of different patches in this thread.

However this last one https://commitfest.postgresql.org/37/3270/
created by Amit seems like a fairly straightforward optimization that
can be evaluated on its own separately from the others and seems quite
mature. I'm actually inclined to set it to "Ready for Committer".

Incidentally, on a quick read-through of the patch myself, the only
question I have is how the parameters of the adaptive algorithm were
chosen. They seem ludicrously conservative to me and a bit of simple
arguments about how expensive an extra check is versus the time saved
in the binary search should be easy enough to come up with to justify
whatever values make sense.

Hi,

+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must
be

The first part of the comment should be:

Threshold of the number of tuples which need to have been processed

+ (double) pd->n_tups_inserted / pd->n_offset_changed > 1)

I think division can be avoided - the condition can be written as:

pd->n_tups_inserted > pd->n_offset_changed
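As a quick sanity check (an editorial aside, not PostgreSQL code), the two forms agree for non-negative integer counters as long as the divisor is positive:

```python
# Sanity check: for non-negative integer counters with a positive
# divisor, the floating-point test "a / b > 1" is equivalent to the
# integer test "a > b", so the division can indeed be avoided.
for a in range(0, 100):
    for b in range(1, 100):
        assert (a / b > 1) == (a > b)
```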

+ /* Check if the value is below the high bound */

high bound -> upper bound

Cheers

#51Amit Langote
amitlangote09@gmail.com
In reply to: Zhihong Yu (#50)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Thu, Mar 24, 2022 at 1:55 AM Zhihong Yu <zyu@yugabyte.com> wrote:

On Wed, Mar 23, 2022 at 5:52 AM Amit Langote <amitlangote09@gmail.com> wrote:

I've attached an updated version of the patch, though I haven't
changed the threshold constant.

+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must be

The first part of the comment should be:

Threshold of the number of tuples which need to have been processed

Sounds the same to me, so leaving it as it is.

+ (double) pd->n_tups_inserted / pd->n_offset_changed > 1)

I think division can be avoided - the condition can be written as:

pd->n_tups_inserted > pd->n_offset_changed

+ /* Check if the value is below the high bound */

high bound -> upper bound

Both done, thanks.

In the attached updated patch, I've also lowered the threshold number
of tuples to wait before re-enabling caching from 1000 down to 10.
AFAICT, it only makes things better for the cases in which the
proposed caching is supposed to help, while not affecting the cases in
which caching might actually make things worse.

I've repeated the benchmark mentioned in [1]:

-- creates a range-partitioned table with 1000 partitions
create unlogged table foo (a int) partition by range (a);
select 'create unlogged table foo_' || i || ' partition of foo for
values from (' || (i-1)*100000+1 || ') to (' || i*100000+1 || ');'
from generate_series(1, 1000) i;
\gexec

-- generates a 100 million record file
copy (select generate_series(1, 100000000)) to '/tmp/100m.csv' csv;

HEAD:

postgres=# copy foo from '/tmp/100m.csv' csv; truncate foo;
COPY 100000000
Time: 39445.421 ms (00:39.445)
TRUNCATE TABLE
Time: 381.570 ms
postgres=# copy foo from '/tmp/100m.csv' csv; truncate foo;
COPY 100000000
Time: 38779.235 ms (00:38.779)

Patched:

postgres=# copy foo from '/tmp/100m.csv' csv; truncate foo;
COPY 100000000
Time: 33136.202 ms (00:33.136)
TRUNCATE TABLE
Time: 394.939 ms
postgres=# copy foo from '/tmp/100m.csv' csv; truncate foo;
COPY 100000000
Time: 33914.856 ms (00:33.915)
TRUNCATE TABLE
Time: 407.451 ms

So roughly, 38 seconds with HEAD vs. 33 seconds with the patch applied.

(Curiously, the numbers for both HEAD and patched look worse this
time around, whereas they were 31 seconds with HEAD vs. 26 seconds
with patched back in May 2021. Unless that's measurement noise, maybe
something to look into.)

--
Amit Langote
EDB: http://www.enterprisedb.com

[1]: /messages/by-id/CA+HiwqFbMSLDMinPRsGQVn_gfb-bMy0J2z_rZ0-b9kSfxXF+Ag@mail.gmail.com

Attachments:

v10-0001-Optimze-get_partition_for_tuple-by-caching-bound.patch (application/octet-stream)
From c83fd58b09544b2debce1a7960f55cb252c26973 Mon Sep 17 00:00:00 2001
From: amitlan <amitlangote09@gmail.com>
Date: Tue, 15 Jun 2021 16:21:48 +0900
Subject: [PATCH v10] Optimze get_partition_for_tuple by caching bound offset

For bulk loads into list and range partitioned tables, it can be very
likely that long runs of consecutive tuples route to the same
partition.  In such cases, we can perform a binary search only once
to find the partition's bound and cache thus found offset.  And then
for the subsequent tuples, only check if they satisfy the bound at the
cached offset, something that's done with up to 2 comparisons, compared
to O(log num_parts) comparisons needed for the binary search.

To avoid impacting the cases where such caching can be unproductive,
it is disabled on the first tuple that no longer satisfies the cached
bound and only re-enabled if the individual bound offsets are found
to re-occur in succession over the span of a threshold number of
tuples, that is, after that many tuples have been processed.

Author: Hou Zhijie
Author: Amit Langote
---
 src/backend/executor/execPartition.c | 208 ++++++++++++++++++++++++---
 1 file changed, 186 insertions(+), 22 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 90ed1485d1..0d9e524026 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -133,6 +133,13 @@ struct PartitionTupleRouting
  *		routing it through this table). A NULL value is stored if no tuple
  *		conversion is required.
  *
+ * cached_bound_offset
+ * last_seen_offset
+ * n_tups_inserted
+ * n_offset_changed
+ *		Fields to manage the state for bound offset caching; see
+ *		maybe_cache_partition_bound_offset()
+ *
  * indexes
  *		Array of partdesc->nparts elements.  For leaf partitions the index
  *		corresponds to the partition's ResultRelInfo in the encapsulating
@@ -150,6 +157,12 @@ typedef struct PartitionDispatchData
 	PartitionDesc partdesc;
 	TupleTableSlot *tupslot;
 	AttrMap    *tupmap;
+
+	int			cached_bound_offset;
+	int			last_seen_offset;
+	int			n_tups_inserted;
+	int			n_offset_changed;
+
 	int			indexes[FLEXIBLE_ARRAY_MEMBER];
 }			PartitionDispatchData;
 
@@ -1026,6 +1039,10 @@ ExecInitPartitionDispatchInfo(EState *estate,
 	pd->key = RelationGetPartitionKey(rel);
 	pd->keystate = NIL;
 	pd->partdesc = partdesc;
+
+	pd->cached_bound_offset = pd->last_seen_offset = -1;
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+
 	if (parent_pd != NULL)
 	{
 		TupleDesc	tupdesc = RelationGetDescr(rel);
@@ -1231,6 +1248,129 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * Threshold of the number of tuples to need to have been processed before
+ * maybe_cache_partition_bound_offset() (re-)assesses whether caching must be
+ * enabled for subsequent tuples.
+ */
+#define	CACHE_BOUND_OFFSET_THRESHOLD_TUPS	10
+
+/*
+ * maybe_cache_partition_bound_offset
+ *		Conditionally sets pd->cached_bound_offset so that
+ *		get_cached_{list|range}_partition can be used for subsequent
+ *		tuples
+ *
+ * It is set if it appears that some offsets observed over the last
+ * pd->n_tups_inserted tuples would have been reused, which can be inferred
+ * from seeing that the number of tuples inserted is greater than the number
+ * of times the bound offsets to which they were routed changed.
+ */
+static inline void
+maybe_cache_partition_bound_offset(PartitionDispatch pd, int offset)
+{
+	/* If the offset has changed, reset the cached value. */
+	if (offset != pd->last_seen_offset)
+	{
+		pd->last_seen_offset = offset;
+		pd->n_offset_changed += 1;
+		pd->cached_bound_offset = -1;
+	}
+
+	/*
+	 * Only consider (re-) enabling caching if we've seen at least a threshold
+	 * number of tuples.
+	 */
+	if (pd->n_tups_inserted < CACHE_BOUND_OFFSET_THRESHOLD_TUPS)
+		return;
+
+	/* Wouldn't get called if the cached bound offset worked. */
+	Assert(offset != pd->cached_bound_offset);
+
+	if (pd->n_tups_inserted > pd->n_offset_changed)
+		pd->cached_bound_offset = offset;
+
+	/* Reset the counters for the next run of tuples. */
+	pd->n_tups_inserted = pd->n_offset_changed = 0;
+}
+
+/*
+ * get_cached_{list|range}_partition
+ *		Computes if the cached bound offset value, if any, is satisfied by
+ *		the tuple specified in 'values' and if it is, returns the index of
+ *		the partition corresponding to that bound
+ *
+ * Callers must ensure that none of the elements of 'values' is NULL.
+ */
+static inline int
+get_cached_list_partition(PartitionDispatch pd,
+						  PartitionBoundInfo boundinfo,
+						  PartitionKey key,
+						  Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum	bound_datum = boundinfo->datums[cached_off][0];
+		int32	cmpval;
+
+		cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+												 key->partcollation[0],
+												 bound_datum,
+												 values[0]));
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off];
+	}
+
+	return part_index;
+}
+
+static inline int
+get_cached_range_partition(PartitionDispatch pd,
+						   PartitionBoundInfo boundinfo,
+						   PartitionKey key,
+						   Datum *values)
+{
+	int		part_index = -1;
+	int		cached_off = pd->cached_bound_offset;
+
+	if (cached_off >= 0)
+	{
+		Datum   *bound_datums = boundinfo->datums[cached_off];
+		PartitionRangeDatumKind *bound_kind = boundinfo->kind[cached_off];
+		int32	cmpval;
+
+		/* Check if the value is above the lower bound */
+		cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+											key->partcollation,
+											bound_datums,
+											bound_kind,
+											values,
+											key->partnatts);
+		if (cmpval == 0)
+			part_index = boundinfo->indexes[cached_off + 1];
+		else if (cmpval < 0 && cached_off + 1 < boundinfo->ndatums)
+		{
+			/* Check if the value is below the upper bound */
+			bound_datums = boundinfo->datums[cached_off + 1];
+			bound_kind = boundinfo->kind[cached_off + 1];
+			cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+												key->partcollation,
+												bound_datums,
+												bound_kind,
+												values,
+												key->partnatts);
+
+			if (cmpval > 0)
+				part_index = boundinfo->indexes[cached_off + 1];
+		}
+	}
+
+	return part_index;
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1248,6 +1388,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	pd->n_tups_inserted += 1;
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1272,14 +1414,26 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			}
 			else
 			{
-				bool		equal = false;
-
-				bound_offset = partition_list_bsearch(key->partsupfunc,
-													  key->partcollation,
-													  boundinfo,
-													  values[0], &equal);
-				if (bound_offset >= 0 && equal)
-					part_index = boundinfo->indexes[bound_offset];
+				part_index = get_cached_list_partition(pd,
+													   boundinfo,
+													   key,
+													   values);
+				if (part_index < 0)
+				{
+					bool		equal = false;
+
+					bound_offset = partition_list_bsearch(key->partsupfunc,
+														  key->partcollation,
+														  boundinfo,
+														  values[0], &equal);
+					if (bound_offset >= 0 && equal)
+					{
+						part_index = boundinfo->indexes[bound_offset];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
+				}
 			}
 			break;
 
@@ -1304,20 +1458,30 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 
 				if (!range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
-
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					part_index = get_cached_range_partition(pd,
+															boundinfo,
+															key,
+															values);
+					if (part_index < 0)
+					{
+						bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+																	 key->partcollation,
+																	 boundinfo,
+																	 key->partnatts,
+																	 values,
+																	 &equal);
+
+						/*
+						 * The bound at bound_offset is less than or equal to the
+						 * tuple value, so the bound at offset+1 is the upper
+						 * bound of the partition we're looking for, if there
+						 * actually exists one.
+						 */
+						part_index = boundinfo->indexes[bound_offset + 1];
+						if (part_index >= 0)
+							maybe_cache_partition_bound_offset(pd,
+															   bound_offset);
+					}
 				}
 			}
 			break;
-- 
2.24.1

#52Greg Stark
stark@mit.edu
In reply to: Amit Langote (#51)
Re: Skip partition tuple routing with constant partition key

Is this a problem with the patch or its tests?

[18:14:20.798] # poll_query_until timed out executing this query:
[18:14:20.798] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE
srsubstate NOT IN ('r', 's');
[18:14:20.798] # expecting this output:
[18:14:20.798] # t
[18:14:20.798] # last actual query output:
[18:14:20.798] # f
[18:14:20.798] # with stderr:
[18:14:20.798] # Tests were run but no plan was declared and
done_testing() was not seen.
[18:14:20.798] # Looks like your test exited with 60 just after 31.
[18:14:20.798] [18:12:21] t/013_partition.pl .................
[18:14:20.798] Dubious, test returned 60 (wstat 15360, 0x3c00)
...
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)
[18:14:20.798] Non-zero exit status: 60
[18:14:20.798] Parse errors: No plan found in TAP output
[18:14:20.798] Files=32, Tests=328, 527 wallclock secs ( 0.16 usr 0.09
sys + 99.81 cusr 87.08 csys = 187.14 CPU)
[18:14:20.798] Result: FAIL

#53Amit Langote
amitlangote09@gmail.com
In reply to: Greg Stark (#52)
Re: Skip partition tuple routing with constant partition key

On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:

Is this a problem with the patch or its tests?

[18:14:20.798] # poll_query_until timed out executing this query:
[18:14:20.798] # SELECT count(1) = 0 FROM pg_subscription_rel WHERE
srsubstate NOT IN ('r', 's');
[18:14:20.798] # expecting this output:
[18:14:20.798] # t
[18:14:20.798] # last actual query output:
[18:14:20.798] # f
[18:14:20.798] # with stderr:
[18:14:20.798] # Tests were run but no plan was declared and
done_testing() was not seen.
[18:14:20.798] # Looks like your test exited with 60 just after 31.
[18:14:20.798] [18:12:21] t/013_partition.pl .................
[18:14:20.798] Dubious, test returned 60 (wstat 15360, 0x3c00)
...
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)
[18:14:20.798] Non-zero exit status: 60
[18:14:20.798] Parse errors: No plan found in TAP output
[18:14:20.798] Files=32, Tests=328, 527 wallclock secs ( 0.16 usr 0.09
sys + 99.81 cusr 87.08 csys = 187.14 CPU)
[18:14:20.798] Result: FAIL

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), and I don't see a failure on cfbot either:

http://cfbot.cputube.org/amit-langote.html

--
Amit Langote
EDB: http://www.enterprisedb.com

#54Tom Lane
tgl@sss.pgh.pa.us
In reply to: Amit Langote (#53)
Re: Skip partition tuple routing with constant partition key

Amit Langote <amitlangote09@gmail.com> writes:

On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:

Is this a problem with the patch or its tests?
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), nor do I see a failure on cfbot:
http://cfbot.cputube.org/amit-langote.html

013_partition.pl has been failing regularly in the buildfarm,
most recently here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45

I don't think there's room to blame any uncommitted patches
for that. Somebody broke it a short time before here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

regards, tom lane

#55Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#54)
Re: Skip partition tuple routing with constant partition key

Hi,

On 2022-04-06 00:07:07 -0400, Tom Lane wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:

Is this a problem with the patch or its tests?
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), nor do I see a failure on cfbot:
http://cfbot.cputube.org/amit-langote.html

013_partition.pl has been failing regularly in the buildfarm,
most recently here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45

Just failed locally on my machine as well.

I don't think there's room to blame any uncommitted patches
for that. Somebody broke it a short time before here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

The obvious thing to point a finger at is

commit c91f71b9dc91ef95e1d50d6d782f477258374fc6
Author: Tomas Vondra <tomas.vondra@postgresql.org>
Date: 2022-03-16 16:42:47 +0100

Fix publish_as_relid with multiple publications

Greetings,

Andres Freund

#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#55)
Re: Skip partition tuple routing with constant partition key

Hi,

On Thu, Apr 7, 2022 at 4:37 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-04-06 00:07:07 -0400, Tom Lane wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:

Is this a problem with the patch or its tests?
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), nor do I see a failure on cfbot:
http://cfbot.cputube.org/amit-langote.html

013_partition.pl has been failing regularly in the buildfarm,
most recently here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45

Just failed locally on my machine as well.

I don't think there's room to blame any uncommitted patches
for that. Somebody broke it a short time before here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

The obvious thing to point a finger at is

commit c91f71b9dc91ef95e1d50d6d782f477258374fc6
Author: Tomas Vondra <tomas.vondra@postgresql.org>
Date: 2022-03-16 16:42:47 +0100

Fix publish_as_relid with multiple publications

I've not managed to reproduce this issue on my machine but while
reviewing the code and the server logs[1] I may have found possible
bugs:

2022-04-08 12:59:30.701 EDT [91997:1] LOG: logical replication apply
worker for subscription "sub2" has started
2022-04-08 12:59:30.702 EDT [91998:3] 013_partition.pl LOG:
statement: ALTER SUBSCRIPTION sub2 SET PUBLICATION pub_lower_level,
pub_all
2022-04-08 12:59:30.733 EDT [91998:4] 013_partition.pl LOG:
disconnection: session time: 0:00:00.036 user=buildfarm
database=postgres host=[local]
2022-04-08 12:59:30.740 EDT [92001:1] LOG: logical replication table
synchronization worker for subscription "sub2", table "tab4_1" has
started
2022-04-08 12:59:30.744 EDT [91997:2] LOG: logical replication apply
worker for subscription "sub2" will restart because of a parameter
change
2022-04-08 12:59:30.750 EDT [92003:1] LOG: logical replication table
synchronization worker for subscription "sub2", table "tab3" has
started

The logs say that the apply worker for "sub2" finished whereas the
tablesync workers for "tab4_1" and "tab3" started. After these logs,
there are no logs that these tablesync workers finished and the apply
worker for "sub2" restarted, until the timeout. While reviewing the
code, I realized that a tablesync worker can advance its relstate
even without the apply worker's intervention.

After a tablesync worker copies the table it sets
SUBREL_STATE_SYNCWAIT to its relstate, then it waits for the apply
worker to update the relstate to SUBREL_STATE_CATCHUP. If the apply
worker has already died, it breaks from the wait loop and returns
false:

wait_for_worker_state_change():

for (;;)
{
LogicalRepWorker *worker;

:

/*
* Bail out if the apply worker has died, else signal it we're
* waiting.
*/
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
InvalidOid, false);
if (worker && worker->proc)
logicalrep_worker_wakeup_ptr(worker);
LWLockRelease(LogicalRepWorkerLock);
if (!worker)
break;

:
}

return false;

However, the caller doesn't check the return value at all:

/*
* We are done with the initial data synchronization, update the state.
*/
SpinLockAcquire(&MyLogicalRepWorker->relmutex);
MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT;
MyLogicalRepWorker->relstate_lsn = *origin_startpos;
SpinLockRelease(&MyLogicalRepWorker->relmutex);

/*
* Finally, wait until the main apply worker tells us to catch up and then
* return to let LogicalRepApplyLoop do it.
*/
wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
return slotname;

Therefore, the tablesync worker started logical replication while
keeping its relstate as SUBREL_STATE_SYNCWAIT.
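The hazard described above can be modeled with a toy Python sketch (illustrative only, not the actual C code; the function and state names merely mirror the ones quoted above):

```python
# Toy model of the race described above: the tablesync worker sets
# SUBREL_STATE_SYNCWAIT and waits for the apply worker to advance it to
# SUBREL_STATE_CATCHUP, but ignores the wait's failure result, so when
# the apply worker has died it proceeds while stuck in SYNCWAIT.

SYNCWAIT = "syncwait"
CATCHUP = "catchup"


def wait_for_worker_state_change(apply_worker_alive, state):
    """Return False if the apply worker is gone, True once it advanced us."""
    if not apply_worker_alive:
        return False
    state["relstate"] = CATCHUP
    return True


def tablesync_finish(apply_worker_alive):
    state = {"relstate": SYNCWAIT}
    # The return value is ignored here, mirroring the caller quoted above.
    wait_for_worker_state_change(apply_worker_alive, state)
    return state["relstate"]
```

In this model, tablesync_finish(False) returns with the relstate still at SYNCWAIT, which is the situation inferred from the server logs.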

Given the server logs, it's likely that both tablesync workers for
"tab4_1" and "tab3" were in this situation. That is, there were two
tablesync workers who were applying changes for the target relation
but the relstate was SUBREL_STATE_SYNCWAIT.

When it comes to starting the apply worker, it probably didn't happen
since there were already as many running tablesync workers as
max_sync_workers_per_subscription (2 by default):

logicalrep_worker_launch():

/*
* If we reached the sync worker limit per subscription, just exit
* silently as we might get here because of an otherwise harmless race
* condition.
*/
if (nsyncworkers >= max_sync_workers_per_subscription)
{
LWLockRelease(LogicalRepWorkerLock);
return;
}

This scenario seems possible in principle but I've not managed to
reproduce this issue so I might be wrong. Especially, according to the
server logs, it seems like the tablesync workers were launched before
the apply worker restarted due to parameter change and this is a
common pattern among other failure logs. But I'm not sure how it could
really happen. IIUC the apply worker always re-reads subscription (and
exits if there is parameter change) and then requests to launch
tablesync workers accordingly. Also, the fact that we don't check the
return value of wait_for_worker_state_change() is not a new thing; we
have been living with this behavior since v10. So I'm not really sure
why this problem appeared recently if my hypothesis is correct.

Regards,

[1]: https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=grassquit&dt=2022-04-08%2014%3A13%3A27&stg=subscription-check

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#57Amit Kapila
amit.kapila16@gmail.com
In reply to: Masahiko Sawada (#56)
Re: Skip partition tuple routing with constant partition key

On Tue, Apr 12, 2022 at 6:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

On Thu, Apr 7, 2022 at 4:37 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-04-06 00:07:07 -0400, Tom Lane wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:

Is this a problem with the patch or its tests?
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), nor do I see a failure on cfbot:
http://cfbot.cputube.org/amit-langote.html

013_partition.pl has been failing regularly in the buildfarm,
most recently here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45

Just failed locally on my machine as well.

I don't think there's room to blame any uncommitted patches
for that. Somebody broke it a short time before here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

The obvious thing to point a finger at is

commit c91f71b9dc91ef95e1d50d6d782f477258374fc6
Author: Tomas Vondra <tomas.vondra@postgresql.org>
Date: 2022-03-16 16:42:47 +0100

Fix publish_as_relid with multiple publications

I've not managed to reproduce this issue on my machine but while
reviewing the code and the server logs[1] I may have found possible
bugs:

2022-04-08 12:59:30.701 EDT [91997:1] LOG: logical replication apply
worker for subscription "sub2" has started
2022-04-08 12:59:30.702 EDT [91998:3] 013_partition.pl LOG:
statement: ALTER SUBSCRIPTION sub2 SET PUBLICATION pub_lower_level,
pub_all
2022-04-08 12:59:30.733 EDT [91998:4] 013_partition.pl LOG:
disconnection: session time: 0:00:00.036 user=buildfarm
database=postgres host=[local]
2022-04-08 12:59:30.740 EDT [92001:1] LOG: logical replication table
synchronization worker for subscription "sub2", table "tab4_1" has
started
2022-04-08 12:59:30.744 EDT [91997:2] LOG: logical replication apply
worker for subscription "sub2" will restart because of a parameter
change
2022-04-08 12:59:30.750 EDT [92003:1] LOG: logical replication table
synchronization worker for subscription "sub2", table "tab3" has
started

The logs say that the apply worker for "sub2" finished whereas the
tablesync workers for "tab4_1" and "tab3" started. After these logs,
there are no logs that these tablesync workers finished and the apply
worker for "sub2" restarted, until the timeout. While reviewing the
code, I realized that the tablesync workers can advance its relstate
even without the apply worker intervention.

After a tablesync worker copies the table it sets
SUBREL_STATE_SYNCWAIT to its relstate, then it waits for the apply
worker to update the relstate to SUBREL_STATE_CATCHUP. If the apply
worker has already died, it breaks from the wait loop and returns
false:

wait_for_worker_state_change():

for (;;)
{
LogicalRepWorker *worker;

:

/*
* Bail out if the apply worker has died, else signal it we're
* waiting.
*/
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
InvalidOid, false);
if (worker && worker->proc)
logicalrep_worker_wakeup_ptr(worker);
LWLockRelease(LogicalRepWorkerLock);
if (!worker)
break;

:
}

return false;

However, the caller doesn't check the return value at all:

/*
* We are done with the initial data synchronization, update the state.
*/
SpinLockAcquire(&MyLogicalRepWorker->relmutex);
MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT;
MyLogicalRepWorker->relstate_lsn = *origin_startpos;
SpinLockRelease(&MyLogicalRepWorker->relmutex);

/*
* Finally, wait until the main apply worker tells us to catch up and then
* return to let LogicalRepApplyLoop do it.
*/
wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
return slotname;

Therefore, the tablesync worker started logical replication while
keeping its relstate as SUBREL_STATE_SYNCWAIT.

Given the server logs, it's likely that both tablesync workers for
"tab4_1" and "tab3" were in this situation. That is, there were two
tablesync workers who were applying changes for the target relation
but the relstate was SUBREL_STATE_SYNCWAIT.

When it comes to starting the apply worker, probably it didn't happen
since there are already running tablesync workers as much as
max_sync_workers_per_subscription (2 by default):

logicalrep_worker_launch():

/*
* If we reached the sync worker limit per subscription, just exit
* silently as we might get here because of an otherwise harmless race
* condition.
*/
if (nsyncworkers >= max_sync_workers_per_subscription)
{
LWLockRelease(LogicalRepWorkerLock);
return;
}

This scenario seems possible in principle but I've not managed to
reproduce this issue so I might be wrong.

This is exactly the same analysis I have done in the original thread
where that patch was committed. I have found some crude ways to
reproduce it with a different test as well. See emails [1], [2], and [3].

Especially, according to the
server logs, it seems like the tablesync workers were launched before
the apply worker restarted due to parameter change and this is a
common pattern among other failure logs. But I'm not sure how it could
really happen. IIUC the apply worker always re-reads subscription (and
exits if there is parameter change) and then requests to launch
tablesync workers accordingly.

Is there any rule/documentation which ensures that we must re-read the
subscription parameter change before trying to launch sync workers?

Actually, it would be better if we discuss this problem on another
thread [1] to avoid hijacking this thread. So, it would be good if you
respond there with your thoughts. Thanks for looking into this.

[1]: /messages/by-id/CAA4eK1LpBFU49Ohbnk=dv_v9YP+Kqh1+Sf8i++_s-QhD1Gy4Qw@mail.gmail.com
[2]: /messages/by-id/CAA4eK1JzzoE61CY1qi9Vcdi742JFwG4YA3XpoMHwfKNhbFic6g@mail.gmail.com
[3]: /messages/by-id/CAA4eK1JcQRQw0G-U4A+vaGaBWSvggYMMDJH4eDtJ0Yf2eUYXyA@mail.gmail.com

--
With Regards,
Amit Kapila.

#58Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Amit Kapila (#57)
Re: Skip partition tuple routing with constant partition key

On Wed, Apr 13, 2022 at 2:39 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Apr 12, 2022 at 6:16 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

Hi,

On Thu, Apr 7, 2022 at 4:37 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2022-04-06 00:07:07 -0400, Tom Lane wrote:

Amit Langote <amitlangote09@gmail.com> writes:

On Sun, Apr 3, 2022 at 10:31 PM Greg Stark <stark@mit.edu> wrote:

Is this a problem with the patch or its tests?
[18:14:20.798] Test Summary Report
[18:14:20.798] -------------------
[18:14:20.798] t/013_partition.pl (Wstat: 15360 Tests: 31 Failed: 0)

Hmm, make check-world passes for me after rebasing the patch (v10) to
the latest HEAD (clean), nor do I see a failure on cfbot:
http://cfbot.cputube.org/amit-langote.html

013_partition.pl has been failing regularly in the buildfarm,
most recently here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=florican&dt=2022-03-31%2000%3A49%3A45

Just failed locally on my machine as well.

I don't think there's room to blame any uncommitted patches
for that. Somebody broke it a short time before here:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=wrasse&dt=2022-03-17%2016%3A08%3A19

The obvious thing to point a finger at is

commit c91f71b9dc91ef95e1d50d6d782f477258374fc6
Author: Tomas Vondra <tomas.vondra@postgresql.org>
Date: 2022-03-16 16:42:47 +0100

Fix publish_as_relid with multiple publications

I've not managed to reproduce this issue on my machine but while
reviewing the code and the server logs[1] I may have found possible
bugs:

2022-04-08 12:59:30.701 EDT [91997:1] LOG: logical replication apply
worker for subscription "sub2" has started
2022-04-08 12:59:30.702 EDT [91998:3] 013_partition.pl LOG:
statement: ALTER SUBSCRIPTION sub2 SET PUBLICATION pub_lower_level,
pub_all
2022-04-08 12:59:30.733 EDT [91998:4] 013_partition.pl LOG:
disconnection: session time: 0:00:00.036 user=buildfarm
database=postgres host=[local]
2022-04-08 12:59:30.740 EDT [92001:1] LOG: logical replication table
synchronization worker for subscription "sub2", table "tab4_1" has
started
2022-04-08 12:59:30.744 EDT [91997:2] LOG: logical replication apply
worker for subscription "sub2" will restart because of a parameter
change
2022-04-08 12:59:30.750 EDT [92003:1] LOG: logical replication table
synchronization worker for subscription "sub2", table "tab3" has
started

The logs say that the apply worker for "sub2" finished whereas the
tablesync workers for "tab4_1" and "tab3" started. After these logs,
there are no logs that these tablesync workers finished and the apply
worker for "sub2" restarted, until the timeout. While reviewing the
code, I realized that the tablesync workers can advance their relstate
even without the apply worker's intervention.

After a tablesync worker copies the table it sets
SUBREL_STATE_SYNCWAIT to its relstate, then it waits for the apply
worker to update the relstate to SUBREL_STATE_CATCHUP. If the apply
worker has already died, it breaks from the wait loop and returns
false:

wait_for_worker_state_change():

for (;;)
{
LogicalRepWorker *worker;

:

/*
* Bail out if the apply worker has died, else signal it we're
* waiting.
*/
LWLockAcquire(LogicalRepWorkerLock, LW_SHARED);
worker = logicalrep_worker_find(MyLogicalRepWorker->subid,
InvalidOid, false);
if (worker && worker->proc)
logicalrep_worker_wakeup_ptr(worker);
LWLockRelease(LogicalRepWorkerLock);
if (!worker)
break;

:
}

return false;

However, the caller doesn't check the return value at all:

/*
* We are done with the initial data synchronization, update the state.
*/
SpinLockAcquire(&MyLogicalRepWorker->relmutex);
MyLogicalRepWorker->relstate = SUBREL_STATE_SYNCWAIT;
MyLogicalRepWorker->relstate_lsn = *origin_startpos;
SpinLockRelease(&MyLogicalRepWorker->relmutex);

/*
* Finally, wait until the main apply worker tells us to catch up and then
* return to let LogicalRepApplyLoop do it.
*/
wait_for_worker_state_change(SUBREL_STATE_CATCHUP);
return slotname;

Therefore, the tablesync worker started logical replication while
keeping its relstate as SUBREL_STATE_SYNCWAIT.
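The race described above can be sketched as a toy model. The names here (`wait_for_state_change`, `buggy_tablesync_finish`, the state constants) only mirror the real PostgreSQL functions for illustration; the point is that ignoring the boolean result lets the worker proceed while still in SYNCWAIT:

```c
#include <assert.h>
#include <stdbool.h>

/* toy model of the relstate handoff described above; not the real code */
enum relstate_val { STATE_SYNCWAIT, STATE_CATCHUP };

static enum relstate_val relstate = STATE_SYNCWAIT;
static bool apply_worker_alive = false;	/* apply worker already exited */

/*
 * Mirrors wait_for_worker_state_change(): returns false if the apply
 * worker died before it could advance our state to the expected one.
 */
static bool
wait_for_state_change(enum relstate_val expected)
{
	for (;;)
	{
		if (relstate == expected)
			return true;
		if (!apply_worker_alive)
			return false;	/* bail out: nobody will advance the state */
	}
}

/*
 * Mirrors the buggy caller: the return value is ignored, so when the
 * apply worker has died we carry on applying changes while our relstate
 * is still STATE_SYNCWAIT.
 */
static enum relstate_val
buggy_tablesync_finish(void)
{
	(void) wait_for_state_change(STATE_CATCHUP);
	return relstate;	/* may still be STATE_SYNCWAIT */
}
```

With the apply worker gone, the wait bails out immediately and the caller never notices, which matches the stuck-in-SYNCWAIT logs above.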

Given the server logs, it's likely that both tablesync workers for
"tab4_1" and "tab3" were in this situation. That is, there were two
tablesync workers who were applying changes for the target relation
but the relstate was SUBREL_STATE_SYNCWAIT.

When it comes to starting the apply worker, it probably didn't happen
since there were already as many running tablesync workers as
max_sync_workers_per_subscription (2 by default):

logicalrep_worker_launch():

/*
* If we reached the sync worker limit per subscription, just exit
* silently as we might get here because of an otherwise harmless race
* condition.
*/
if (nsyncworkers >= max_sync_workers_per_subscription)
{
LWLockRelease(LogicalRepWorkerLock);
return;
}

This scenario seems possible in principle but I've not managed to
reproduce this issue so I might be wrong.

This is exactly the same analysis I have done in the original thread
where that patch was committed. I have found some crude ways to
reproduce it with a different test as well. See emails [1][2][3].

Great. I didn't realize there was a discussion there.

Especially, according to the
server logs, it seems like the tablesync workers were launched before
the apply worker restarted due to parameter change and this is a
common pattern among other failure logs. But I'm not sure how it could
really happen. IIUC the apply worker always re-reads subscription (and
exits if there is parameter change) and then requests to launch
tablesync workers accordingly.

Is there any rule/documentation which ensures that we must re-read the
subscription parameter change before trying to launch sync workers?

No, but as far as I read the code, I could not find any such path.

Actually, it would be better if we discuss this problem on another
thread [1] to avoid hijacking this thread. So, it would be good if you
respond there with your thoughts. Thanks for looking into this.

Agreed. I'll respond there.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#59David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#51)
5 attachment(s)
Re: Skip partition tuple routing with constant partition key

I've spent some time looking at the v10 patch, and to be honest, I
don't really like the look of it :(

1. I think we should be putting the cache fields in PartitionDescData
rather than PartitionDispatch. Having them in PartitionDescData allows
caching between statements.
2. The function name maybe_cache_partition_bound_offset() fills me
with dread. It's very unconcise. I don't think anyone should ever use
that word in a function or variable name.
3. I'm not really sure why there's a field named n_tups_inserted.
That would lead me to believe that ExecFindPartition is only executed
for INSERTs. UPDATEs need to know the partition too.
4. The fields you're adding to PartitionDispatch are very poorly
documented. I'm not really sure what n_offset_changed means. Why
can't you just keep track by recording the last used partition, the
last index into the datum array, and then just a count of the number
of times we've found the last used partition in a row? When the found
partition does not match the last partition, just reset the counter
and when the counter reaches the cache threshold, use the cache path.
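The scheme in point 4 can be sketched outside the executor. The names and the threshold below are illustrative only (the real code works over PartitionBoundInfo, not an int array); the sketch just shows the last-partition-plus-streak-counter idea:

```c
#include <assert.h>
#include <limits.h>

#define CACHE_THRESHOLD 16	/* same idea as PARTITION_CACHED_FIND_THRESHOLD */

/* toy cache: last partition found plus a consecutive-hit counter */
typedef struct PartCache
{
	int			last_part;	/* -1 if nothing found yet */
	int			hit_count;
} PartCache;

/* expensive path: find partition p such that value < upper[p] (RANGE-style) */
static int
part_search(const int *upper, int nparts, int value)
{
	for (int p = 0; p < nparts; p++)
		if (value < upper[p])
			return p;
	return -1;
}

/* cheap path: does value still fall inside the last found partition? */
static int
part_check_cached(const int *upper, int part, int value)
{
	int			lower = (part == 0) ? INT_MIN : upper[part - 1];

	return value >= lower && value < upper[part];
}

static int
find_partition(PartCache *cache, const int *upper, int nparts, int value)
{
	int			part;

	/* once the streak hits the threshold, try the cheap bound check first */
	if (cache->hit_count >= CACHE_THRESHOLD &&
		part_check_cached(upper, cache->last_part, value))
		return cache->last_part;	/* cache hit: skip the search */

	part = part_search(upper, nparts, value);

	/* bump the streak on a repeat, otherwise restart it at 1 */
	cache->hit_count = (part == cache->last_part) ? cache->hit_count + 1 : 1;
	cache->last_part = part;
	return part;
}
```

For a stream of tuples that keep landing in one partition, the cost per lookup drops to a single bound comparison; when the partition changes every time, the only extra work is updating the two cache fields.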

I've taken a go at rewriting this, from scratch, into what I think it
should look like. I then looked at what I came up with and decided
the logic for finding partitions should all be kept in a single
function. That way there's much less chance of someone forgetting to
update the double-checking logic during cache hits when they update
the logic for finding partitions without the cache.

The 0001 patch is my original attempt. I then rewrote it and came up
with 0002 (applies on top of 0001).

After writing a benchmark script, I noticed that the performance of
0002 was quite a bit worse than 0001. I noticed that the benchmark
where the partition changes each time got much worse with 0002. I can
only assume that's due to the increased code size, so I played around
with likely() and unlikely() to see if I could use those to shift the
code layout around in such a way as to make 0002 faster. Surprisingly,
using likely() for the cache hit path made it faster. I'd have assumed
it would be unlikely() that would work.
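For reference, these hints are thin wrappers around GCC/Clang's __builtin_expect, which nudges the compiler to lay out the hinted branch on the fall-through (hot) path. A minimal sketch (the lookup function is made up for illustration):

```c
#include <assert.h>

/* GCC/Clang branch-layout hints; plain pass-throughs on other compilers */
#if defined(__GNUC__) || defined(__clang__)
#define likely(x)	__builtin_expect((x) != 0, 1)
#define unlikely(x)	__builtin_expect((x) != 0, 0)
#else
#define likely(x)	(x)
#define unlikely(x)	(x)
#endif

/*
 * Hinting the cache-hit branch as likely() keeps it on the straight-line
 * code path, which appears to be what helped in the benchmark above.
 */
static int
lookup(int cached, int cached_part, int searched_part)
{
	if (likely(cached))
		return cached_part;		/* hot path: cache hit */
	return searched_part;		/* cold path: full search */
}
```

The hint has no effect on results, only on code placement, which is why its impact only shows up as a small timing difference.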

cache_partition_bench.png shows the results. I tested with master @
a5f9f1b88. The "Amit" column is your v10 patch.
copybench.sh is the script I used to run the benchmarks. This tries
all 3 partitioning strategies and performs 2 COPY FROMs, one with the
rows arriving in partition order and another where the next row always
goes into a different partition. I'm expecting to see the "ordered"
case get better for LIST and RANGE partitions and the "unordered" case
not to get any slower.

With all of the attached patches applied, it does seem like I've
managed to speed up all of the unordered cases slightly.
This might be noise, but I did manage to remove some redundant code
that needlessly checked if the HASH partitioned table had a DEFAULT
partition, which it cannot. This may account for some of the increase
in performance.

I do need to stare at the patch a bit more before I'm confident that
it's correct. I just wanted to share it before I go and do that.

David

Attachments:

v11-0001-WIP-Cache-last-used-partition-in-PartitionDesc.patchtext/plain; charset=US-ASCII; name=v11-0001-WIP-Cache-last-used-partition-in-PartitionDesc.patchDownload
From c482323feabae476f92accb696dc1efc5628d589 Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Wed, 14 Jul 2021 00:34:18 +1200
Subject: [PATCH v11 1/3] WIP: Cache last used partition in PartitionDesc

---
 src/backend/executor/execPartition.c | 195 ++++++++++++++++++++++++++-
 src/backend/partitioning/partdesc.c  |   6 +
 src/include/partitioning/partdesc.h  |  11 ++
 3 files changed, 211 insertions(+), 1 deletion(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e03ea27299..aca08791e9 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -176,6 +176,8 @@ static void FormPartitionKeyDatum(PartitionDispatch pd,
 								  EState *estate,
 								  Datum *values,
 								  bool *isnull);
+static int	get_partition_for_tuple_using_cache(PartitionDispatch pd,
+												Datum *values, bool *isnull);
 static int	get_partition_for_tuple(PartitionDispatch pd, Datum *values,
 									bool *isnull);
 static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
@@ -318,7 +320,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 		 * these values, error out.
 		 */
 		if (partdesc->nparts == 0 ||
-			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
+			(partidx = get_partition_for_tuple_using_cache(dispatch, values, isnull)) < 0)
 		{
 			char	   *val_desc;
 
@@ -1332,6 +1334,191 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * find_last_partition_for_tuple
+ *		Checks if 'values' and 'isnull' matches the last found partition and
+ *		returns the partition index of that partition or -1 if the given
+ *		values don't belong to the last found partition.
+ *
+ * Note: If calculating the correct partition is just as cheap as checking if
+ * these values belong to the last partition, here we just calculate the
+ * correct partition for the given values.  This is the case for HASH
+ * partitioning and for LIST partitioning with a NULL value.
+ */
+static inline int
+find_last_partition_for_tuple(PartitionDispatch pd, PartitionDesc partdesc,
+							  Datum *values, bool *isnull)
+{
+	PartitionKey key;
+	PartitionBoundInfo boundinfo;
+
+	/* No last partition? No match then. */
+	if (partdesc->last_found_part_index == -1)
+		return -1;
+
+	key = pd->key;
+	boundinfo = partdesc->boundinfo;
+
+	switch (key->strategy)
+	{
+		case PARTITION_STRATEGY_HASH:
+			{
+				uint64		rowHash;
+
+				rowHash = compute_partition_hash_value(key->partnatts,
+													   key->partsupfunc,
+													   key->partcollation,
+													   values, isnull);
+
+				/* Just calculate the correct partition and return it */
+				return boundinfo->indexes[rowHash % boundinfo->nindexes];
+			}
+
+		case PARTITION_STRATEGY_LIST:
+			if (isnull[0])
+			{
+				/* Just return the NULL partition, if there is one */
+				return boundinfo->null_index;
+			}
+			else
+			{
+				int			last_datum_offset = partdesc->last_found_datum_index;
+				Datum		lastDatum = boundinfo->datums[last_datum_offset][0];
+				int32		cmpval;
+
+				cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+														 key->partcollation[0],
+														 lastDatum,
+														 values[0]));
+
+				if (cmpval == 0)
+					return boundinfo->indexes[last_datum_offset];
+				break;
+			}
+
+		case PARTITION_STRATEGY_RANGE:
+			{
+				int			last_datum_offset = partdesc->last_found_datum_index;
+				Datum	   *lastDatums = boundinfo->datums[last_datum_offset];
+				PartitionRangeDatumKind *kind = boundinfo->kind[last_datum_offset];
+				int32		cmpval;
+
+				/* Check for NULLs and abort the cache check if we find any */
+				for (int i = 0; i < key->partnatts; i++)
+				{
+					if (isnull[i])
+						return -1;
+				}
+
+				/* Check if the value is equal to the lower bound */
+				cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+													key->partcollation,
+													lastDatums,
+													kind,
+													values,
+													key->partnatts);
+
+				if (cmpval == 0)
+					return boundinfo->indexes[last_datum_offset + 1];
+
+				else if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)
+				{
+					/* Check if the value is below the upper bound */
+					lastDatums = boundinfo->datums[last_datum_offset + 1];
+					kind = boundinfo->kind[last_datum_offset + 1];
+					cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+														key->partcollation,
+														lastDatums,
+														kind,
+														values,
+														key->partnatts);
+
+					if (cmpval > 0)
+						return boundinfo->indexes[last_datum_offset + 1];
+				}
+				break;
+			}
+
+		default:
+			elog(ERROR, "unexpected partition strategy: %d",
+				 (int) key->strategy);
+	}
+
+	return -1;
+}
+
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values
+ * belong to the last found partition.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD		16
+
+/*
+ * get_partition_for_tuple_using_cache
+ *		As get_partition_for_tuple, but use caching logic and check if the
+ *		given 'values' and 'isnull' array also belong to the last found
+ *		partition.  If it does then this can save an expensive binary search
+ *		for LIST and RANGE partitioning.
+ */
+static int
+get_partition_for_tuple_using_cache(PartitionDispatch pd, Datum *values,
+									bool *isnull)
+{
+	PartitionDesc partdesc = pd->partdesc;
+	int			lastpart;
+
+	/*
+	 * When we've found that the same partition matches
+	 * PARTITION_CACHED_FIND_THRESHOLD times in a row, instead of doing a
+	 * partition search, we just check if the last partition found will also
+	 * accept these values.  If it does then that'll save us from searching
+	 * for the correct partition.
+	 */
+
+	/* Have we found the same partition enough times to use the cache? */
+	if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+	{
+		/* check if these values also belong to the last found partition */
+		lastpart = find_last_partition_for_tuple(pd, partdesc, values, isnull);
+
+		if (lastpart == -1)
+		{
+			/*
+			 * The last partition did not match.  We must fall back on a
+			 * search for the correct partition without the cache.
+			 */
+			lastpart = get_partition_for_tuple(pd, values, isnull);
+			partdesc->last_found_count = 1;
+			return lastpart;
+		}
+		else
+		{
+			/* no point in advancing last_found_count any further */
+			return lastpart;
+		}
+	}
+	else
+	{
+		int			thispart;
+
+		/*
+		 * We've not met the threshold for caching yet. Just perform a search.
+		 * get_partition_for_tuple will stash the last_found_part_index.
+		 */
+		lastpart = partdesc->last_found_part_index;
+		thispart = get_partition_for_tuple(pd, values, isnull);
+
+		/* adjust the count accordingly if the partition matched or not */
+		if (thispart == lastpart)
+			partdesc->last_found_count++;
+		else
+			partdesc->last_found_count = 1;
+
+		return thispart;
+	}
+}
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
@@ -1380,7 +1567,10 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 													  boundinfo,
 													  values[0], &equal);
 				if (bound_offset >= 0 && equal)
+				{
 					part_index = boundinfo->indexes[bound_offset];
+					partdesc->last_found_datum_index = bound_offset;
+				}
 			}
 			break;
 
@@ -1419,6 +1609,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 					 * actually exists one.
 					 */
 					part_index = boundinfo->indexes[bound_offset + 1];
+					partdesc->last_found_datum_index = bound_offset;
 				}
 			}
 			break;
@@ -1435,6 +1626,8 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	if (part_index < 0)
 		part_index = boundinfo->default_index;
 
+	partdesc->last_found_part_index = part_index;
+
 	return part_index;
 }
 
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8b6e0bd595..737f0edd89 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -290,6 +290,12 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 	{
 		oldcxt = MemoryContextSwitchTo(new_pdcxt);
 		partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+
+		/* Initialize caching fields for speeding up ExecFindPartition */
+		partdesc->last_found_datum_index = -1;
+		partdesc->last_found_part_index = -1;
+		partdesc->last_found_count = 0;
+
 		partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
 		partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index ae1afe3d78..7121ac7f7a 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,17 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+	int			last_found_datum_index; /* Index into the owning
+										 * PartitionBoundInfo's datum array
+										 * for the last found partition */
+	int			last_found_part_index;	/* Partition index of the last found
+										 * partition or -1 if none have been
+										 * found yet or if we've failed to
+										 * find one */
+	int			last_found_count;	/* Number of times in a row have we found
+									 * values to match the partition
+									 * referenced in the last_found_part_index
+									 * field */
 } PartitionDescData;
 
 
-- 
2.35.1.windows.2

cache_partition_bench.pngimage/png; name=cache_partition_bench.pngDownload
copybench.shtext/x-sh; charset=US-ASCII; name=copybench.shDownload
v11-0002-Do-partition-caching-another-way.patchtext/plain; charset=US-ASCII; name=v11-0002-Do-partition-caching-another-way.patchDownload
From f2c2a04a34ffde2942cc5b75d66eaa6b524c12bc Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Thu, 14 Jul 2022 16:56:10 +1200
Subject: [PATCH v11 2/3] Do partition caching another way

---
 src/backend/executor/execPartition.c | 357 +++++++++++----------------
 1 file changed, 151 insertions(+), 206 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index aca08791e9..7bdf78af99 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -176,8 +176,6 @@ static void FormPartitionKeyDatum(PartitionDispatch pd,
 								  EState *estate,
 								  Datum *values,
 								  bool *isnull);
-static int	get_partition_for_tuple_using_cache(PartitionDispatch pd,
-												Datum *values, bool *isnull);
 static int	get_partition_for_tuple(PartitionDispatch pd, Datum *values,
 									bool *isnull);
 static char *ExecBuildSlotPartitionKeyDescription(Relation rel,
@@ -320,7 +318,7 @@ ExecFindPartition(ModifyTableState *mtstate,
 		 * these values, error out.
 		 */
 		if (partdesc->nparts == 0 ||
-			(partidx = get_partition_for_tuple_using_cache(dispatch, values, isnull)) < 0)
+			(partidx = get_partition_for_tuple(dispatch, values, isnull)) < 0)
 		{
 			char	   *val_desc;
 
@@ -1334,195 +1332,42 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
-/*
- * find_last_partition_for_tuple
- *		Checks if 'values' and 'isnull' matches the last found partition and
- *		returns the partition index of that partition or -1 if the given
- *		values don't belong to the last found partition.
- *
- * Note: If calculating the correct partition is just as cheap as checking if
- * these values belong to the last partition, here we just calculate the
- * correct partition for the given values.  This is the case for HASH
- * partitioning and for LIST partitioning with a NULL value.
- */
-static inline int
-find_last_partition_for_tuple(PartitionDispatch pd, PartitionDesc partdesc,
-							  Datum *values, bool *isnull)
-{
-	PartitionKey key;
-	PartitionBoundInfo boundinfo;
-
-	/* No last partition? No match then. */
-	if (partdesc->last_found_part_index == -1)
-		return -1;
-
-	key = pd->key;
-	boundinfo = partdesc->boundinfo;
-
-	switch (key->strategy)
-	{
-		case PARTITION_STRATEGY_HASH:
-			{
-				uint64		rowHash;
-
-				rowHash = compute_partition_hash_value(key->partnatts,
-													   key->partsupfunc,
-													   key->partcollation,
-													   values, isnull);
-
-				/* Just calculate the correct partition and return it */
-				return boundinfo->indexes[rowHash % boundinfo->nindexes];
-			}
-
-		case PARTITION_STRATEGY_LIST:
-			if (isnull[0])
-			{
-				/* Just return the NULL partition, if there is one */
-				return boundinfo->null_index;
-			}
-			else
-			{
-				int			last_datum_offset = partdesc->last_found_datum_index;
-				Datum		lastDatum = boundinfo->datums[last_datum_offset][0];
-				int32		cmpval;
-
-				cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
-														 key->partcollation[0],
-														 lastDatum,
-														 values[0]));
-
-				if (cmpval == 0)
-					return boundinfo->indexes[last_datum_offset];
-				break;
-			}
-
-		case PARTITION_STRATEGY_RANGE:
-			{
-				int			last_datum_offset = partdesc->last_found_datum_index;
-				Datum	   *lastDatums = boundinfo->datums[last_datum_offset];
-				PartitionRangeDatumKind *kind = boundinfo->kind[last_datum_offset];
-				int32		cmpval;
-
-				/* Check for NULLs and abort the cache check if we find any */
-				for (int i = 0; i < key->partnatts; i++)
-				{
-					if (isnull[i])
-						return -1;
-				}
-
-				/* Check if the value is equal to the lower bound */
-				cmpval = partition_rbound_datum_cmp(key->partsupfunc,
-													key->partcollation,
-													lastDatums,
-													kind,
-													values,
-													key->partnatts);
-
-				if (cmpval == 0)
-					return boundinfo->indexes[last_datum_offset + 1];
-
-				else if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)
-				{
-					/* Check if the value is below the upper bound */
-					lastDatums = boundinfo->datums[last_datum_offset + 1];
-					kind = boundinfo->kind[last_datum_offset + 1];
-					cmpval = partition_rbound_datum_cmp(key->partsupfunc,
-														key->partcollation,
-														lastDatums,
-														kind,
-														values,
-														key->partnatts);
-
-					if (cmpval > 0)
-						return boundinfo->indexes[last_datum_offset + 1];
-				}
-				break;
-			}
-
-		default:
-			elog(ERROR, "unexpected partition strategy: %d",
-				 (int) key->strategy);
-	}
-
-	return -1;
-}
-
 /*
  * The number of times the same partition must be found in a row before we
  * switch from a search for the given values to just checking if the values
- * belong to the last found partition.
- */
-#define PARTITION_CACHED_FIND_THRESHOLD		16
-
-/*
- * get_partition_for_tuple_using_cache
- *		As get_partition_for_tuple, but use caching logic and check if the
- *		given 'values' and 'isnull' array also belong to the last found
- *		partition.  If it does then this can save an expensive binary search
- *		for LIST and RANGE partitioning.
+ * belong to the last found partition.  This must be above 0.
  */
-static int
-get_partition_for_tuple_using_cache(PartitionDispatch pd, Datum *values,
-									bool *isnull)
-{
-	PartitionDesc partdesc = pd->partdesc;
-	int			lastpart;
-
-	/*
-	 * When we've found that the same partition matches
-	 * PARTITION_CACHED_FIND_THRESHOLD times in a row, instead of doing a
-	 * partition search, we just check if the last partition found will also
-	 * accept these values.  If it does then that'll save us from searching
-	 * for the correct partition.
-	 */
+#define PARTITION_CACHED_FIND_THRESHOLD			16
 
-	/* Have we found the same partition enough times to use the cache? */
-	if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
-	{
-		/* check if these values also belong to the last found partition */
-		lastpart = find_last_partition_for_tuple(pd, partdesc, values, isnull);
-
-		if (lastpart == -1)
-		{
-			/*
-			 * The last partition did not match.  We must fall back on a
-			 * search for the correct partition without the cache.
-			 */
-			lastpart = get_partition_for_tuple(pd, values, isnull);
-			partdesc->last_found_count = 1;
-			return lastpart;
-		}
-		else
-		{
-			/* no point in advancing last_found_count any further */
-			return lastpart;
-		}
-	}
-	else
-	{
-		int			thispart;
-
-		/*
-		 * We've not met the threshold for caching yet. Just perform a search.
-		 * get_partition_for_tuple will stash the last_found_part_index.
-		 */
-		lastpart = partdesc->last_found_part_index;
-		thispart = get_partition_for_tuple(pd, values, isnull);
-
-		/* adjust the count accordingly if the partition matched or not */
-		if (thispart == lastpart)
-			partdesc->last_found_count++;
-		else
-			partdesc->last_found_count = 1;
-
-		return thispart;
-	}
-}
-
-/*
+ /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
- *		in values and isnull
+ *		in values and isnull.
+ *
+ * Calling this function can be quite expensive for LIST and RANGE partitioned
+ * tables have many partitions.  This is due to the binary search that's done
+ * to find the correct partition.  Many of the use cases for LIST and RANGE
+ * partitioned tables mean that the same partition is likely to be found in
+ * subsequent ExecFindPartition() calls.  This is especially true for cases
+ * such as RANGE partitioned tables on a TIMESTAMP column where the partition
+ * key is the current time.  When asked to find a partition for a RANGE or
+ * LIST partitioned table, we record the partition index we've found in the
+ * PartitionDesc (which is stored in the relcache), and if we keep finding the
+ * same partition PARTITION_CACHED_FIND_THRESHOLD times, then we'll enable
+ * caching logic and instead of performing a binary search, we'll double check
+ * that the values still belong to the last found partition, and if so, we'll
+ * return that partition index without performing the binary search.  If we
+ * fail to match the last partition when double checking, then we fall back on
+ * doing a normal search.  In this case, we'll set the number of times we've
+ * hit the partition back to 1 again so that we don't attempt to use the cache
+ * again.   For cases where the partition changes on each lookup, the amount
+ * of additional work required just amounts to recording the last found
+ * partition and setting the found counter back to 1 again.
+ *
+ * No caching of partitions is done when the last found partition is the
+ * DEFAULT partition.  In this case, we don't have sufficient information about
+ * the last found partition to confirm the Datum being looked up belongs to
+ * the DEFAULT partition.
  *
  * Return value is index of the partition (>= 0 and < partdesc->nparts) if one
  * found or -1 if none found.
@@ -1536,6 +1381,18 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	/*
+	 * In the switch statement below, when we perform a cached lookup for
+	 * RANGE and LIST partitioned tables, if we find that the last found
+	 * partition matches the 'values', we return the partition index right
+	 * away.  We do this instead of breaking out of the switch as we don't
+	 * want to execute the code about the default partition or do any updates
+	 * for any of the cache-related fields.  That would be a waste of effort
+	 * as we already know it's not the DEFAULT partition and have no need
+	 * to increment the number of times we found the same partition any
+	 * higher than PARTITION_CACHED_FIND_THRESHOLD.
+	 */
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1543,24 +1400,56 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			{
 				uint64		rowHash;
 
+				/* hash partitioning is too cheap to bother caching */
 				rowHash = compute_partition_hash_value(key->partnatts,
 													   key->partsupfunc,
 													   key->partcollation,
 													   values, isnull);
 
-				part_index = boundinfo->indexes[rowHash % boundinfo->nindexes];
+				/*
+				 * HASH partitions can't have a DEFAULT partition and we don't
+				 * do any caching work for them, so just return the part index
+				 */
+				return boundinfo->indexes[rowHash % boundinfo->nindexes];
 			}
-			break;
 
 		case PARTITION_STRATEGY_LIST:
 			if (isnull[0])
 			{
+				/* this is far too cheap to bother doing any caching */
 				if (partition_bound_accepts_nulls(boundinfo))
 					part_index = boundinfo->null_index;
 			}
 			else
 			{
-				bool		equal = false;
+				bool	equal;
+
+				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+				{
+					int			last_datum_offset = partdesc->last_found_datum_index;
+					Datum		lastDatum = boundinfo->datums[last_datum_offset][0];
+					int32		cmpval;
+
+					/*
+					 * Check if the last found datum index is the same as this
+					 * Datum.
+					 */
+					cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+															 key->partcollation[0],
+															 lastDatum,
+															 values[0]));
+
+					if (cmpval == 0)
+						return boundinfo->indexes[last_datum_offset];
+
+					/*
+					 * The Datum has changed.  Zero the number of times we've
+					 * found last_found_datum_index in a row.
+					 */
+					partdesc->last_found_count = 0;
+
+					/* fall-through and do a manual lookup */
+				}
 
 				bound_offset = partition_list_bsearch(key->partsupfunc,
 													  key->partcollation,
@@ -1593,24 +1482,65 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 					}
 				}
 
-				if (!range_partkey_has_null)
+				if (range_partkey_has_null)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
+					/* Zero the "winning streak" on the cache hit count */
+					partdesc->last_found_count = 0;
+					break;
+				}
 
-					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
-					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
-					partdesc->last_found_datum_index = bound_offset;
+				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+				{
+					int			last_datum_offset = partdesc->last_found_datum_index;
+					Datum	   *lastDatums = boundinfo->datums[last_datum_offset];
+					PartitionRangeDatumKind *kind = boundinfo->kind[last_datum_offset];
+					int32		cmpval;
+
+					/* Check if the value is equal to the lower bound */
+					cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+														key->partcollation,
+														lastDatums,
+														kind,
+														values,
+														key->partnatts);
+
+					if (cmpval == 0)
+						return boundinfo->indexes[last_datum_offset + 1];
+
+					else if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)
+					{
+						/* Check if the value is below the upper bound */
+						lastDatums = boundinfo->datums[last_datum_offset + 1];
+						kind = boundinfo->kind[last_datum_offset + 1];
+						cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+															key->partcollation,
+															lastDatums,
+															kind,
+															values,
+															key->partnatts);
+
+						if (cmpval > 0)
+							return boundinfo->indexes[last_datum_offset + 1];
+					}
+
+					/* fall-through and do a manual lookup */
 				}
+
+				bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+															 key->partcollation,
+															 boundinfo,
+															 key->partnatts,
+															 values,
+															 &equal);
+
+				/*
+				 * The bound at bound_offset is less than or equal to the
+				 * tuple value, so the bound at offset+1 is the upper
+				 * bound of the partition we're looking for, if there
+				 * actually exists one.
+				 */
+				part_index = boundinfo->indexes[bound_offset + 1];
+				partdesc->last_found_datum_index = bound_offset;
 			}
 			break;
 
@@ -1625,9 +1555,24 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	 */
 	if (part_index < 0)
 		part_index = boundinfo->default_index;
-
-	partdesc->last_found_part_index = part_index;
-
+	else
+	{
+		/*
+		 * Attend to the cache fields.  If this partition is the same as the
+		 * last partition found, then bump the count by one.  If all goes well
+		 * we'll eventually reach PARTITION_CACHED_FIND_THRESHOLD and we'll
+		 * try the cache path next time around.  If the part_index is not the
+		 * same as last time then we'll reset the cache count back to 1 and
+		 * record this partition to say we've found this one once.
+		 */
+		if (part_index == partdesc->last_found_part_index)
+			partdesc->last_found_count++;
+		else
+		{
+			partdesc->last_found_count = 1;
+			partdesc->last_found_part_index = part_index;
+		}
+	}
 	return part_index;
 }
 
-- 
2.35.1.windows.2

v11-0003-likely.patch (text/plain)
From c6acc6db6be20e72acd1a28669ea4566c8deb44d Mon Sep 17 00:00:00 2001
From: David Rowley <dgrowley@gmail.com>
Date: Thu, 14 Jul 2022 16:57:51 +1200
Subject: [PATCH v11 3/3] likely

---
 src/backend/executor/execPartition.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7bdf78af99..edacf28524 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1424,7 +1424,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			{
 				bool	equal;
 
-				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+				if (likely(partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD))
 				{
 					int			last_datum_offset = partdesc->last_found_datum_index;
 					Datum		lastDatum = boundinfo->datums[last_datum_offset][0];
@@ -1489,7 +1489,7 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 					break;
 				}
 
-				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+				if (likely(partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD))
 				{
 					int			last_datum_offset = partdesc->last_found_datum_index;
 					Datum	   *lastDatums = boundinfo->datums[last_datum_offset];
-- 
2.35.1.windows.2

#60 Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#59)
Re: Skip partition tuple routing with constant partition key

On Thu, Jul 14, 2022 at 2:31 PM David Rowley <dgrowleyml@gmail.com> wrote:

I've spent some time looking at the v10 patch, and to be honest, I
don't really like the look of it :(

Thanks for the review and sorry for the delay in replying.

1. I think we should be putting the cache fields in PartitionDescData
rather than PartitionDispatch. Having them in PartitionDescData allows
caching between statements.

Looking at your patch, yes, that makes sense. Initially, I didn't see
much point in having the ability to cache between (supposedly simple
OLTP) statements, because the tuple routing binary search is such a
minuscule portion of their execution, but now I agree why not.

2. The function name maybe_cache_partition_bound_offset() fills me
with dread. It's very unconcise. I don't think anyone should ever use
that word in a function or variable name.

Yeah, we can live without this one for sure as your patch
demonstrates, but to be fair, it's not like we don't have "maybe_"
used in variables and functions in arguably even trickier parts of our
code, like those you can find with `git grep maybe_`.

3. I'm not really sure why there's a field named n_tups_inserted.
That would lead me to believe that ExecFindPartition is only executed
for INSERTs. UPDATEs need to know the partition too.

Hmm, (cross-partition) UPDATEs internally use an INSERT that does
ExecFindPartition(). I don't see ExecUpdate() directly calling
ExecFindPartition(). Well, yes, apply_handle_tuple_routing() in a way
does, but apparently I didn't worry about that function.

4. The fields you're adding to PartitionDispatch are very poorly
documented. I'm not really sure what n_offset_changed means.

My intention with that variable was to count the number of partition
switches that happened over the course of inserting N tuples. The
theory was that if the ratio of the number of partition switches and
the number of tuples inserted is too close to 1, the dataset being
loaded is not really in an order that'd benefit from caching. That
was an attempt to get some kind of adaptability to account for the
cases where the ordering in the dataset is not consistent, but it
seems like your approach is just as adaptive. And your code is much
simpler.

Why
can't you just keep track by recording the last used partition, the
last index into the datum array, and then just a count of the number
of times we've found the last used partition in a row? When the found
partition does not match the last partition, just reset the counter
and when the counter reaches the cache threshold, use the cache path.

Yeah, it makes sense and is easier to understand.

I've taken a go at rewriting this, from scratch, into what I think it
should look like. I then looked at what I came up with and decided
the logic for finding partitions should all be kept in a single
function. That way there's much less chance of someone forgetting to
update the double-checking logic during cache hits when they update
the logic for finding partitions without the cache.

The 0001 patch is my original attempt. I then rewrote it and came up
with 0002 (applies on top of 0001).

Thanks for these patches. I've been reading and can't really find
anything to complain about at a high level.

After writing a benchmark script, I noticed that the performance of
0002 was quite a bit worse than 0001. I noticed that the benchmark
where the partition changes each time got much worse with 0002. I can
only assume that's due to the increased code size, so I played around
with likely() and unlikely() to see if I could use those to shift the
code layout around in such a way to make 0002 faster. Surprisingly
using likely() for the cache hit path made it faster. I'd have assumed
it would be unlikely() that would work.

Hmm, I too would think that unlikely() on that condition, not
likely(), would have helped the unordered case better.

cache_partition_bench.png shows the results. I tested with master @
a5f9f1b88. The "Amit" column is your v10 patch.
copybench.sh is the script I used to run the benchmarks. This tries
all 3 partitioning strategies and performs 2 COPY FROMs, one with the
rows arriving in partition order and another where the next row always
goes into a different partition. I'm expecting to see the "ordered"
case get better for LIST and RANGE partitions and the "unordered" case
not to get any slower.

With all of the attached patches applied, it does seem like I've
managed to slightly speed up all of the unordered cases slightly.
This might be noise, but I did manage to remove some redundant code
that needlessly checked if the HASH partitioned table had a DEFAULT
partition, which it cannot. This may account for some of the increase
in performance.

I do need to stare at the patch a bit more before I'm confident that
it's correct. I just wanted to share it before I go and do that.

The patch looks good to me. I thought some about whether the cache
fields in PartitionDesc may ever be "wrong". For example, the index
values becoming out-of-bound after partition DETACHes. Even though
there's some PartitionDesc-preserving cases in
RelationClearRelation(), I don't think that a preserved PartitionDesc
would ever contain a wrong value.

Here are some comments.

    PartitionBoundInfo boundinfo;   /* collection of partition bounds */
+   int         last_found_datum_index; /* Index into the owning
+                                        * PartitionBoundInfo's datum array
+                                        * for the last found partition */

What does "owning PartitionBoundInfo's" mean? Maybe the "owning" is
unnecessary?

+   int         last_found_part_index;  /* Partition index of the last found
+                                        * partition or -1 if none have been
+                                        * found yet or if we've failed to
+                                        * find one */

-1 if none *has* been...?

+   int         last_found_count;   /* Number of times in a row have we found
+                                    * values to match the partition

Number of times in a row *that we have* found.

+                   /*
+                    * The Datum has changed.  Zero the number of times we've
+                    * found last_found_datum_index in a row.
+                    */
+                   partdesc->last_found_count = 0;
+                   /* Zero the "winning streak" on the cache hit count */
+                   partdesc->last_found_count = 0;

Might it be better for the two comments to say the same thing? Also,
I wonder which one do you intend as the resetting of last_found_count:
setting it to 0 or 1? I can see that the stanza at the end of the
function sets to 1 to start a new cycle.

+                   /* Check if the value is equal to the lower bound */
+                   cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+                                                       key->partcollation,
+                                                       lastDatums,
+                                                       kind,
+                                                       values,
+                                                       key->partnatts);

The function does not merely check for equality, so maybe better to
say the following instead:

Check if the value is >= the lower bound.

Perhaps, just like you've done in the LIST stanza even mention that
the lower bound is same as the last found one, like:

Check if the value >= the last found lower bound.

And likewise, change the nearby comment that says this:

+ /* Check if the value is below the upper bound */

to say:

Now check if the value is below the corresponding [to last found lower
bound] upper bound.

+ * No caching of partitions is done when the last found partition is th

the

+ * Calling this function can be quite expensive for LIST and RANGE partitioned
+ * tables have many partitions.

having many partitions

Many of the use cases for LIST and RANGE
+ * partitioned tables mean that the same partition is likely to be found in

mean -> are such that

we record the partition index we've found in the
+ * PartitionDesc

we record the partition index we've found *for given values* in the
PartitionDesc

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

#61 David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#60)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

Thanks for looking at this.

On Sat, 23 Jul 2022 at 01:23, Amit Langote <amitlangote09@gmail.com> wrote:

+                   /*
+                    * The Datum has changed.  Zero the number of times we've
+                    * found last_found_datum_index in a row.
+                    */
+                   partdesc->last_found_count = 0;
+                   /* Zero the "winning streak" on the cache hit count */
+                   partdesc->last_found_count = 0;

Might it be better for the two comments to say the same thing? Also,
I wonder which one do you intend as the resetting of last_found_count:
setting it to 0 or 1? I can see that the stanza at the end of the
function sets to 1 to start a new cycle.

I think I've addressed all of your comments. The above one in
particular caused me to make some larger changes.

The reason I was zeroing the last_found_count in LIST partitioned
tables when the Datum was not equal to the previous found Datum was
due to the fact that the code at the end of the function was only
checking the partition indexes matched rather than the bound_offset vs
last_found_datum_index. The reason I wanted to zero this was that if
you had a partition FOR VALUES IN(1,2), and you received rows with
values alternating between 1 and 2 then we'd match to the same
partition each time, however the equality test with the current
'values' and the Datum at last_found_datum_index would have been false
each time. If we didn't zero the last_found_count we'd have kept
using the cache path even though the Datum and last Datum wouldn't
have been equal each time. That would have resulted in always doing
the cache check and failing, then doing the binary search anyway.

I've now changed the code so that instead of checking the last found
partition is the same as the last one, I'm now checking if
bound_offset is the same as last_found_datum_index. This will be
false in the "values alternating between 1 and 2" case from above.
This caused me to have to change how the caching works for LIST
partitions with a NULL partition which is receiving NULL values. I've
coded things now to just skip the cache for that case. Finding the
correct LIST partition for a NULL value is cheap and no need to cache
that. I've also moved all the code which updates the cache fields to
the bottom of get_partition_for_tuple(). I'm only expecting to do that
when bound_offset is set by the lookup code in the switch statement.
Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
values shouldn't reach the code which updates the partition fields.
I've added an Assert(bound_offset >= 0) to ensure that stays true.

There's probably a bit more to optimise here too, but not much. I
don't think the partdesc->last_found_part_index = -1; is needed when
we're in the code block that does return boundinfo->default_index;
However, that only might very slightly speedup the case when we're
inserting continuously into the DEFAULT partition. That code path is
also used when we fail to find any matching partition. That's not one
we need to worry about making go faster.

I also ran the benchmarks again and saw that most of the use of
likely() and unlikely() no longer did what I found them to do earlier.
So the weirdness we saw there most likely was just down to random code
layout changes. In this patch, I just dropped the use of either of
those two macros.

David

Attachments:

v12_cache_last_partition.patch (text/plain)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e03ea27299..6a323436d5 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1332,10 +1332,48 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values
+ * belong to the last found partition.  This must be above 0.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD			16
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
- *		in values and isnull
+ *		in values and isnull.
+ *
+ * Calling this function can be quite expensive when LIST and RANGE
+ * partitioned tables have many partitions.  This is due to the binary search
+ * that's done to find the correct partition.  Many of the use cases for LIST
+ * and RANGE partitioned tables make it likely that the same partition is
+ * found in subsequent ExecFindPartition() calls.  This is especially true for
+ * cases such as RANGE partitioned tables on a TIMESTAMP column where the
+ * partition key is the current time.  When asked to find a partition for a
+ * RANGE or LIST partitioned table, we record the partition index and datum
+ * offset we've found for the given 'values' in the PartitionDesc (which is
+ * stored in relcache), and if we keep finding the same partition
+ * PARTITION_CACHED_FIND_THRESHOLD times in a row, then we'll enable caching
+ * logic and instead of performing a binary search to find the correct
+ * partition, we'll just double-check that 'values' still belong to the last
+ * found partition, and if so, we'll return that partition index, thus
+ * skipping the need for the binary search.  If we fail to match the last
+ * partition when double checking, then we fall back on doing a binary search.
+ * In this case, we'll reset the number of times we've hit the same partition
+ * so that we don't attempt to use the cache again until we've found that
+ * partition at least PARTITION_CACHED_FIND_THRESHOLD times in a row.
+ *
+ * For cases where the partition changes on each lookup, the amount of
+ * additional work required just amounts to recording the last found partition
+ * and bound offset then resetting the found counter.  This is cheap and does
+ * not appear to cause any meaningful slowdowns for such cases.
+ *
+ * No caching of partitions is done when the last found partition is the
+ * DEFAULT or NULL partition.  For the case of the DEFAULT partition, there
+ * is no bound offset storing the matching datum, so we cannot confirm the
+ * indexes match.  For the NULL partition, this is just so cheap, there's no
+ * sense in caching.
  *
  * Return value is index of the partition (>= 0 and < partdesc->nparts) if one
  * found or -1 if none found.
@@ -1343,12 +1381,24 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 static int
 get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 {
-	int			bound_offset;
+	int			bound_offset = -1;
 	int			part_index = -1;
 	PartitionKey key = pd->key;
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	/*
+	 * In the switch statement below, when we perform a cached lookup for
+	 * RANGE and LIST partitioned tables, if we find that the last found
+	 * partition matches the 'values', we return the partition index right
+	 * away.  We do this instead of breaking out of the switch as we don't
+	 * want to execute the code about the default partition or do any updates
+	 * for any of the cache-related fields.  That would be a waste of effort
+	 * as we already know it's not the DEFAULT partition and have no need to
+	 * increment the number of times we found the same partition any higher
+	 * than PARTITION_CACHED_FIND_THRESHOLD.
+	 */
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1356,24 +1406,62 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			{
 				uint64		rowHash;
 
+				/* hash partitioning is too cheap to bother caching */
 				rowHash = compute_partition_hash_value(key->partnatts,
 													   key->partsupfunc,
 													   key->partcollation,
 													   values, isnull);
 
-				part_index = boundinfo->indexes[rowHash % boundinfo->nindexes];
+				/*
+				 * HASH partitions can't have a DEFAULT partition and we don't
+				 * do any caching work for them, so just return the part index
+				 */
+				return boundinfo->indexes[rowHash % boundinfo->nindexes];
 			}
-			break;
 
 		case PARTITION_STRATEGY_LIST:
 			if (isnull[0])
 			{
+				/* this is far too cheap to bother doing any caching */
 				if (partition_bound_accepts_nulls(boundinfo))
-					part_index = boundinfo->null_index;
+				{
+					/*
+					 * When there is a NULL partition we just return that
+					 * directly.  We don't have a bound_offset so it's not
+					 * valid to drop into the code after the switch which
+					 * checks and updates the cache fields.  We perhaps should
+					 * be invalidating the details of the last cached
+					 * partition but there's no real need to.  Keeping those
+					 * fields set gives a chance at matching to the cached
+					 * partition on the next lookup.
+					 */
+					return boundinfo->null_index;
+				}
 			}
 			else
 			{
-				bool		equal = false;
+				bool		equal;
+
+				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+				{
+					int			last_datum_offset = partdesc->last_found_datum_index;
+					Datum		lastDatum = boundinfo->datums[last_datum_offset][0];
+					int32		cmpval;
+
+					/*
+					 * Check if the last found datum index is the same as this
+					 * Datum.
+					 */
+					cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+															 key->partcollation[0],
+															 lastDatum,
+															 values[0]));
+
+					if (cmpval == 0)
+						return boundinfo->indexes[last_datum_offset];
+
+					/* fall-through and do a manual lookup */
+				}
 
 				bound_offset = partition_list_bsearch(key->partsupfunc,
 													  key->partcollation,
@@ -1403,23 +1491,63 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 					}
 				}
 
-				if (!range_partkey_has_null)
+				if (range_partkey_has_null)
+					break;
+
+				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
+					int			last_datum_offset = partdesc->last_found_datum_index;
+					Datum	   *lastDatums = boundinfo->datums[last_datum_offset];
+					PartitionRangeDatumKind *kind = boundinfo->kind[last_datum_offset];
+					int32		cmpval;
+
+					/* Check if the value is >= the lower bound */
+					cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+														key->partcollation,
+														lastDatums,
+														kind,
+														values,
+														key->partnatts);
 
 					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
+					 * If it's equal to the lower bound then no need to check
+					 * the upper bound.
 					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					if (cmpval == 0)
+						return boundinfo->indexes[last_datum_offset + 1];
+
+					else if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)
+					{
+						/* Check if the value is below the upper bound */
+						lastDatums = boundinfo->datums[last_datum_offset + 1];
+						kind = boundinfo->kind[last_datum_offset + 1];
+						cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+															key->partcollation,
+															lastDatums,
+															kind,
+															values,
+															key->partnatts);
+
+						if (cmpval > 0)
+							return boundinfo->indexes[last_datum_offset + 1];
+					}
+					/* fall-through and do a manual lookup */
 				}
+
+				bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+															 key->partcollation,
+															 boundinfo,
+															 key->partnatts,
+															 values,
+															 &equal);
+
+				/*
+				 * The bound at bound_offset is less than or equal to the
+				 * tuple value, so the bound at offset+1 is the upper bound of
+				 * the partition we're looking for, if there actually exists
+				 * one.
+				 */
+				part_index = boundinfo->indexes[bound_offset + 1];
 			}
 			break;
 
@@ -1433,7 +1561,39 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	 * the default partition, if there is one.
 	 */
 	if (part_index < 0)
-		part_index = boundinfo->default_index;
+	{
+		/*
+		 * Since we don't do caching for the default partition or failed
+		 * lookups, we'll just wipe the cache fields back to their initial
+		 * values.  The count becomes 0 rather than 1 as 1 means it's the
+		 * first time we've found a partition we're recording for the cache.
+		 */
+		partdesc->last_found_datum_index = -1;
+		partdesc->last_found_part_index = -1;
+		partdesc->last_found_count = 0;
+
+		return boundinfo->default_index;
+	}
+
+	/* We should only make it here when the code above set bound_offset */
+	Assert(bound_offset >= 0);
+
+	/*
+	 * Attend to the cache fields.  If the bound_offset matches the last
+	 * cached bound offset then we've found the same partition as last time,
+	 * so bump the count by one.  If all goes well we'll eventually reach
+	 * PARTITION_CACHED_FIND_THRESHOLD and we'll try the cache path next time
+	 * around.  Otherwise, we'll reset the cache count back to 1 to mark that
+	 * we've found this partition for the first time.
+	 */
+	if (bound_offset == partdesc->last_found_datum_index)
+		partdesc->last_found_count++;
+	else
+	{
+		partdesc->last_found_count = 1;
+		partdesc->last_found_part_index = part_index;
+		partdesc->last_found_datum_index = bound_offset;
+	}
 
 	return part_index;
 }
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8b6e0bd595..737f0edd89 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -290,6 +290,12 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 	{
 		oldcxt = MemoryContextSwitchTo(new_pdcxt);
 		partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+
+		/* Initialize caching fields for speeding up ExecFindPartition */
+		partdesc->last_found_datum_index = -1;
+		partdesc->last_found_part_index = -1;
+		partdesc->last_found_count = 0;
+
 		partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
 		partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index ae1afe3d78..4659fe0e64 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,32 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+
+	/* Caching fields to cache lookups in get_partition_for_tuple() */
+
+	/*
+	 * Index into the PartitionBoundInfo's datum array for the last found
+	 * partition or -1 if none.
+	 */
+	int			last_found_datum_index;
+
+	/*
+	 * Partition index of the last found partition or -1 if none has been
+	 * found yet, the last found was the DEFAULT partition, or there was no
+	 * valid partition for the last looked up values.
+	 */
+	int			last_found_part_index;
+
+	/*
+	 * For LIST partitioning, this is the number of times in a row that the
+	 * datum we're looking for a partition for matches the datum in the
+	 * last_found_datum_index index of the boundinfo->datums array.  For RANGE
+	 * partitioning, this is the number of times in a row we've found that the
+	 * datum we're looking for a partition for falls into the range of the
+	 * partition corresponding to the last_found_datum_index index of the
+	 * boundinfo->datums array.
+	 */
+	int			last_found_count;
 } PartitionDescData;
 
 
#62 Zhihong Yu
zyu@yugabyte.com
In reply to: David Rowley (#61)
Re: Skip partition tuple routing with constant partition key

On Tue, Jul 26, 2022 at 3:28 PM David Rowley <dgrowleyml@gmail.com> wrote:

Thanks for looking at this.

On Sat, 23 Jul 2022 at 01:23, Amit Langote <amitlangote09@gmail.com>
wrote:

+                   /*
+                    * The Datum has changed.  Zero the number of times we've
+                    * found last_found_datum_index in a row.
+                    */
+                   partdesc->last_found_count = 0;

+                   /* Zero the "winning streak" on the cache hit count */
+                   partdesc->last_found_count = 0;

Might it be better for the two comments to say the same thing? Also,
I wonder which one do you intend as the resetting of last_found_count:
setting it to 0 or 1? I can see that the stanza at the end of the
function sets to 1 to start a new cycle.

I think I've addressed all of your comments. The above one in
particular caused me to make some larger changes.

The reason I was zeroing the last_found_count in LIST partitioned
tables when the Datum was not equal to the previous found Datum was
due to the fact that the code at the end of the function was only
checking the partition indexes matched rather than the bound_offset vs
last_found_datum_index. The reason I wanted to zero this was that if
you had a partition FOR VALUES IN(1,2), and you received rows with
values alternating between 1 and 2 then we'd match to the same
partition each time, however the equality test with the current
'values' and the Datum at last_found_datum_index would have been false
each time. If we didn't zero the last_found_count we'd have kept
using the cache path even though the Datum and last Datum wouldn't
have been equal each time. That would have resulted in always doing
the cache check and failing, then doing the binary search anyway.

I've now changed the code so that instead of checking the last found
partition is the same as the last one, I'm now checking if
bound_offset is the same as last_found_datum_index. This will be
false in the "values alternating between 1 and 2" case from above.
This caused me to have to change how the caching works for LIST
partitions with a NULL partition which is receiving NULL values. I've
coded things now to just skip the cache for that case. Finding the
correct LIST partition for a NULL value is cheap and no need to cache
that. I've also moved all the code which updates the cache fields to
the bottom of get_partition_for_tuple(). I'm only expecting to do that
when bound_offset is set by the lookup code in the switch statement.
Any paths, e.g. HASH partitioning lookup and LIST or RANGE with NULL
values shouldn't reach the code which updates the partition fields.
I've added an Assert(bound_offset >= 0) to ensure that stays true.

There's probably a bit more to optimise here too, but not much. I
don't think the partdesc->last_found_part_index = -1; is needed when
we're in the code block that does return boundinfo->default_index;
However, that only might very slightly speedup the case when we're
inserting continuously into the DEFAULT partition. That code path is
also used when we fail to find any matching partition. That's not one
we need to worry about making go faster.

I also ran the benchmarks again and saw that most of the use of
likely() and unlikely() no longer did what I found them to do earlier.
So the weirdness we saw there most likely was just down to random code
layout changes. In this patch, I just dropped the use of either of
those two macros.

David

Hi,

+                       return boundinfo->indexes[last_datum_offset + 1];
+
+                   else if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)

nit: the `else` keyword is not needed.

Cheers

#63 Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#61)
Re: Skip partition tuple routing with constant partition key

On Wed, Jul 27, 2022 at 7:28 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Sat, 23 Jul 2022 at 01:23, Amit Langote <amitlangote09@gmail.com> wrote:

+                   /*
+                    * The Datum has changed.  Zero the number of times we've
+                    * found last_found_datum_index in a row.
+                    */
+                   partdesc->last_found_count = 0;
+                   /* Zero the "winning streak" on the cache hit count */
+                   partdesc->last_found_count = 0;

Might it be better for the two comments to say the same thing? Also,
I wonder which one do you intend as the resetting of last_found_count:
setting it to 0 or 1? I can see that the stanza at the end of the
function sets to 1 to start a new cycle.

I think I've addressed all of your comments. The above one in
particular caused me to make some larger changes.

The reason I was zeroing the last_found_count in LIST partitioned
tables when the Datum was not equal to the previous found Datum was
due to the fact that the code at the end of the function was only
checking the partition indexes matched rather than the bound_offset vs
last_found_datum_index. The reason I wanted to zero this was that if
you had a partition FOR VALUES IN(1,2), and you received rows with
values alternating between 1 and 2 then we'd match to the same
partition each time, however the equality test with the current
'values' and the Datum at last_found_datum_index would have been false
each time. If we didn't zero the last_found_count we'd have kept
using the cache path even though the Datum and last Datum wouldn't
have been equal each time. That would have resulted in always doing
the cache check and failing, then doing the binary search anyway.

Thanks for the explanation. So, in a way the caching scheme works for
LIST partitioning only if the same value appears consecutively in the
input set, whereas it does not for *a set of* values belonging to the
same partition appearing consecutively. Maybe that's a reasonable
restriction for now.

I've now changed the code so that instead of checking whether the found
partition is the same as the last one, I'm now checking if
bound_offset is the same as last_found_datum_index. This will be
false in the "values alternating between 1 and 2" case from above.
This caused me to change how the caching works for LIST
partitions with a NULL partition which is receiving NULL values. I've
coded things now to just skip the cache for that case. Finding the
correct LIST partition for a NULL value is cheap, so there's no need to
cache it. I've also moved all the code which updates the cache fields to
the bottom of get_partition_for_tuple(). I'm only expecting to do that
when bound_offset is set by the lookup code in the switch statement.
Paths such as the HASH partitioning lookup, or LIST or RANGE with NULL
values, shouldn't reach the code which updates the partition fields.
I've added an Assert(bound_offset >= 0) to ensure that stays true.

Looks good.

There's probably a bit more to optimise here too, but not much. I
don't think the partdesc->last_found_part_index = -1; is needed when
we're in the code block that does return boundinfo->default_index;
However, that might only very slightly speed up the case when we're
inserting continuously into the DEFAULT partition. That code path is
also used when we fail to find any matching partition. That's not one
we need to worry about making go faster.

So this is about:

    if (part_index < 0)
-       part_index = boundinfo->default_index;
+   {
+       /*
+        * Since we don't do caching for the default partition or failed
+        * lookups, we'll just wipe the cache fields back to their initial
+        * values.  The count becomes 0 rather than 1 as 1 means it's the
+        * first time we've found a partition we're recording for the cache.
+        */
+       partdesc->last_found_datum_index = -1;
+       partdesc->last_found_part_index = -1;
+       partdesc->last_found_count = 0;
+
+       return boundinfo->default_index;
+   }

I wonder why not to leave the cache untouched in this case? It's
possible that erratic rows only rarely occur in the input sets.

I also ran the benchmarks again and saw that most of the use of
likely() and unlikely() no longer did what I found them to do earlier.
So the weirdness we saw there most likely was just down to random code
layout changes. In this patch, I just dropped the use of either of
those two macros.

Ah, using either seems to be trying to fit the code to one or the other
pattern in the input set anyway, so it seems fine to keep them out for
now.

Some minor comments:

+ * The number of times the same partition must be found in a row before we
+ * switch from a search for the given values to just checking if the values

How about:

switch from using a binary search for the given values to...

Should the comment update above get_partition_for_tuple() mention
something like the cached path is basically O(1) and the non-cache
path O(log N) as I can see in comments in some other modules, like
pairingheap.c?

+ * so bump the count by one. If all goes well we'll eventually reach

Maybe a comma is needed after "well", because I got tricked into
thinking the "well" is duplicated.

+ * PARTITION_CACHED_FIND_THRESHOLD and we'll try the cache path next time

"we'll" sounds redundant with the one in the previous line.

+ * found yet, the last found was the DEFAULT partition, or there was no

Adding "if" to both sentence fragments might make this sound better.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

#64 David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#63)
1 attachment(s)
Re: Skip partition tuple routing with constant partition key

On Thu, 28 Jul 2022 at 00:50, Amit Langote <amitlangote09@gmail.com> wrote:

So, in a way the caching scheme works for
LIST partitioning only if the same value appears consecutively in the
input set, whereas it does not for *a set of* values belonging to the
same partition appearing consecutively. Maybe that's a reasonable
restriction for now.

I'm not really seeing another cheap enough way of doing that. Any LIST
partition could allow any number of values. We've only space to record
1 of those values, by way of recording which element of the
PartitionBound it was located in.

if (part_index < 0)
-       part_index = boundinfo->default_index;
+   {
+       /*
+        * Since we don't do caching for the default partition or failed
+        * lookups, we'll just wipe the cache fields back to their initial
+        * values.  The count becomes 0 rather than 1 as 1 means it's the
+        * first time we've found a partition we're recording for the cache.
+        */
+       partdesc->last_found_datum_index = -1;
+       partdesc->last_found_part_index = -1;
+       partdesc->last_found_count = 0;
+
+       return boundinfo->default_index;
+   }

I wonder why not to leave the cache untouched in this case? It's
possible that erratic rows only rarely occur in the input sets.

I looked into that and I ended up just removing the code to reset the
cache. It now works similarly to a LIST partitioned table's NULL
partition.

Should the comment update above get_partition_for_tuple() mention
something like the cached path is basically O(1) and the non-cache
path O(log N) as I can see in comments in some other modules, like
pairingheap.c?

I adjusted for the other things you mentioned but I didn't add the big
O stuff. I thought the comment was clear enough.

I'd quite like to push this patch early next week, so if anyone else
is following along that might have any objections, could they do so
before then?

David

Attachments:

v13_cache_last_partition.patch (text/plain; charset=US-ASCII)
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index e03ea27299..7ae7496737 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -1332,10 +1332,49 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 		elog(ERROR, "wrong number of partition key expressions");
 }
 
+/*
+ * The number of times the same partition must be found in a row before we
+ * switch from a binary search for the given values to just checking if the
+ * values belong to the last found partition.  This must be above 0.
+ */
+#define PARTITION_CACHED_FIND_THRESHOLD			16
+
 /*
  * get_partition_for_tuple
  *		Finds partition of relation which accepts the partition key specified
- *		in values and isnull
+ *		in values and isnull.
+ *
+ * Calling this function can be quite expensive when LIST and RANGE
+ * partitioned tables have many partitions.  This is due to the binary search
+ * that's done to find the correct partition.  Many of the use cases for LIST
+ * and RANGE partitioned tables make it likely that the same partition is
+ * found in subsequent ExecFindPartition() calls.  This is especially true for
+ * cases such as RANGE partitioned tables on a TIMESTAMP column where the
+ * partition key is the current time.  When asked to find a partition for a
+ * RANGE or LIST partitioned table, we record the partition index and datum
+ * offset we've found for the given 'values' in the PartitionDesc (which is
+ * stored in relcache), and if we keep finding the same partition
+ * PARTITION_CACHED_FIND_THRESHOLD times in a row, then we'll enable caching
+ * logic and instead of performing a binary search to find the correct
+ * partition, we'll just double-check that 'values' still belong to the last
+ * found partition, and if so, we'll return that partition index, thus
+ * skipping the need for the binary search.  If we fail to match the last
+ * partition when double checking, then we fall back on doing a binary search.
+ * In this case, unless we find 'values' belong to the DEFAULT partition,
+ * we'll reset the number of times we've hit the same partition so that we
+ * don't attempt to use the cache again until we've found that partition at
+ * least PARTITION_CACHED_FIND_THRESHOLD times in a row.
+ *
+ * For cases where the partition changes on each lookup, the amount of
+ * additional work required just amounts to recording the last found partition
+ * and bound offset then resetting the found counter.  This is cheap and does
+ * not appear to cause any meaningful slowdowns for such cases.
+ *
+ * No caching of partitions is done when the last found partition is the
+ * DEFAULT or NULL partition.  For the case of the DEFAULT partition, there
+ * is no bound offset storing the matching datum, so we cannot confirm the
+ * indexes match.  For the NULL partition, this is just so cheap, there's no
+ * sense in caching.
  *
  * Return value is index of the partition (>= 0 and < partdesc->nparts) if one
  * found or -1 if none found.
@@ -1343,12 +1382,24 @@ FormPartitionKeyDatum(PartitionDispatch pd,
 static int
 get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 {
-	int			bound_offset;
+	int			bound_offset = -1;
 	int			part_index = -1;
 	PartitionKey key = pd->key;
 	PartitionDesc partdesc = pd->partdesc;
 	PartitionBoundInfo boundinfo = partdesc->boundinfo;
 
+	/*
+	 * In the switch statement below, when we perform a cached lookup for
+	 * RANGE and LIST partitioned tables, if we find that the last found
+	 * partition matches the 'values', we return the partition index right
+	 * away.  We do this instead of breaking out of the switch as we don't
+	 * want to execute the code about the DEFAULT partition or do any updates
+	 * for any of the cache-related fields.  That would be a waste of effort
+	 * as we already know it's not the DEFAULT partition and have no need to
+	 * increment the number of times we found the same partition any higher
+	 * than PARTITION_CACHED_FIND_THRESHOLD.
+	 */
+
 	/* Route as appropriate based on partitioning strategy. */
 	switch (key->strategy)
 	{
@@ -1356,24 +1407,62 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 			{
 				uint64		rowHash;
 
+				/* hash partitioning is too cheap to bother caching */
 				rowHash = compute_partition_hash_value(key->partnatts,
 													   key->partsupfunc,
 													   key->partcollation,
 													   values, isnull);
 
-				part_index = boundinfo->indexes[rowHash % boundinfo->nindexes];
+				/*
+				 * HASH partitions can't have a DEFAULT partition and we don't
+				 * do any caching work for them, so just return the part index
+				 */
+				return boundinfo->indexes[rowHash % boundinfo->nindexes];
 			}
-			break;
 
 		case PARTITION_STRATEGY_LIST:
 			if (isnull[0])
 			{
+				/* this is far too cheap to bother doing any caching */
 				if (partition_bound_accepts_nulls(boundinfo))
-					part_index = boundinfo->null_index;
+				{
+					/*
+					 * When there is a NULL partition we just return that
+					 * directly.  We don't have a bound_offset so it's not
+					 * valid to drop into the code after the switch which
+					 * checks and updates the cache fields.  We perhaps should
+					 * be invalidating the details of the last cached
+					 * partition but there's no real need to.  Keeping those
+					 * fields set gives a chance at matching to the cached
+					 * partition on the next lookup.
+					 */
+					return boundinfo->null_index;
+				}
 			}
 			else
 			{
-				bool		equal = false;
+				bool		equal;
+
+				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
+				{
+					int			last_datum_offset = partdesc->last_found_datum_index;
+					Datum		lastDatum = boundinfo->datums[last_datum_offset][0];
+					int32		cmpval;
+
+					/*
+					 * Check if the last found datum index is the same as this
+					 * Datum.
+					 */
+					cmpval = DatumGetInt32(FunctionCall2Coll(&key->partsupfunc[0],
+															 key->partcollation[0],
+															 lastDatum,
+															 values[0]));
+
+					if (cmpval == 0)
+						return boundinfo->indexes[last_datum_offset];
+
+					/* fall-through and do a manual lookup */
+				}
 
 				bound_offset = partition_list_bsearch(key->partsupfunc,
 													  key->partcollation,
@@ -1403,23 +1492,63 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 					}
 				}
 
-				if (!range_partkey_has_null)
+				if (range_partkey_has_null)
+					break;
+
+				if (partdesc->last_found_count >= PARTITION_CACHED_FIND_THRESHOLD)
 				{
-					bound_offset = partition_range_datum_bsearch(key->partsupfunc,
-																 key->partcollation,
-																 boundinfo,
-																 key->partnatts,
-																 values,
-																 &equal);
+					int			last_datum_offset = partdesc->last_found_datum_index;
+					Datum	   *lastDatums = boundinfo->datums[last_datum_offset];
+					PartitionRangeDatumKind *kind = boundinfo->kind[last_datum_offset];
+					int32		cmpval;
+
+					/* Check if the value is >= to the lower bound */
+					cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+														key->partcollation,
+														lastDatums,
+														kind,
+														values,
+														key->partnatts);
 
 					/*
-					 * The bound at bound_offset is less than or equal to the
-					 * tuple value, so the bound at offset+1 is the upper
-					 * bound of the partition we're looking for, if there
-					 * actually exists one.
+					 * If it's equal to the lower bound then no need to check
+					 * the upper bound.
 					 */
-					part_index = boundinfo->indexes[bound_offset + 1];
+					if (cmpval == 0)
+						return boundinfo->indexes[last_datum_offset + 1];
+
+					if (cmpval < 0 && last_datum_offset + 1 < boundinfo->ndatums)
+					{
+						/* Check if the value is below the upper bound */
+						lastDatums = boundinfo->datums[last_datum_offset + 1];
+						kind = boundinfo->kind[last_datum_offset + 1];
+						cmpval = partition_rbound_datum_cmp(key->partsupfunc,
+															key->partcollation,
+															lastDatums,
+															kind,
+															values,
+															key->partnatts);
+
+						if (cmpval > 0)
+							return boundinfo->indexes[last_datum_offset + 1];
+					}
+					/* fall-through and do a manual lookup */
 				}
+
+				bound_offset = partition_range_datum_bsearch(key->partsupfunc,
+															 key->partcollation,
+															 boundinfo,
+															 key->partnatts,
+															 values,
+															 &equal);
+
+				/*
+				 * The bound at bound_offset is less than or equal to the
+				 * tuple value, so the bound at offset+1 is the upper bound of
+				 * the partition we're looking for, if there actually exists
+				 * one.
+				 */
+				part_index = boundinfo->indexes[bound_offset + 1];
 			}
 			break;
 
@@ -1433,7 +1562,34 @@ get_partition_for_tuple(PartitionDispatch pd, Datum *values, bool *isnull)
 	 * the default partition, if there is one.
 	 */
 	if (part_index < 0)
-		part_index = boundinfo->default_index;
+	{
+		/*
+		 * No need to reset the cache fields here.  The next set of values
+		 * might end up belonging to the cached partition, so leaving the
+		 * cache alone improves the chances of a cache hit on the next lookup.
+		 */
+		return boundinfo->default_index;
+	}
+
+	/* We should only make it here when the code above set bound_offset */
+	Assert(bound_offset >= 0);
+
+	/*
+	 * Attend to the cache fields.  If the bound_offset matches the last
+	 * cached bound offset then we've found the same partition as last time,
+	 * so bump the count by one.  If all goes well, we'll eventually reach
+	 * PARTITION_CACHED_FIND_THRESHOLD and try the cache path next time
+	 * around.  Otherwise, we'll reset the cache count back to 1 to mark that
+	 * we've found this partition for the first time.
+	 */
+	if (bound_offset == partdesc->last_found_datum_index)
+		partdesc->last_found_count++;
+	else
+	{
+		partdesc->last_found_count = 1;
+		partdesc->last_found_part_index = part_index;
+		partdesc->last_found_datum_index = bound_offset;
+	}
 
 	return part_index;
 }
diff --git a/src/backend/partitioning/partdesc.c b/src/backend/partitioning/partdesc.c
index 8b6e0bd595..737f0edd89 100644
--- a/src/backend/partitioning/partdesc.c
+++ b/src/backend/partitioning/partdesc.c
@@ -290,6 +290,12 @@ RelationBuildPartitionDesc(Relation rel, bool omit_detached)
 	{
 		oldcxt = MemoryContextSwitchTo(new_pdcxt);
 		partdesc->boundinfo = partition_bounds_copy(boundinfo, key);
+
+		/* Initialize caching fields for speeding up ExecFindPartition */
+		partdesc->last_found_datum_index = -1;
+		partdesc->last_found_part_index = -1;
+		partdesc->last_found_count = 0;
+
 		partdesc->oids = (Oid *) palloc(nparts * sizeof(Oid));
 		partdesc->is_leaf = (bool *) palloc(nparts * sizeof(bool));
 
diff --git a/src/include/partitioning/partdesc.h b/src/include/partitioning/partdesc.h
index ae1afe3d78..1de9a658c7 100644
--- a/src/include/partitioning/partdesc.h
+++ b/src/include/partitioning/partdesc.h
@@ -36,6 +36,32 @@ typedef struct PartitionDescData
 								 * the corresponding 'oids' element belongs to
 								 * a leaf partition or not */
 	PartitionBoundInfo boundinfo;	/* collection of partition bounds */
+
+	/* Caching fields to cache lookups in get_partition_for_tuple() */
+
+	/*
+	 * Index into the PartitionBoundInfo's datum array for the last found
+	 * partition or -1 if none.
+	 */
+	int			last_found_datum_index;
+
+	/*
+	 * Partition index of the last found partition or -1 if none has been
+	 * found yet, if the last found was the DEFAULT partition, or if there was
+	 * no valid partition for the last looked up values.
+	 */
+	int			last_found_part_index;
+
+	/*
+	 * For LIST partitioning, this is the number of times in a row that the
+	 * the datum we're looking for a partition for matches the datum in the
+	 * last_found_datum_index index of the boundinfo->datums array.  For RANGE
+	 * partitioning, this is the number of times in a row we've found that the
+	 * datum we're looking for a partition for falls into the range of the
+	 * partition corresponding to the last_found_datum_index index of the
+	 * boundinfo->datums array.
+	 */
+	int			last_found_count;
 } PartitionDescData;
 
 
#65 Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#64)
Re: Skip partition tuple routing with constant partition key

On Thu, Jul 28, 2022 at 11:59 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 28 Jul 2022 at 00:50, Amit Langote <amitlangote09@gmail.com> wrote:

So, in a way the caching scheme works for
LIST partitioning only if the same value appears consecutively in the
input set, whereas it does not for *a set of* values belonging to the
same partition appearing consecutively. Maybe that's a reasonable
restriction for now.

I'm not really seeing another cheap enough way of doing that. Any LIST
partition could allow any number of values. We've only space to record
1 of those values, by way of recording which element of the
PartitionBound it was located in.

Yeah, no need to complicate the implementation for the LIST case.

if (part_index < 0)
-       part_index = boundinfo->default_index;
+   {
+       /*
+        * Since we don't do caching for the default partition or failed
+        * lookups, we'll just wipe the cache fields back to their initial
+        * values.  The count becomes 0 rather than 1 as 1 means it's the
+        * first time we've found a partition we're recording for the cache.
+        */
+       partdesc->last_found_datum_index = -1;
+       partdesc->last_found_part_index = -1;
+       partdesc->last_found_count = 0;
+
+       return boundinfo->default_index;
+   }

I wonder why not to leave the cache untouched in this case? It's
possible that erratic rows only rarely occur in the input sets.

I looked into that and I ended up just removing the code to reset the
cache. It now works similarly to a LIST partitioned table's NULL
partition.

+1

Should the comment update above get_partition_for_tuple() mention
something like the cached path is basically O(1) and the non-cache
path O(log N) as I can see in comments in some other modules, like
pairingheap.c?

I adjusted for the other things you mentioned but I didn't add the big
O stuff. I thought the comment was clear enough.

WFM.

I'd quite like to push this patch early next week, so if anyone else
is following along that might have any objections, could they do so
before then?

I have no more comments.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com

#66 houzj.fnst@fujitsu.com
houzj.fnst@fujitsu.com
In reply to: David Rowley (#64)
RE: Skip partition tuple routing with constant partition key

On Thursday, July 28, 2022 10:59 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 28 Jul 2022 at 00:50, Amit Langote <amitlangote09@gmail.com>
wrote:

So, in a way the caching scheme works for LIST partitioning only if
the same value appears consecutively in the input set, whereas it does
not for *a set of* values belonging to the same partition appearing
consecutively. Maybe that's a reasonable restriction for now.

I'm not really seeing another cheap enough way of doing that. Any LIST
partition could allow any number of values. We've only space to record
1 of those values, by way of recording which element of the PartitionBound
it was located in.

if (part_index < 0)
-       part_index = boundinfo->default_index;
+   {
+       /*
+        * Since we don't do caching for the default partition or failed
+        * lookups, we'll just wipe the cache fields back to their initial
+        * values.  The count becomes 0 rather than 1 as 1 means it's the
+        * first time we've found a partition we're recording for the cache.
+        */
+       partdesc->last_found_datum_index = -1;
+       partdesc->last_found_part_index = -1;
+       partdesc->last_found_count = 0;
+
+       return boundinfo->default_index;
+   }

I wonder why not to leave the cache untouched in this case? It's
possible that erratic rows only rarely occur in the input sets.

I looked into that and I ended up just removing the code to reset the cache. It
now works similarly to a LIST partitioned table's NULL partition.

Should the comment update above get_partition_for_tuple() mention
something like the cached path is basically O(1) and the non-cache
path O(log N) as I can see in comments in some other modules, like
pairingheap.c?

I adjusted for the other things you mentioned but I didn't add the big O stuff. I
thought the comment was clear enough.

I'd quite like to push this patch early next week, so if anyone else is following
along that might have any objections, could they do so before then?

Thanks for the patch. The patch looks good to me.

Only a minor nitpick:

+	/*
+	 * For LIST partitioning, this is the number of times in a row that the
+	 * the datum we're looking

There seems to be a duplicated 'the' in this comment:
"the the datum".

Best regards,
Hou Zhijie

#67 David Rowley
dgrowleyml@gmail.com
In reply to: Amit Langote (#65)
Re: Skip partition tuple routing with constant partition key

On Thu, 28 Jul 2022 at 19:37, Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Jul 28, 2022 at 11:59 AM David Rowley <dgrowleyml@gmail.com> wrote:

I'd quite like to push this patch early next week, so if anyone else
is following along that might have any objections, could they do so
before then?

I have no more comments.

Thank you both for the reviews.

I've now pushed this.

David

#68 Amit Langote
amitlangote09@gmail.com
In reply to: David Rowley (#67)
Re: Skip partition tuple routing with constant partition key

On Tue, Aug 2, 2022 at 6:58 AM David Rowley <dgrowleyml@gmail.com> wrote:

On Thu, 28 Jul 2022 at 19:37, Amit Langote <amitlangote09@gmail.com> wrote:

On Thu, Jul 28, 2022 at 11:59 AM David Rowley <dgrowleyml@gmail.com> wrote:

I'd quite like to push this patch early next week, so if anyone else
is following along that might have any objections, could they do so
before then?

I have no more comments.

Thank you both for the reviews.

I've now pushed this.

Thank you for working on this.

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com