partitioning - changing a slot's descriptor is expensive
Hi,
(sorry if I CCed the wrong folks, the code has changed a fair bit)
I noticed that several places in the partitioning code look like:
/*
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.
*/
if (resultRelInfo->ri_PartitionRoot)
{
TableTuple tuple = ExecFetchSlotTuple(slot);
TupleConversionMap *map;
rel = resultRelInfo->ri_PartitionRoot;
tupdesc = RelationGetDescr(rel);
/* a reverse map */
map = convert_tuples_by_name(orig_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
if (map != NULL)
{
tuple = do_convert_tuple(tuple, map);
ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc). Nor is it always cheap to call ExecFetchSlotTuple(slot) - it might
be a virtual slot. Calling heap_deform_tuple(), as done in
do_convert_tuple(), might be superfluous work, because the input tuple
might already be entirely deformed in the input slot.
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
Besides performance as the code stands, this'll also be important for
pluggable storage (as we'd otherwise get a *lot* of back/forth
conversions between tuple formats).
Anybody interested in writing a patch that redoes this?
Greetings,
Andres Freund
Hi Andres,
On 2018/06/27 14:09, Andres Freund wrote:
Hi,
(sorry if I CCed the wrong folks, the code has changed a fair bit)
I noticed that several places in the partitioning code look like:
/*
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.
*/
if (resultRelInfo->ri_PartitionRoot)
{
TableTuple tuple = ExecFetchSlotTuple(slot);
TupleConversionMap *map;
rel = resultRelInfo->ri_PartitionRoot;
tupdesc = RelationGetDescr(rel);
/* a reverse map */
map = convert_tuples_by_name(orig_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
if (map != NULL)
{
tuple = do_convert_tuple(tuple, map);
ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
This particular code block (and a couple of others like it) is executed
only if a tuple fails the partition constraint check, so maybe not a very
frequently executed piece of code.
There is however similar code that runs in non-error paths, such as in
ExecPrepareTupleRouting(), that is executed *if* the leaf partition and
the root parent have differing attribute numbers. So, one might say that
that code's there to handle the special cases which may not arise much in
practice, but it may still be a good idea to make improvements if we can
do so.
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc). Nor is it always cheap to call ExecFetchSlotTuple(slot) - it might
be a virtual slot. Calling heap_deform_tuple(), as done in
do_convert_tuple(), might be superfluous work, because the input tuple
might already be entirely deformed in the input slot.
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
Besides performance as the code stands, this'll also be important for
pluggable storage (as we'd otherwise get a *lot* of back/forth
conversions between tuple formats).
Anybody interested in writing a patch that redoes this?
Thanks for the pointers above, I will see if I can produce a patch and
will also be interested in hearing what others may have to say.
Thanks,
Amit
On 2018-06-27 14:46:26 +0900, Amit Langote wrote:
Hi Andres,
On 2018/06/27 14:09, Andres Freund wrote:
Hi,
(sorry if I CCed the wrong folks, the code has changed a fair bit)
I noticed that several places in the partitioning code look like:
/*
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.
*/
if (resultRelInfo->ri_PartitionRoot)
{
TableTuple tuple = ExecFetchSlotTuple(slot);
TupleConversionMap *map;
rel = resultRelInfo->ri_PartitionRoot;
tupdesc = RelationGetDescr(rel);
/* a reverse map */
map = convert_tuples_by_name(orig_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
if (map != NULL)
{
tuple = do_convert_tuple(tuple, map);
ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
This particular code block (and a couple of others like it) is executed
only if a tuple fails the partition constraint check, so maybe not a very
frequently executed piece of code.
There is however similar code that runs in non-error paths, such as in
ExecPrepareTupleRouting(), that is executed *if* the leaf partition and
the root parent have differing attribute numbers. So, one might say that
that code's there to handle the special cases which may not arise much in
practice, but it may still be a good idea to make improvements if we can
do so.
Yea, I was referring to all do_convert_tuple() callers, and some of them
are more hot-path than the specific one above. E.g. the
ConvertPartitionTupleSlot() call in CopyFrom().
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc). Nor is it always cheap to call ExecFetchSlotTuple(slot) - it might
be a virtual slot. Calling heap_deform_tuple(), as done in
do_convert_tuple(), might be superfluous work, because the input tuple
might already be entirely deformed in the input slot.
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
Besides performance as the code stands, this'll also be important for
pluggable storage (as we'd otherwise get a *lot* of back/forth
conversions between tuple formats).
Anybody interested in writing a patch that redoes this?
Thanks for the pointers above, I will see if I can produce a patch and
will also be interested in hearing what others may have to say.
Cool.
Greetings,
Andres Freund
On 2018/06/27 14:55, Andres Freund wrote:
On 2018-06-27 14:46:26 +0900, Amit Langote wrote:
There is however similar code that runs in non-error paths, such as in
ExecPrepareTupleRouting(), that is executed *if* the leaf partition and
the root parent have differing attribute numbers. So, one might say that
that code's there to handle the special cases which may not arise much in
practice, but it may still be a good idea to make improvements if we can
do so.
Yea, I was referring to all do_convert_tuple() callers, and some of them
are more hot-path than the specific one above. E.g. the
ConvertPartitionTupleSlot() call in CopyFrom().
Just in case you haven't noticed, ConvertPartitionTupleSlot(), even if
it's called unconditionally from CopyFrom etc., calls do_convert_tuple
only if necessary. Note the return statement at the top of its body.
HeapTuple
ConvertPartitionTupleSlot(TupleConversionMap *map,
HeapTuple tuple,
TupleTableSlot *new_slot,
TupleTableSlot **p_my_slot)
{
if (!map)
return tuple;
map is non-NULL only if a partition has different attribute numbers than
the root parent.
Thanks,
Amit
On Wed, Jun 27, 2018 at 10:39 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
(sorry if I CCed the wrong folks, the code has changed a fair bit)
I noticed that several places in the partitioning code look like:
/*
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.
*/
if (resultRelInfo->ri_PartitionRoot)
{
TableTuple tuple = ExecFetchSlotTuple(slot);
TupleConversionMap *map;
rel = resultRelInfo->ri_PartitionRoot;
tupdesc = RelationGetDescr(rel);
/* a reverse map */
map = convert_tuples_by_name(orig_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
if (map != NULL)
{
tuple = do_convert_tuple(tuple, map);
ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc).
I bumped into this code yesterday while looking at ExecFetchSlotTuple
and ExecMaterializeSlot usages. I was wondering the same.
ExecSetSlotDescriptor() always frees tts_values and tts_isnull array
even if the tuple descriptor being set has same number of attributes
as previous one. Most of the times that will be the case. I think we
should optimize that case. If you agree, I will submit a patch for
that.
Amit Langote has clarified that this doesn't happen often since
partitioned tables will usually not require tuple conversion. But I
agree that in the pathological case this is a performance eater and should
be fixed.
Nor is it always cheap to call ExecFetchSlotTuple(slot) - it might
be a virtual slot. Calling heap_deform_tuple(), as done in
do_convert_tuple(), might be superfluous work, because the input tuple
might already be entirely deformed in the input slot.
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
+1 for that.
But I am worried that the code will have to create a thousand slots if
there are a thousand partitions. I think we will need to see how much
effect that has.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 27 June 2018 at 18:33, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jun 27, 2018 at 10:39 AM, Andres Freund <andres@anarazel.de> wrote:
Hi,
(sorry if I CCed the wrong folks, the code has changed a fair bit)
I noticed that several places in the partitioning code look like:
/*
* If the tuple has been routed, it's been converted to the
* partition's rowtype, which might differ from the root
* table's. We must convert it back to the root table's
* rowtype so that val_desc shown error message matches the
* input tuple.
*/
if (resultRelInfo->ri_PartitionRoot)
{
TableTuple tuple = ExecFetchSlotTuple(slot);
TupleConversionMap *map;
rel = resultRelInfo->ri_PartitionRoot;
tupdesc = RelationGetDescr(rel);
/* a reverse map */
map = convert_tuples_by_name(orig_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
if (map != NULL)
{
tuple = do_convert_tuple(tuple, map);
ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc).
I bumped into this code yesterday while looking at ExecFetchSlotTuple
and ExecMaterializeSlot usages. I was wondering the same.
ExecSetSlotDescriptor() always frees tts_values and tts_isnull array
even if the tuple descriptor being set has same number of attributes
as previous one. Most of the times that will be the case. I think we
should optimize that case.
+1
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
+1 for that.
But I am worried that the code will have to create a thousand slots if
there are a thousand partitions. I think we will need to see how much
effect that has.
I agree that it does not make sense to create as many slots, if at all
we go by this approach. Suppose the partitioned table is the only one
having a different tuple descriptor, and the rest of the partitions have
the same tuple descriptor. In that case, keep track of unique descriptors,
and allocate a slot per unique descriptor.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 2018-06-29 11:20:53 +0530, Amit Khandekar wrote:
On 27 June 2018 at 18:33, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jun 27, 2018 at 10:39 AM, Andres Freund <andres@anarazel.de> wrote:
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc).
I bumped into this code yesterday while looking at ExecFetchSlotTuple
and ExecMaterializeSlot usages. I was wondering the same.
ExecSetSlotDescriptor() always frees tts_values and tts_isnull array
even if the tuple descriptor being set has same number of attributes
as previous one. Most of the times that will be the case. I think we
should optimize that case.
+1
This doesn't strike me as a great optimization. Any place where we change
descriptors with any regularity, we're doing something wrong or at least
decidedly suboptimal. We shouldn't react to that by optimizing the wrong
thing, we should do the wrong thing less often.
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
+1 for that.
But I am worried that the code will have to create a thousand slots if
there are a thousand partitions. I think we will need to see how much
effect that has.
I agree that it does not make sense to create as many slots, if at all
we go by this approach. Suppose the partitioned table is the only one
having a different tuple descriptor, and the rest of the partitions have
the same tuple descriptor. In that case, keep track of unique descriptors,
and allocate a slot per unique descriptor.
Why? Compared to the rest of the structures created, a slot is not
particularly expensive? I don't see what you're optimizing here.
Greetings,
Andres Freund
On Fri, Jun 29, 2018 at 11:29 AM, Andres Freund <andres@anarazel.de> wrote:
On 2018-06-29 11:20:53 +0530, Amit Khandekar wrote:
On 27 June 2018 at 18:33, Ashutosh Bapat
<ashutosh.bapat@enterprisedb.com> wrote:
On Wed, Jun 27, 2018 at 10:39 AM, Andres Freund <andres@anarazel.de> wrote:
Unfortunately calling ExecSetSlotDescriptor() is far from cheap, it has
to reallocate the values/isnull arrays (and potentially do desc pinning
etc).
I bumped into this code yesterday while looking at ExecFetchSlotTuple
and ExecMaterializeSlot usages. I was wondering the same.
ExecSetSlotDescriptor() always frees tts_values and tts_isnull array
even if the tuple descriptor being set has same number of attributes
as previous one. Most of the times that will be the case. I think we
should optimize that case.
+1
This doesn't strike me as a great optimization. Any place where we change
descriptors with any regularity, we're doing something wrong or at least
decidedly suboptimal. We shouldn't react to that by optimizing the wrong
thing, we should do the wrong thing less often.
I agree with all of that, but I think this tiny optimization can be
done independent of partitioning problem as well.
I think it'd be good to rewrite the code so there's an input and an
output slot that each will keep their slot descriptors set.
+1 for that.
But I am worried that the code will have to create a thousand slots if
there are a thousand partitions. I think we will need to see how much
effect that has.
I agree that it does not make sense to create as many slots, if at all
we go by this approach. Suppose the partitioned table is the only one
having a different tuple descriptor, and the rest of the partitions have
the same tuple descriptor. In that case, keep track of unique descriptors,
and allocate a slot per unique descriptor.
Why? Compared to the rest of the structures created, a slot is not
particularly expensive? I don't see what you're optimizing here.
The size of slot depends upon the number of attributes of the table. A
ten column table will take 80 bytes for datum array and 10 bytes (+
padding) for isnull array, which for a thousand partitions would
translate to 90KB memory. That may be small compared to the relation
cache memory consumed, but it's some memory.
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 2018/06/29 14:59, Andres Freund wrote:
On 2018-06-29 11:20:53 +0530, Amit Khandekar wrote:
On 27 June 2018 at 18:33, Ashutosh Bapat
But I am worried that the code will have to create a thousand slots if
there are a thousand partitions. I think we will need to see how much
effect that has.
I agree that it does not make sense to create as many slots, if at all
we go by this approach. Suppose the partitioned table is the only one
having a different tuple descriptor, and the rest of the partitions have
the same tuple descriptor. In that case, keep track of unique descriptors,
and allocate a slot per unique descriptor.
Why? Compared to the rest of the structures created, a slot is not
particularly expensive? I don't see what you're optimizing here.
What I'm thinking of doing is something that's inspired by one of the
things that David Rowley proposes in his patch for PG 12 to remove
inefficiencies in the tuple routing code [1].
Instead of a single TupleTableSlot attached at partition_tuple_slot, we
allocate an array of TupleTableSlot pointers of the same length as the number
of partitions, as you mentioned upthread. We then call
MakeTupleTableSlot() only if a partition needs it and pass it the
partition's TupleDesc. Allocated slots are remembered in a list.
ExecDropSingleTupleTableSlot is then called on those allocated slots when
the plan ends. Note that the array of slots is not allocated if none of
the partitions affected by a given query (or copy) needed to convert tuples.
Other issues that you mentioned, such as needless heap_tuple_deform/form
being invoked, seem less localized (to me) than this particular issue, so
I created a patch for just this, which is attached with this email. I'm
thinking about how to fix the other issues, but will need a bit more time.
Thanks,
Amit
[1]: /messages/by-id/CAKJS1f_1RJyFquuCKRFHTdcXqoPX-PYqAd7nz=GVBwvGh4a6xA@mail.gmail.com
Attachments:
v1-0001-Allocate-dedicated-slots-of-partition-tuple-conve.patch (text/plain)
From 126ddc16956ae0aac0cc53ce9d522c76cc99706c Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Thu, 28 Jun 2018 15:47:47 +0900
Subject: [PATCH v1] Allocate dedicated slots of partition tuple conversion
Currently there is just one slot called partition_tuple_slot and
we do ExecSetSlotDescriptor every time a tuple is inserted into a
partition that requires tuple conversion. That's excessively
inefficient during bulk inserts.
Fix that by allocating a dedicated slot for each partition that
requires it.
---
src/backend/commands/copy.c | 15 +++++++---
src/backend/executor/execMain.c | 37 ++++++++++++++-----------
src/backend/executor/execPartition.c | 50 ++++++++++++++++++++++++----------
src/backend/executor/nodeModifyTable.c | 26 ++++++++++++------
src/include/executor/execPartition.h | 12 ++++++--
5 files changed, 95 insertions(+), 45 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..4bc5a0199b 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2612,6 +2612,7 @@ CopyFrom(CopyState cstate)
{
int leaf_part_index;
PartitionTupleRouting *proute = cstate->partition_tuple_routing;
+ TupleConversionMap *map;
/*
* Away we go ... If we end up not finding a partition after all,
@@ -2693,10 +2694,16 @@ CopyFrom(CopyState cstate)
* We might need to convert from the parent rowtype to the
* partition rowtype.
*/
- tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
- tuple,
- proute->partition_tuple_slot,
- &slot);
+ map = proute->parent_child_tupconv_maps[leaf_part_index];
+ if (map != NULL)
+ {
+ TupleTableSlot *new_slot;
+
+ Assert(proute->partition_tuple_slots != NULL &&
+ proute->partition_tuple_slots[leaf_part_index] != NULL);
+ new_slot = proute->partition_tuple_slots[leaf_part_index];
+ tuple = ConvertPartitionTupleSlot(map, tuple, new_slot, &slot);
+ }
tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index 969944cc12..9bd5af1ad4 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1933,12 +1933,11 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
/* a reverse map */
map = convert_tuples_by_name(old_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
+ /* Input slot might be of a partition, which has a fixed tupdesc. */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
insertedCols = GetInsertedColumns(resultRelInfo, estate);
@@ -2012,12 +2011,14 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
/* a reverse map */
map = convert_tuples_by_name(orig_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
+ /*
+ * Input slot might be partition-specific, which has a
+ * fixed tupdesc.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
insertedCols = GetInsertedColumns(resultRelInfo, estate);
@@ -2060,12 +2061,14 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
/* a reverse map */
map = convert_tuples_by_name(old_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
+ /*
+ * Input slot might be partition-specific, which has a
+ * fixed tupdesc.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
insertedCols = GetInsertedColumns(resultRelInfo, estate);
@@ -2166,12 +2169,14 @@ ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
/* a reverse map */
map = convert_tuples_by_name(old_tupdesc, tupdesc,
gettext_noop("could not convert row type"));
+ /*
+ * Input slot might be partition-specific, which has a
+ * fixed tupdesc.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
insertedCols = GetInsertedColumns(resultRelInfo, estate);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 7a4665cc4e..6c541ca43b 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -114,17 +114,9 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* We need an additional tuple slot for storing transient tuples that
* are converted to the root table descriptor.
*/
- proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+ proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
}
- /*
- * Initialize an empty slot that will be used to manipulate tuples of any
- * given partition's rowtype. It is attached to the caller-specified node
- * (such as ModifyTableState) and released when the node finishes
- * processing.
- */
- proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
-
i = 0;
foreach(cell, leaf_parts)
{
@@ -687,6 +679,33 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
gettext_noop("could not convert row type"));
/*
+ * Initialize an empty slot that will be used to manipulate tuples of any
+ * this partition's rowtype.
+ */
+ if (proute->parent_child_tupconv_maps[partidx] != NULL)
+ {
+ Relation partrel = partRelInfo->ri_RelationDesc;
+ /*
+ * Initialized the array where these slots are stored, if not already
+ * done.
+ */
+ if (proute->partition_tuple_slots == NULL)
+ proute->partition_tuple_slots = (TupleTableSlot **)
+ palloc0(proute->num_partitions *
+ sizeof(TupleTableSlot *));
+
+ /*
+ * Now initialize the slot itself and add it to the list of allocated
+ * slots.
+ */
+ proute->partition_tuple_slots[partidx] =
+ MakeTupleTableSlot(RelationGetDescr(partrel));
+ proute->partition_tuple_slots_alloced =
+ lappend(proute->partition_tuple_slots_alloced,
+ proute->partition_tuple_slots[partidx]);
+ }
+
+ /*
* If the partition is a foreign table, let the FDW init itself for
* routing tuples to the partition.
*/
@@ -778,8 +797,7 @@ ConvertPartitionTupleSlot(TupleConversionMap *map,
TupleTableSlot *new_slot,
TupleTableSlot **p_my_slot)
{
- if (!map)
- return tuple;
+ Assert(map != NULL && new_slot != NULL);
tuple = do_convert_tuple(tuple, map);
@@ -788,7 +806,6 @@ ConvertPartitionTupleSlot(TupleConversionMap *map,
*/
*p_my_slot = new_slot;
Assert(new_slot != NULL);
- ExecSetSlotDescriptor(new_slot, map->outdesc);
ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
return tuple;
@@ -806,6 +823,7 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
{
int i;
int subplan_index = 0;
+ ListCell *lc;
/*
* Remember, proute->partition_dispatch_info[0] corresponds to the root
@@ -862,8 +880,12 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
/* Release the standalone partition tuple descriptors, if any */
if (proute->root_tuple_slot)
ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
- if (proute->partition_tuple_slot)
- ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
+ foreach(lc, proute->partition_tuple_slots_alloced)
+ {
+ TupleTableSlot *slot = lfirst(lc);
+
+ ExecDropSingleTupleTableSlot(slot);
+ }
}
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index 7e0b867971..542bfe103d 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1128,10 +1128,13 @@ lreplace:;
map_index = resultRelInfo - mtstate->resultRelInfo;
Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
- tuple = ConvertPartitionTupleSlot(tupconv_map,
- tuple,
- proute->root_tuple_slot,
- &slot);
+ if (tupconv_map != NULL)
+ {
+ tuple = ConvertPartitionTupleSlot(tupconv_map,
+ tuple,
+ proute->root_tuple_slot,
+ &slot);
+ }
/*
* Prepare for tuple routing, making it look like we're inserting
@@ -1669,6 +1672,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
int partidx;
ResultRelInfo *partrel;
HeapTuple tuple;
+ TupleConversionMap *map;
/*
* Determine the target partition. If ExecFindPartition does not find a
@@ -1756,10 +1760,16 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
/*
* Convert the tuple, if necessary.
*/
- ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
- tuple,
- proute->partition_tuple_slot,
- &slot);
+ map = proute->parent_child_tupconv_maps[partidx];
+ if (map != NULL)
+ {
+ TupleTableSlot *new_slot;
+
+ Assert(proute->partition_tuple_slots != NULL &&
+ proute->partition_tuple_slots[partidx] != NULL);
+ new_slot = proute->partition_tuple_slots[partidx];
+ ConvertPartitionTupleSlot(map, tuple, new_slot, &slot);
+ }
/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
Assert(mtstate != NULL);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..7fa460398f 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -86,10 +86,15 @@ typedef struct PartitionDispatchData *PartitionDispatch;
* element of this array has the index into the
* corresponding partition in partitions array.
* num_subplan_partition_offsets Length of 'subplan_partition_offsets' array
- * partition_tuple_slot TupleTableSlot to be used to manipulate any
+ * partition_tuple_slots TupleTableSlots to be used to manipulate any
* given leaf partition's rowtype after that
* partition is chosen for insertion by
- * tuple-routing.
+ * tuple-routing; contains valid entry for each
+ * leaf partition whose attribute numbers are
+ * found to differ from the root parent and NULL
+ * otherwise
+ * partition_tuple_slots_alloced List of allocated slots that are stored in
+ * partition_tuple_slots
* root_tuple_slot TupleTableSlot to be used to transiently hold
* copy of a tuple that's being moved across
* partitions in the root partitioned table's
@@ -108,7 +113,8 @@ typedef struct PartitionTupleRouting
bool *child_parent_map_not_required;
int *subplan_partition_offsets;
int num_subplan_partition_offsets;
- TupleTableSlot *partition_tuple_slot;
+ TupleTableSlot **partition_tuple_slots;
+ List *partition_tuple_slots_alloced;
TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;
--
2.11.0
On 2018/06/29 15:23, Amit Langote wrote:
Instead of a single TupleTableSlot attached at partition_tuple_slot, we
allocate an array of TupleTableSlot pointers of the same length as the number
of partitions, as you mentioned upthread. We then call
MakeTupleTableSlot() only if a partition needs it and pass it the
partition's TupleDesc. Allocated slots are remembered in a list.
ExecDropSingleTupleTableSlot is then called on those allocated slots when
the plan ends. Note that the array of slots is not allocated if none of
the partitions affected by a given query (or copy) needed to convert tuples.
Forgot to show the effect the patch has (on my workstation, so the numbers
are a bit noisy).
create table p (a int, b int, c int) partition by range (a);
create table p1 (b int, a int, c int);
alter table p attach partition p1 for values from (1) to (maxvalue);
Note that every row will end up in p1 and will require conversion.
-- 2 million records
copy (select i, i+1, i+2 from generate_series(1, 2000000) i) to
'/tmp/data.csv' csv;
Un-patched:
truncate p;
copy p from '/tmp/data.csv' csv;
COPY 2000000
Time: 8521.308 ms (00:08.521)
truncate p;
copy p from '/tmp/data.csv' csv;
COPY 2000000
Time: 8160.741 ms (00:08.161)
truncate p;
copy p from '/tmp/data.csv' csv;
COPY 2000000
Time: 8389.925 ms (00:08.390)
Patched:
truncate p;
copy p from '/tmp/data.csv' csv;
COPY 2000000
Time: 7716.568 ms (00:07.717)
truncate p;
copy p from '/tmp/data.csv' csv;
COPY 2000000
Time: 7569.224 ms (00:07.569)
truncate p;
copy p from '/tmp/data.csv' csv;
COPY 2000000
Time: 7572.085 ms (00:07.572)
So, there is at least some speedup.
Thanks,
Amit
On 29 June 2018 at 11:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
What I'm thinking of doing is something that's inspired by one of the
things that David Rowley proposes in his patch for PG 12 to remove
inefficiencies in the tuple routing code [1].
Instead of a single TupleTableSlot attached at partition_tuple_slot, we
allocate an array of TupleTableSlot pointers of the same length as the number
of partitions, as you mentioned upthread. We then call
MakeTupleTableSlot() only if a partition needs it and pass it the
partition's TupleDesc. Allocated slots are remembered in a list.
ExecDropSingleTupleTableSlot is then called on those allocated slots when
the plan ends. Note that the array of slots is not allocated if none of
the partitions affected by a given query (or copy) needed to convert tuples.
Thanks for the patch. I did some review of the patch (needs a rebase).
Below are my comments.
@@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo
*resultRelInfo,
+ /* Input slot might be of a partition, which has a fixed tupdesc. */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);
Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map
!= NULL) if condition.
This also applies for similar changes in ExecConstraints() and
ExecWithCheckOptions().
+ * Initialize an empty slot that will be used to manipulate tuples of any
+ * this partition's rowtype.
of any this => of this
+ * Initialized the array where these slots are stored, if not already
Initialized => Initialize
+ proute->partition_tuple_slots_alloced =
+ lappend(proute->partition_tuple_slots_alloced,
+ proute->partition_tuple_slots[partidx]);
Instead of doing the above, I think we can collect those slots in
estate->es_tupleTable using ExecInitExtraTupleSlot() so that they
don't have to be explicitly dropped in ExecCleanupTupleRouting(). And
also then we won't require the new field
proute->partition_tuple_slots_alloced.
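Centralizing cleanup the way this suggestion does amounts to an ownership
pattern: every slot is registered with the executor state at creation, so
generic end-of-plan code can release them all. A self-contained sketch with
toy types (not the real EState or ExecInitExtraTupleSlot()):

```c
#include <stdlib.h>

typedef struct Slot
{
    int          tupdesc_id;
    struct Slot *next;
} Slot;

/* Toy EState: owns every slot created while the plan runs. */
typedef struct EState
{
    Slot *tuple_table;          /* analogue of estate->es_tupleTable */
} EState;

/* Analogue of ExecInitExtraTupleSlot(): create a slot and register it. */
static Slot *
init_extra_slot(EState *estate, int tupdesc_id)
{
    Slot *slot = (Slot *) malloc(sizeof(Slot));

    slot->tupdesc_id = tupdesc_id;
    slot->next = estate->tuple_table;
    estate->tuple_table = slot;
    return slot;
}

/*
 * Generic end-of-plan cleanup frees every registered slot, so the tuple
 * routing code needs no partition_tuple_slots_alloced list of its own.
 */
static void
end_plan(EState *estate)
{
    while (estate->tuple_table != NULL)
    {
        Slot *next = estate->tuple_table->next;

        free(estate->tuple_table);
        estate->tuple_table = next;
    }
}
```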
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
On 29 June 2018 at 11:53, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Other issues that you mentioned, such as needless heap_tuple_deform/form
being invoked, seem less localized (to me) than this particular issue, so
I created a patch for just this, which is attached with this email. I'm
thinking about how to fix the other issues, but will need a bit more time.
I have started a new thread on these tuple forming/deforming issues [1],
where I have also rebased and posted
Allocate-dedicated-slots-of-partition-tuple.patch.

[1]: /messages/by-id/CAJ3gD9fR0wRNeAE8VqffNTyONS_UfFPRpqxhnD9Q42vZB+Jvpg@mail.gmail.com
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Thanks for the review.
On 2018/08/17 15:00, Amit Khandekar wrote:
Thanks for the patch. I did some review of the patch (needs a rebase).
Below are my comments.

@@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+ /* Input slot might be of a partition, which has a fixed tupdesc. */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);

Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map
!= NULL) if condition.
This also applies for similar changes in ExecConstraints() and
ExecWithCheckOptions().
Ah, okay. I guess that means we'll allocate a new slot here only if we
had to switch to a partition-specific slot in the first place.
+ * Initialize an empty slot that will be used to manipulate tuples of any
+ * this partition's rowtype.

of any this => of this

+ * Initialized the array where these slots are stored, if not already
Initialized => Initialize
Fixed.
+ proute->partition_tuple_slots_alloced =
+     lappend(proute->partition_tuple_slots_alloced,
+             proute->partition_tuple_slots[partidx]);

Instead of doing the above, I think we can collect those slots in
estate->es_tupleTable using ExecInitExtraTupleSlot() so that they
don't have to be explicitly dropped in ExecCleanupTupleRouting(). And
also then we won't require the new field
proute->partition_tuple_slots_alloced.
Although I was slightly uncomfortable with the idea at first, thinking that
it's not good for tuple routing specific resources to be released by
generic executor code, it doesn't seem too bad to do it the way you suggest.
Attached updated patch. By the way, I think it might be a good idea to
try to merge this patch with the effort in the following thread.
* Reduce partition tuple routing overheads *
https://commitfest.postgresql.org/19/1690/
Thanks,
Amit
Attachments:
v2-0001-Allocate-dedicated-slots-of-partition-tuple-conve.patch
From a543b15f83f5e1adaef2a46de59022b0b21711e4 Mon Sep 17 00:00:00 2001
From: amit <amitlangote09@gmail.com>
Date: Thu, 28 Jun 2018 15:47:47 +0900
Subject: [PATCH v2] Allocate dedicated slots of partition tuple conversion
Currently there is just one slot called partition_tuple_slot and
we do ExecSetSlotDescriptor every time a tuple is inserted into a
partition that requires tuple conversion. That's excessively
inefficient during bulk inserts.
Fix that by allocating a dedicated slot for each partition that
requires it.
---
src/backend/commands/copy.c | 17 +++++++++----
src/backend/executor/execMain.c | 24 +++++++++++++++---
src/backend/executor/execPartition.c | 45 +++++++++++++++++++++++-----------
src/backend/executor/nodeModifyTable.c | 27 ++++++++++++--------
src/include/executor/execPartition.h | 13 ++++++----
5 files changed, 88 insertions(+), 38 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 9bc67ce60f..f61db3e173 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2690,6 +2690,7 @@ CopyFrom(CopyState cstate)
if (proute)
{
int leaf_part_index;
+ TupleConversionMap *map;
/*
* Away we go ... If we end up not finding a partition after all,
@@ -2864,11 +2865,17 @@ CopyFrom(CopyState cstate)
* partition rowtype. Don't free the already stored tuple as it
* may still be required for a multi-insert batch.
*/
- tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
- tuple,
- proute->partition_tuple_slot,
- &slot,
- false);
+ map = proute->parent_child_tupconv_maps[leaf_part_index];
+ if (map != NULL)
+ {
+ TupleTableSlot *new_slot;
+
+ Assert(proute->partition_tuple_slots != NULL &&
+ proute->partition_tuple_slots[leaf_part_index] != NULL);
+ new_slot = proute->partition_tuple_slots[leaf_part_index];
+ tuple = ConvertPartitionTupleSlot(map, tuple, new_slot, &slot,
+ false);
+ }
tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
}
diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
index c583e020a0..415670ed7e 100644
--- a/src/backend/executor/execMain.c
+++ b/src/backend/executor/execMain.c
@@ -1938,8 +1938,12 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
gettext_noop("could not convert row type"));
if (map != NULL)
{
+ /*
+ * Partition-specific slot's tupdesc can't be changed, so allocate
+ * a new one.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
@@ -2017,8 +2021,12 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
gettext_noop("could not convert row type"));
if (map != NULL)
{
+ /*
+ * Partition-specific slot's tupdesc can't be changed,
+ * so allocate a new one.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
@@ -2065,8 +2073,12 @@ ExecConstraints(ResultRelInfo *resultRelInfo,
gettext_noop("could not convert row type"));
if (map != NULL)
{
+ /*
+ * Partition-specific slot's tupdesc can't be changed,
+ * so allocate a new one.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
@@ -2171,8 +2183,12 @@ ExecWithCheckOptions(WCOKind kind, ResultRelInfo *resultRelInfo,
gettext_noop("could not convert row type"));
if (map != NULL)
{
+ /*
+ * Partition-specific slot's tupdesc can't be
+ * changed, so allocate a new one.
+ */
+ slot = MakeTupleTableSlot(tupdesc);
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
}
}
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index 1a9943c3aa..2af7599152 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -114,17 +114,9 @@ ExecSetupPartitionTupleRouting(ModifyTableState *mtstate, Relation rel)
* We need an additional tuple slot for storing transient tuples that
* are converted to the root table descriptor.
*/
- proute->root_tuple_slot = MakeTupleTableSlot(NULL);
+ proute->root_tuple_slot = MakeTupleTableSlot(RelationGetDescr(rel));
}
- /*
- * Initialize an empty slot that will be used to manipulate tuples of any
- * given partition's rowtype. It is attached to the caller-specified node
- * (such as ModifyTableState) and released when the node finishes
- * processing.
- */
- proute->partition_tuple_slot = MakeTupleTableSlot(NULL);
-
i = 0;
foreach(cell, leaf_parts)
{
@@ -709,6 +701,35 @@ ExecInitRoutingInfo(ModifyTableState *mtstate,
gettext_noop("could not convert row type"));
/*
+ * If partition has different rowtype than root parent, initialize a slot
+ * dedicated to storing this partition's tuples. The slot is used for
+ * various operations that are applied to tuple after routing, such as
+ * checking constraints.
+ */
+ if (proute->parent_child_tupconv_maps[partidx] != NULL)
+ {
+ Relation partrel = partRelInfo->ri_RelationDesc;
+
+ /*
+ * Initialize the array in proute where these slots are stored, if not
+ * already done.
+ */
+ if (proute->partition_tuple_slots == NULL)
+ proute->partition_tuple_slots = (TupleTableSlot **)
+ palloc0(proute->num_partitions *
+ sizeof(TupleTableSlot *));
+
+ /*
+ * Initialize the slot itself setting its descriptor to this
+ * partition's TupleDesc; TupleDesc reference will be released
+ * at the end of the command.
+ */
+ proute->partition_tuple_slots[partidx] =
+ ExecInitExtraTupleSlot(estate,
+ RelationGetDescr(partrel));
+ }
+
+ /*
* If the partition is a foreign table, let the FDW init itself for
* routing tuples to the partition.
*/
@@ -801,8 +822,7 @@ ConvertPartitionTupleSlot(TupleConversionMap *map,
TupleTableSlot **p_my_slot,
bool shouldFree)
{
- if (!map)
- return tuple;
+ Assert(map != NULL && new_slot != NULL);
tuple = do_convert_tuple(tuple, map);
@@ -811,7 +831,6 @@ ConvertPartitionTupleSlot(TupleConversionMap *map,
*/
*p_my_slot = new_slot;
Assert(new_slot != NULL);
- ExecSetSlotDescriptor(new_slot, map->outdesc);
ExecStoreTuple(tuple, new_slot, InvalidBuffer, shouldFree);
return tuple;
@@ -885,8 +904,6 @@ ExecCleanupTupleRouting(ModifyTableState *mtstate,
/* Release the standalone partition tuple descriptors, if any */
if (proute->root_tuple_slot)
ExecDropSingleTupleTableSlot(proute->root_tuple_slot);
- if (proute->partition_tuple_slot)
- ExecDropSingleTupleTableSlot(proute->partition_tuple_slot);
}
/*
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index d8d89c7983..e4b2d374b4 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1161,11 +1161,12 @@ lreplace:;
map_index = resultRelInfo - mtstate->resultRelInfo;
Assert(map_index >= 0 && map_index < mtstate->mt_nplans);
tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
- tuple = ConvertPartitionTupleSlot(tupconv_map,
- tuple,
- proute->root_tuple_slot,
- &slot,
- true);
+ if (tupconv_map != NULL)
+ tuple = ConvertPartitionTupleSlot(tupconv_map,
+ tuple,
+ proute->root_tuple_slot,
+ &slot,
+ true);
/*
* Prepare for tuple routing, making it look like we're inserting
@@ -1703,6 +1704,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
int partidx;
ResultRelInfo *partrel;
HeapTuple tuple;
+ TupleConversionMap *map;
/*
* Determine the target partition. If ExecFindPartition does not find a
@@ -1790,11 +1792,16 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
/*
* Convert the tuple, if necessary.
*/
- ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
- tuple,
- proute->partition_tuple_slot,
- &slot,
- true);
+ map = proute->parent_child_tupconv_maps[partidx];
+ if (map != NULL)
+ {
+ TupleTableSlot *new_slot;
+
+ Assert(proute->partition_tuple_slots != NULL &&
+ proute->partition_tuple_slots[partidx] != NULL);
+ new_slot = proute->partition_tuple_slots[partidx];
+ ConvertPartitionTupleSlot(map, tuple, new_slot, &slot, true);
+ }
/* Initialize information needed to handle ON CONFLICT DO UPDATE. */
Assert(mtstate != NULL);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index f6cd842cc9..c2a2dc563e 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -86,10 +86,13 @@ typedef struct PartitionDispatchData *PartitionDispatch;
* element of this array has the index into the
* corresponding partition in partitions array.
* num_subplan_partition_offsets Length of 'subplan_partition_offsets' array
- * partition_tuple_slot TupleTableSlot to be used to manipulate any
- * given leaf partition's rowtype after that
- * partition is chosen for insertion by
- * tuple-routing.
+ * partition_tuple_slots Array of TupleTableSlot objects; if non-NULL,
+ * contains one entry for every leaf partition,
+ * of which only those of the leaf partitions
+ * whose attribute numbers differ from the root
+ * parent have a non-NULL value. NULL if all of
+ * the partitions encountered by a given command
+ * happen to have same rowtype as the root parent
* root_tuple_slot TupleTableSlot to be used to transiently hold
* copy of a tuple that's being moved across
* partitions in the root partitioned table's
@@ -108,7 +111,7 @@ typedef struct PartitionTupleRouting
bool *child_parent_map_not_required;
int *subplan_partition_offsets;
int num_subplan_partition_offsets;
- TupleTableSlot *partition_tuple_slot;
+ TupleTableSlot **partition_tuple_slots;
TupleTableSlot *root_tuple_slot;
} PartitionTupleRouting;
--
2.11.0
On 20 August 2018 at 14:45, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Thanks for the review.
On 2018/08/17 15:00, Amit Khandekar wrote:
Thanks for the patch. I did some review of the patch (needs a rebase).
Below are my comments.

@@ -1936,12 +1936,11 @@ ExecPartitionCheckEmitError(ResultRelInfo *resultRelInfo,
+ /* Input slot might be of a partition, which has a fixed tupdesc. */
+ slot = MakeTupleTableSlot(tupdesc);
if (map != NULL)
- {
tuple = do_convert_tuple(tuple, map);
- ExecSetSlotDescriptor(slot, tupdesc);
- ExecStoreTuple(tuple, slot, InvalidBuffer, false);
- }
+ ExecStoreTuple(tuple, slot, InvalidBuffer, false);

Both MakeTupleTableSlot() and ExecStoreTuple() can be inside the (map
!= NULL) if condition.
This also applies for similar changes in ExecConstraints() and
ExecWithCheckOptions().

Ah, okay. I guess that means we'll allocate a new slot here only if we
had to switch to a partition-specific slot in the first place.

+ * Initialize an empty slot that will be used to manipulate tuples of any
+ * this partition's rowtype.

of any this => of this

+ * Initialized the array where these slots are stored, if not already
Initialized => Initialize

Fixed.
+ proute->partition_tuple_slots_alloced =
+     lappend(proute->partition_tuple_slots_alloced,
+             proute->partition_tuple_slots[partidx]);

Instead of doing the above, I think we can collect those slots in
estate->es_tupleTable using ExecInitExtraTupleSlot() so that they
don't have to be explicitly dropped in ExecCleanupTupleRouting(). And
also then we won't require the new field
proute->partition_tuple_slots_alloced.

Although I was slightly uncomfortable with the idea at first, thinking that
it's not good for tuple routing specific resources to be released by
generic executor code, it doesn't seem too bad to do it the way you suggest.

Attached updated patch.
Thanks for the changes. The patch looks good to me.
By the way, I think it might be a good idea to
try to merge this patch with the effort in the following thread.

* Reduce partition tuple routing overheads *
https://commitfest.postgresql.org/19/1690/
Yes. I guess, whichever commits later needs to be rebased on the earlier one.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Hi,
On 2018-08-20 16:40:17 +0530, Amit Khandekar wrote:
On 20 August 2018 at 14:45, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
By the way, I think it might be a good idea to
try to merge this patch with the effort in the following thread.

* Reduce partition tuple routing overheads *
https://commitfest.postgresql.org/19/1690/

Yes. I guess, whichever commits later needs to be rebased on the earlier one.
I think I'd rather keep this patch separate, it's pretty simple, so we
should be able to commit it fairly soon.
Greetings,
Andres Freund
On 20 August 2018 at 23:21, Andres Freund <andres@anarazel.de> wrote:
On 2018-08-20 16:40:17 +0530, Amit Khandekar wrote:
On 20 August 2018 at 14:45, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
By the way, I think it might be a good idea to
try to merge this patch with the effort in the following thread.

* Reduce partition tuple routing overheads *
https://commitfest.postgresql.org/19/1690/

Yes. I guess, whichever commits later needs to be rebased on the earlier one.
I think I'd rather keep this patch separate, it's pretty simple, so we
should be able to commit it fairly soon.
+1. I think that patch is already big enough.
I'm perfectly fine with rebasing that if this one gets committed first.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services