Making "COPY partitioned_table FROM" faster
I was looking at the COPY FROM performance gap between bulk loads with
partitioned tables vs non-partitioned tables. There's quite a gap!
Almost twice as slow in my test.
It seems to be mostly down to the lack of use of heap_multi_insert() for
the partitioned table case, which I guess is because we can only do
that into a single heap. I didn't really see any reason not to do it
when the partition for this tuple is the same as the one for the
previous tuple. Such cases may well be quite common, especially so in
time series data stored in RANGE partitioned tables.
I've implemented this in the attached. Performance is much better
when consecutive rows are located in the same partition. I've also
tested the worst case, where the partition changes on each row; that's
now slightly slower. If we're worried about that, we could probably
make the insert method adaptive: only enable multi-inserts once the
partition has remained the same for X consecutive tuples, and revert
to single inserts once the partition has changed on X consecutive
tuples, where X is some number above 1, say 10. I've not done that;
I'm not sure it's worthwhile.
This patch seems fairly simple, only touching copy.c. I think it's a
good candidate for July's 'fest.
src/backend/commands/copy.c | 225 +++++++++++++++++++++++++++++++++++---------
1 file changed, 182 insertions(+), 43 deletions(-)
Benchmarks below:
Setup:
-- non-partitioned (control)
CREATE TABLE partbench_ (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL);
-- 10k parts
CREATE TABLE partbench (date TIMESTAMP NOT NULL, i1 INT NOT NULL, i2
INT NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL)
PARTITION BY RANGE (date);
\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (''' || '2017-03-06'::date + (x::text || '
hours')::interval || ''') TO (''' || '2017-03-06'::date + ((x+1)::text
|| ' hours')::interval || ''');'
from generate_series(0,9999) x;
\gexec
\o
Test:
-- Time loading of 1GB of data.
\timing on
copy partbench_ from program $$perl ~/partbench.pl$$ delimiter '|';
truncate table partbench_;
copy partbench from program $$perl ~/partbench.pl$$ delimiter '|';
truncate table partbench;
copy partbench from program $$perl ~/partbench_alternate.pl$$ delimiter '|';
truncate table partbench;
Unpatched:
postgres=# copy partbench_ from program $$perl ~/partbench.pl$$ delimiter '|';
COPY 17825782
Time: 22669.017 ms (00:22.669)
postgres=# truncate table partbench_;
postgres=# copy partbench from program $$perl ~/partbench.pl$$ delimiter '|';
COPY 17825782
Time: 44095.884 ms (00:44.096)
postgres=# truncate table partbench;
postgres=# copy partbench from program $$perl ~/partbench_alternate.pl$$ delimiter '|';
COPY 17825782
Time: 45129.004 ms (00:45.129)
postgres=# truncate table partbench;
Patched:
postgres=# copy partbench_ from program $$perl ~/partbench.pl$$ delimiter '|';
COPY 17825782
Time: 22701.290 ms (00:22.701)
postgres=# truncate table partbench_;
postgres=# copy partbench from program $$perl ~/partbench.pl$$ delimiter '|';
COPY 17825782
Time: 27721.054 ms (00:27.721)
postgres=# truncate table partbench;
postgres=# copy partbench from program $$perl ~/partbench_alternate.pl$$ delimiter '|';
COPY 17825782
Time: 46151.844 ms (00:46.152)
postgres=# truncate table partbench;
partbench.pl:
for (my $i=0; $i < 8912891; $i++) {
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
}
partbench_alternate.pl:
for (my $i=0; $i < 8912891; $i++) {
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
}
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v1-0001-Allow-multi-inserts-during-COPY-into-a-partitione.patch (application/octet-stream)
From e3df01b93e1e1301244a06c34d5ba4483d0554b8 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 21 Jun 2018 20:59:35 +1200
Subject: [PATCH v1] Allow multi-inserts during COPY into a partitioned table
CopyFrom allows multi-inserts to be used for non-partitioned tables, but
this was disabled for partitioned tables. The reason for this appeared to
be due to the fact that the tuple may not belong to the same partition as
the previous tuple did. Not allowing multi-inserts here greatly slowed
down imports into partitioned tables. These could take twice as long as a
copy to an equivalent non-partitioned table. It seems wise to do something
about this, so this commit allows the multi-inserts by flushing the so-far
inserted tuples to the partition when the next tuple does not belong to the
same partition, or when the buffer fills. This improves performance when
the next tuple in the stream commonly belongs to the same partition as the
previous tuple.
This required a bit of work around the per-tuple memory contexts as we must
flush the tuples when the next tuple does not belong the same partition.
In which case there is no good time to reset the per-tuple context, as
we've already built the new tuple by this time. In order to work around
this we maintain two per-tuple contexts and just switch between them every
time the partition changes and reset the old one. This does mean that the
first of each batch of tuples is not allocated in the same memory context
as the others, but that does not matter since we only reset the context
once the previous batch has been inserted.
---
src/backend/commands/copy.c | 225 +++++++++++++++++++++++++++++++++++---------
1 file changed, 182 insertions(+), 43 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..6143a2ad69 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -47,6 +47,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/partcache.h"
#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/rls.h"
@@ -79,6 +80,16 @@ typedef enum EolType
EOL_CRNL
} EolType;
+/*
+ * Represents the heap insert method to be used during COPY to.
+ */
+typedef enum CopyInsertMethod
+{
+ CIM_SINGLE, /* use heap_insert or fdw routine */
+ CIM_MULTI, /* always use heap_multi_insert */
+ CIM_MULTI_PARTITION /* use heap_multi_insert only if valid */
+} CopyInsertMethod;
+
/*
* This struct contains all the state variables used throughout a COPY
* operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2312,14 +2323,19 @@ CopyFrom(CopyState cstate)
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
+ PartitionTupleRouting *proute = NULL;
+ ExprContext *secondaryExprContext = NULL;
ErrorContextCallback errcallback;
CommandId mycid = GetCurrentCommandId(true);
int hi_options = 0; /* start with default heap_insert options */
BulkInsertState bistate;
+ CopyInsertMethod insertMethod;
uint64 processed = 0;
- bool useHeapMultiInsert;
int nBufferedTuples = 0;
int prev_leaf_part_index = -1;
+ bool has_before_insert_row_trig;
+ bool has_instead_insert_row_trig;
+ bool part_can_multi_insert = false;
#define MAX_BUFFERED_TUPLES 1000
HeapTuple *bufferedTuples = NULL; /* initialize to silence warning */
@@ -2504,8 +2520,6 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionTupleRouting *proute;
-
proute = cstate->partition_tuple_routing =
ExecSetupPartitionTupleRouting(NULL, cstate->rel);
@@ -2527,23 +2541,69 @@ CopyFrom(CopyState cstate)
* expressions. Such triggers or expressions might query the table we're
* inserting to, and act differently if the tuples that have already been
* processed and prepared for insertion are not there. We also can't do
- * it if the table is foreign or partitioned.
+ * bulk inserts with foreign tables or for partitioned tables with a
+ * statement level insert trigger. It might be possible to allow
+ * partitioned tables with such triggers in the future, but for now
+ * CopyFromInsertBatch expects that any BR insert and statement level
+ * insert triggers are on the same relation.
+ *
+ * Note: It does not matter if any partitions have any volatile default
+ * expression as we use the defaults from the target of the COPY command.
*/
if ((resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
+ (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
+ resultRelInfo->ri_TrigDesc->trig_insert_new_table) ||
resultRelInfo->ri_FdwRoutine != NULL ||
- cstate->partition_tuple_routing != NULL ||
cstate->volatile_defexprs)
{
- useHeapMultiInsert = false;
+ insertMethod = CIM_SINGLE;
}
else
{
- useHeapMultiInsert = true;
+ /*
+ * For partitioned tables, we may still be able to perform bulk
+ * inserts for sets of consecutive tuples which belong to the same
+ * partition. However, the possibility of this depends on which
+ * types of triggers exist on the partition. We must disable bulk
+ * inserts if the partition is a foreign table or it has any BR insert
+ * or insert instead triggers (same as we checked above for the parent
+ * table). Since the partition's resultRelInfos are initialized only
+ * when we actually need to insert the first tuple into them, we must
+ * have the intermediate insert method of CIM_MULTI_PARTITION to flag
+ * that we must later determine if we can use bulk-inserts for the
+ * partition being inserted into.
+ *
+ * Normally, when performing bulk inserts we just flush the insert
+ * buffer whenever it becomes full, but for the partitioned table
+ * case, we flush it whenever the current tuple does not belong to
+ * the same partition as the previous tuple, and since we flush the
+ * previous partitions buffer once the new tuple has already been
+ * built, we're unable to reset the estate since we'd free the memory
+ * which the new tuple is stored. To work around this we maintain a
+ * secondary expression context and alternate between these when the
+ * partition changes. This does mean we do store the first new tuple
+ * in a different context than subsequent tuples, but that does not
+ * matter, providing we don't free anything while it's still needed.
+ */
+ if (proute)
+ {
+ insertMethod = CIM_MULTI_PARTITION;
+ secondaryExprContext = CreateExprContext(estate);
+ }
+ else
+ insertMethod = CIM_MULTI;
+
bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
}
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
/*
* Check BEFORE STATEMENT insertion triggers. It's debatable whether we
* should do this for COPY, since it's not really an "INSERT" statement as
@@ -2608,11 +2668,9 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_tuple_routing)
+ if (proute)
{
int leaf_part_index;
- PartitionTupleRouting *proute = cstate->partition_tuple_routing;
-
/*
* Away we go ... If we end up not finding a partition after all,
* ExecFindPartition() does not return and errors out instead.
@@ -2628,31 +2686,94 @@ CopyFrom(CopyState cstate)
Assert(leaf_part_index >= 0 &&
leaf_part_index < proute->num_partitions);
- /*
- * If this tuple is mapped to a partition that is not same as the
- * previous one, we'd better make the bulk insert mechanism gets a
- * new buffer.
- */
+ saved_resultRelInfo = resultRelInfo;
+
if (prev_leaf_part_index != leaf_part_index)
{
+ /*
+ * When performing bulk-inserts into partitioned tables we
+ * must insert the tuples seen so far to the heap whenever the
+ * partition changes. This might seem wasteful in cases where
+ * the partition changes on each tuple, but in cases where
+ * we're bulk loading into a single partition or the data
+ * being loaded is ordered in partition order then performance
+ * gains can easily be seen.
+ */
+ if (nBufferedTuples > 0)
+ {
+ ExprContext *swapcontext;
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ nBufferedTuples = 0;
+ bufferedTuplesSize = 0;
+
+ Assert(secondaryExprContext);
+
+ /*
+ * Normally we reset the per-tuple context whenever the
+ * bufferedTuples array is empty at the beginning of the
+ * loop, however, it is possible since we flush the buffer
+ * here that the buffer is never empty at the start of the
+ * loop. To prevent the per-tuple context from never
+ * being reset we maintain a second context and alternate
+ * between them when the partition changes. We can now
+ * reset secondaryExprContext as this is no longer needed,
+ * since we just flushed any tuples stored in it. We also
+ * now switch over to the other context. This does mean
+ * that the first tuple in the buffer won't be in the same
+ * context as the others, but that does not matter since
+ * we only reset it after the flush.
+ */
+ ReScanExprContext(secondaryExprContext);
+
+ swapcontext = secondaryExprContext;
+ secondaryExprContext = estate->es_per_tuple_exprcontext;
+ estate->es_per_tuple_exprcontext = swapcontext;
+ }
+
+ /*
+ * Save the old ResultRelInfo and switch to the one
+ * corresponding to the selected partition.
+ */
+ resultRelInfo = proute->partitions[leaf_part_index];
+ if (resultRelInfo == NULL)
+ {
+ resultRelInfo = ExecInitPartitionInfo(mtstate,
+ saved_resultRelInfo,
+ proute, estate,
+ leaf_part_index);
+ proute->partitions[leaf_part_index] = resultRelInfo;
+ Assert(resultRelInfo != NULL);
+ }
+
+ /* Determine which triggers exist on this partition */
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
+ /* Check if we can multi-insert into this partition */
+ part_can_multi_insert = insertMethod == CIM_MULTI_PARTITION &&
+ !has_before_insert_row_trig &&
+ !has_instead_insert_row_trig &&
+ !resultRelInfo->ri_FdwRoutine;
+
+ /*
+ * We'd better make the bulk insert mechanism gets a new
+ * buffer when the partition being inserted into changes.
+ */
ReleaseBulkInsertStatePin(bistate);
prev_leaf_part_index = leaf_part_index;
}
-
- /*
- * Save the old ResultRelInfo and switch to the one corresponding
- * to the selected partition.
- */
- saved_resultRelInfo = resultRelInfo;
- resultRelInfo = proute->partitions[leaf_part_index];
- if (resultRelInfo == NULL)
- {
- resultRelInfo = ExecInitPartitionInfo(mtstate,
- saved_resultRelInfo,
- proute, estate,
- leaf_part_index);
- Assert(resultRelInfo != NULL);
- }
+ else
+ resultRelInfo = proute->partitions[leaf_part_index];
/*
* For ExecInsertIndexTuples() to work on the partition's indexes
@@ -2665,8 +2786,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->transition_capture != NULL)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
/*
* If there are any BEFORE triggers on the partition,
@@ -2704,8 +2824,7 @@ CopyFrom(CopyState cstate)
skip_tuple = false;
/* BEFORE ROW INSERT Triggers */
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
@@ -2717,8 +2836,7 @@ CopyFrom(CopyState cstate)
if (!skip_tuple)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)
+ if (has_instead_insert_row_trig)
{
/* Pass the data to the INSTEAD ROW INSERT trigger */
ExecIRInsertTriggers(estate, resultRelInfo, slot);
@@ -2745,7 +2863,12 @@ CopyFrom(CopyState cstate)
resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
ExecPartitionCheck(resultRelInfo, slot, estate, true);
- if (useHeapMultiInsert)
+ /*
+ * Perform multi-inserts when enabled, or when loading a
+ * partitioned table that can support multi-inserts as
+ * determined above.
+ */
+ if (insertMethod == CIM_MULTI || part_can_multi_insert)
{
/* Add this tuple to the tuple buffer */
if (nBufferedTuples == 0)
@@ -2835,10 +2958,24 @@ next_tuple:
/* Flush any remaining buffered tuples */
if (nBufferedTuples > 0)
- CopyFromInsertBatch(cstate, estate, mycid, hi_options,
- resultRelInfo, myslot, bistate,
- nBufferedTuples, bufferedTuples,
- firstBufferedLineNo);
+ {
+ if (insertMethod == CIM_MULTI_PARTITION)
+ {
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
+ else
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ resultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
/* Done, clean up */
error_context_stack = errcallback.previous;
@@ -2907,6 +3044,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
MemoryContext oldcontext;
int i;
uint64 save_cur_lineno;
+ bool line_buf_valid = cstate->line_buf_valid;
/*
* Print error context information correctly, if one of the operations
@@ -2920,7 +3058,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
* before calling it.
*/
oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
- heap_multi_insert(cstate->rel,
+ heap_multi_insert(resultRelInfo->ri_RelationDesc,
bufferedTuples,
nBufferedTuples,
mycid,
@@ -2967,7 +3105,8 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
}
}
- /* reset cur_lineno to where we were */
+ /* reset cur_lineno and line_buf_valid to what they were */
+ cstate->line_buf_valid = line_buf_valid;
cstate->cur_lineno = save_cur_lineno;
}
--
2.16.2.windows.1
The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: not tested
Documentation: not tested
This patch applied cleanly and worked as expected.
Patch description:
This patch implements the use of heap_multi_insert() for partition tables when using the COPY FROM command, benefiting the performance of COPY FROM in cases in which multiple tuples in the file are destined for the same partition.
Tests:
All tests passed make check-world and TAP tests
Functionality:
Specifically, we tried the cases in the bottom section "Exercising the code" and found all behavior as described and expected.
Note that we did conduct performance testing for our test cases enumerated in this section using COPY FROM PROGRAM comparing the patched and unpatched code using one example with tuples all destined for the same partition and one example with tuples destined for alternating partitions. We saw a similar performance improvement to the one recorded by David in his email, including a decrease in performance for copying data with tuples alternating between destination partitions as compared to the unpatched code.
However, in our cases below, we focus on exercising the code added as opposed to on performance.
Feature feedback from a user perspective:
There will be two differences in performance that may be perceptible to the user after the inclusion of this patch:
1) the relatively small decrease in performance (as compared to master) for copying data from a file or program in which the destination partition changes frequently
2) the stark contrast between the performance of copying data destined for the same partition and data destined for alternating partitions
Based on the numbers in David's email, the performance difference is:
27721.054 ms patched for copying data destined for the same partition vs 46151.844 ms patched for copying data destined for two different partitions alternating each row
Because both differences could impact users during data loading, it is worth considering making this transparent to the user in some way. Since the effect will be noticeable, and since it is within the user's power to change their data copying strategy given this information, it might be prudent either to create a GUC allowing the user to disable heap_multi_insert for COPY FROM (if the user knows the data being copied is very varied) or to add a note to the docs that, when using COPY FROM for a partition table, it is advisable to sort the data so as to maximize the number of consecutive tuples destined for the same partition.
Code Review:
Naming of newly introduced enum:
The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially confusing, since, for a partition table with BEFORE or INSTEAD INSERT triggers on child partitions, only heap_insert is valid, requiring the additional variable part_can_multi_insert to determine whether heap_multi_insert can be used.
It might be more clear to name it something that indicates it is a candidate for multi-insertion, pending further review. This decouples the state of potentially applicable multi-insertion from the table being a partition table. In the future, perhaps, some other feature might make a table conditionally eligible for multi-insertion of tuples, so, it may be desirable to use this intermediate state for other kinds of tables as well.
Alternatively, it might be clearer if the variable part_can_multi_insert were explicitly tied to a leaf partition, since this can change from leaf to leaf; maybe name it leaf_part_can_multi_insert.
Code comment potential typo corrections:
In the following comment, in the patched code line 2550
/* ...
* Note: It does not matter if any partitions have any volatile default
* expression as we use the defaults from the target of the COPY command.
*/
It seems like "volatile default expression" should be "volatile default expressions"
In the following comment, in the patched code starting on line 2582
/* ...
* the same partition as the previous tuple, and since we flush the
* previous partitions buffer once the new tuple has already been
* built, we're unable to reset the estate since we'd free the memory
* which the new tuple is stored. To work around this we maintain a
* secondary expression context and alternate between these when the
... */
"previous partitions" should probably be "previous partition's" and
"since we'd free the memory which the new tuple is stored" should be "since we'd free the memory in which the new tuple is stored"
In the following comment, in the patched code starting on line 2769
/*
* We'd better make the bulk insert mechanism gets a new
* buffer when the partition being inserted into changes.
*/
"We'd better make the bulk insert mechanism..." should be "We'd better make sure the bulk insert mechanism..."
Code comment clarification 1:
In the patched code, starting on line 2553, the addition of the new OR condition makes this if statement very difficult to parse as a reader. It would be helpful if the comment above it were partially inlined into the if statement, e.g.
if (
/* relation has either a row-level insert trigger or an insert instead of trigger */
(resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
/* relation is a partition table with a statement-level insert trigger */
(proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
resultRelInfo->ri_TrigDesc->trig_insert_new_table) ||
/* relation is a foreign table or has volatile default expressions */
resultRelInfo->ri_FdwRoutine != NULL ||
cstate->volatile_defexprs
)
Code comment clarification 2:
Starting with patched code line 2689 and ending on line 2766, there are several comments we found confusing:
In the following comment, in the patched code starting on line 2740,
/*
* Save the old ResultRelInfo and switch to the one
* corresponding to the selected partition.
*/
upon initially reading this comment, we thought "saving the old ResultRelInfo" meant saving the relinfo of the partition from which we are switching; however, that seems to happen above on line 2691, and the partition from which we are switching is not the sibling partition but the parent partition, so we do not, in fact, need to "switch" the resultRelInfo from one sibling to another. Related to this, on walking through the code with the partition table described in the DDL below (table foo(a int) in the section "Exercising the code")
and copying the following data from a file: 4 4 7 7 11 11 7 7 4 4, we found that the variable saved_resultRelInfo first set on line 2689
saved_resultRelInfo = resultRelInfo;
is always the parent partition rather than a sibling. Perhaps saved_resultRelInfo could be named something like parent_resultRelInfo, if this is a correct understanding of the code.
Or, if this is an incorrect understanding of the code, perhaps a comment could clarify this.
In fact, on line 2744, it appears we are "saving" the information for the leaf partition for which the current tuple is destined into proute->partitions[]. It would be helpful to clarify this, perhaps with a comment to the effect of "if the destination partition has not already had its information initialized and saved into the proute->partitions list, do so now".
So, overall, we feel that the code from lines 2689 to 2691 and from 2740 to 2766 could use further clarification with regard to switching from parent to child partition and from sibling to sibling partition, as well as regarding saving relinfo into proute->partitions[] for partitions that have not been seen before.
Exercising the code:
-- DDL to create a partition table
CREATE TABLE foo(a int) partition by range (a);
CREATE TABLE foo_prt1 partition of foo for values from (1) to (5);
CREATE TABLE foo_prt2 partition of foo for values from (5) to (10);
CREATE TABLE foo_prt_default partition of foo DEFAULT;
-- Case 1
-- a simple partition table
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the insertMethod value is correctly set to CIM_MULTI_PARTITION
-- the value of part_can_multi_insert is true for each of the leaf partitions
-- heap_multi_insert is called to insert data into these leaf partitions
-- Case 2
-- a partition table with a statement level insert trigger
-- DDL to create the statement level insert trigger
CREATE OR REPLACE FUNCTION parent_stmt_insert_function() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER parent_stmt_trig
AFTER INSERT ON foo
REFERENCING NEW TABLE AS newtable
FOR EACH STATEMENT execute procedure parent_stmt_insert_function();
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the value of insertMethod is correctly set to CIM_SINGLE due to the statement level insert trigger
-- heap_insert is called to insert data into these leaf partitions
-- Case 3
-- a partition table with a row-level before insert trigger on exactly one child partition out of 3
-- DDL to create the row level before insert trigger on one child partition which executes a function that inserts data into default partition
DROP FUNCTION IF EXISTS parent_stmt_insert_function CASCADE;
CREATE OR REPLACE FUNCTION child_row_insert_function() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER child_row_trig
BEFORE INSERT ON foo_prt1
FOR EACH ROW EXECUTE PROCEDURE child_row_insert_function();
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the insertMethod value is set to CIM_MULTI_PARTITION
-- the value of part_can_multi_insert is false for foo_prt1, true for foo_prt2, and true for foo_prt_default
-- heap_multi_insert is called for foo_prt2 and foo_prt_default (for the rows inserted by COPY)
-- heap_insert is called for foo_prt1
-- heap_insert is called for the rows inserted by our row-level trigger into foo_prt_default
-- Case 4
-- a partition table with a row-level after insert trigger on the root partition
-- DDL to create the row-level after insert trigger on the root
DROP FUNCTION IF EXISTS parent_stmt_insert_function CASCADE;
DROP FUNCTION IF EXISTS child_row_insert_function CASCADE;
CREATE OR REPLACE FUNCTION parent_row_insert_function() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER parent_row_trig
AFTER INSERT ON foo
FOR EACH ROW execute procedure parent_row_insert_function();
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the insertMethod is CIM_MULTI_PARTITION
-- foo_prt1, foo_prt2, and foo_prt_default have part_can_multi_insert as true
Thanks,
Melanie & Karen
The new status of this patch is: Waiting on Author
(This email seemed to only come to me and somehow missed the mailing
list. Including the message here in its entirety)
On 20 July 2018 at 13:26, Karen Huddleston <khuddleston@pivotal.io> wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, passed
> Implements feature: tested, passed
> Spec compliant: not tested
> Documentation: not tested
>
> This patch applied cleanly and worked as expected.
>
> Patch description:
> This patch implements the use of heap_multi_insert() for partition tables when using the COPY FROM command, benefiting the performance of COPY FROM in cases in which multiple tuples in the file are destined for the same partition.
>
> Tests:
> All tests passed make check-world and TAP tests
>
> Functionality:
> Specifically, we tried the cases in the bottom section "Exercising the code" and found all behavior as described and expected.
> Note that we did conduct performance testing for our test cases enumerated in this section using COPY FROM PROGRAM comparing the patched and unpatched code using one example with tuples all destined for the same partition and one example with tuples destined for alternating partitions. We saw a similar performance improvement to the one recorded by David in his email, including a decrease in performance for copying data with tuples alternating between destination partitions as compared to the unpatched code.
> However, in our cases below, we focus on exercising the code added as opposed to on performance.
>
> Feature feedback from a user perspective:
> There will be two differences in performance that may be perceptible to the user after the inclusion of this patch:
> 1) the relatively small decrease in performance (as compared to master) for copying data from a file or program in which the destination partition changes frequently
> 2) the stark contrast between the performance of copying data destined for the same partition and data destined for alternating partitions
> Based on the numbers in David's email the performance difference
> 27721.054 ms patched for copying data destined for the same partition vs 46151.844 ms patched for copying data destined for two different partitions alternating each row
> Because both differences could impact users in data loading, it is worth considering making this transparent to the user in some way. Because this will be noticeable to the user, and, furthermore, because it is potentially within the power of the user to make changes to their data copying strategy given this information, it might be prudent to either create a GUC allowing the user to disable heap_multi_insert for COPY FROM (if the user knows the data he/she is copying over is very varied) or to add a comment to the docs that when using COPY FROM for a partition table, it is advisable to sort the data in some manner which would optimize the number of consecutive tuples destined for the same partition.
I'd rather steer clear of any new GUCs. Instead, in the patch I've
attached, I've included some adaptive code which chooses whether or not
to use multi-inserts based on the average number of tuples per batch.
I also did some tests, and it does appear that the benefits of
multi-inserts come pretty early. So I ended up coding it to enable
multi-inserts when the average tuples per batch is at least 1.3. I
originally guessed 1.5, but one of my tests was faster with
multi-inserts and the average batch size there was 1.33.
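The adaptive heuristic can be modelled as follows. This is a simplified Python sketch of the logic in the v2 patch: the 1000-line recheck interval and the 1.3 cutoff come from the patch, while the function and variable names here are illustrative only:

```python
RECHECK_MULTI_INSERT_THRESHOLD = 1000   # lines between average recalculations
AVG_TUPLES_PER_BATCH_CUTOFF = 1.3       # enable multi-inserts at or above this

def copy_stream(partition_ids):
    """Walk a stream of per-tuple partition ids, recording at each partition
    change whether multi-inserts would currently be enabled."""
    avg_tuples_per_part_change = 0.0
    n_partition_changes = 0
    last_sample_lineno = 0
    prev_part = None
    decisions = []
    for lineno, part in enumerate(partition_ids, start=1):
        if part != prev_part:
            n_partition_changes += 1
            # Recalculate the average periodically rather than over the whole
            # copy, so the decision adapts to local behaviour of the stream.
            if last_sample_lineno <= lineno - RECHECK_MULTI_INSERT_THRESHOLD:
                avg_tuples_per_part_change = (
                    (lineno - last_sample_lineno) / n_partition_changes)
                last_sample_lineno = lineno
                n_partition_changes = 0
            decisions.append(
                avg_tuples_per_part_change >= AVG_TUPLES_PER_BATCH_CUTOFF)
            prev_part = part
    return decisions
```

For a stream that alternates between two partitions on every row, the average settles at 1.0 and multi-inserts stay off; for long runs per partition it climbs well above 1.3 and multi-inserts switch on after the first sample window.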
Code Review:
Naming of newly introduced enum:
The CopyInsertMethod enum value CIM_MULTI_PARTITION is potentially confusing, since, for a partitioned table with BEFORE or INSTEAD OF INSERT triggers on child partitions, only heap_insert is valid, requiring the additional variable part_can_multi_insert to determine whether heap_multi_insert can be used or not.
It might be more clear to name it something that indicates it is a candidate for multi-insertion, pending further review. This decouples the state of potentially applicable multi-insertion from the table being a partition table. In the future, perhaps, some other feature might make a table conditionally eligible for multi-insertion of tuples, so, it may be desirable to use this intermediate state for other kinds of tables as well.
I've renamed this to CIM_MULTI_CONDITIONAL.
Alternatively, it might be clearer if the variable part_can_multi_insert clearly referred to a leaf partition, since its value can change from leaf to leaf; it could perhaps be named leaf_part_can_multi_insert.
Renamed to leafpart_use_multi_insert. I changed from "can" to "use"
because the adaptive code might just choose not to use multi-inserts,
even when it actually "can".
Code comment potential typo corrections:
In the following comment, in the patched code line 2550
/* ...
* Note: It does not matter if any partitions have any volatile default
* expression as we use the defaults from the target of the COPY command.
*/
It seems like "volatile default expression" should be "volatile default expressions"
Changed.
In the following comment, in the patched code starting on line 2582
/* ...
* the same partition as the previous tuple, and since we flush the
* previous partitions buffer once the new tuple has already been
* built, we're unable to reset the estate since we'd free the memory
* which the new tuple is stored. To work around this we maintain a
* secondary expression context and alternate between these when the
... */
"previous partitions" should probably be "previous partition's" and
"since we'd free the memory which the new tuple is stored" should be "since we'd free the memory in which the new tuple is stored"
Changed
In the following comment, in the patched code starting on line 2769
/*
* We'd better make the bulk insert mechanism gets a new
* buffer when the partition being inserted into changes.
*/
"We'd better make the bulk insert mechanism..." should be "We'd better make sure the bulk insert mechanism..."
I didn't change this, as the code is not really making sure of that.
It's actually doing it. The comment existed like that before the
patch. I just moved it.
Code comment clarification 1:
In the patched code, starting on line 2553, the addition of the new OR condition makes this if statement very difficult to parse as a reader. It would be helpful to the reader if the comment above it was partially inlined in the if statement--e.g.
if (
/* relation has either a row-level insert trigger or an insert instead of trigger */
(resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
/* relation is a partition table with a statement-level insert trigger */
(proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
resultRelInfo->ri_TrigDesc->trig_insert_new_table) ||
/* relation is a foreign table or has volatile default expressions */
resultRelInfo->ri_FdwRoutine != NULL ||
cstate->volatile_defexprs
)
hmm, yeah. I'm not much of a fan of that either. I ended up breaking
it out into multiple if/else if/else branches; it just means we set
insertMethod to CIM_SINGLE in 3 separate places. I think that's okay.
I imagine the compiler will optimise it correctly, but I didn't check.
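The restructured decision looks roughly like this. A sketch in Python rather than C, with the enum values taken from the patch; the parameter names are stand-ins for the checks CopyFrom performs on the target relation:

```python
CIM_SINGLE = "CIM_SINGLE"                        # heap_insert / FDW routine
CIM_MULTI = "CIM_MULTI"                          # always heap_multi_insert
CIM_MULTI_CONDITIONAL = "CIM_MULTI_CONDITIONAL"  # decide per leaf partition

def choose_insert_method(has_before_or_instead_row_trig,
                         is_partitioned,
                         has_insert_new_table_trig,
                         is_foreign_or_has_volatile_defaults):
    if has_before_or_instead_row_trig:
        # Triggers might query the table and see buffered-but-uninserted rows.
        return CIM_SINGLE
    elif is_partitioned and has_insert_new_table_trig:
        # CopyFromInsertBatch expects BR insert and statement-level insert
        # triggers to be on the same relation.
        return CIM_SINGLE
    elif is_foreign_or_has_volatile_defaults:
        # Foreign tables and volatile default expressions rule it out too.
        return CIM_SINGLE
    elif is_partitioned:
        # Multi-insert eligibility must be rechecked per leaf partition.
        return CIM_MULTI_CONDITIONAL
    else:
        return CIM_MULTI
```

Each CIM_SINGLE branch carries its own comment, which is the readability win over one large OR condition.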
Code comment clarification 2:
Starting with patched code line 2689 and ending on line 2766, there are several comments we found confusing.
In the following comment, in the patched code starting on line 2740,
/*
* Save the old ResultRelInfo and switch to the one
* corresponding to the selected partition.
*/
upon initially reading this comment, we thought "saving the old ResultRelInfo" meant saving the relinfo of the partition from which we are switching. However, it seems that that happens above on line 2691, that the partition from which we are switching is the parent rather than a sibling partition, and that we do not, in fact, need to "switch" the resultRelInfo from one sibling to another. Related to this, on walking through the code with the partition table described in the DDL below (table foo(a int) in the section "Exercising the code")
and copying the following data from a file: 4 4 7 7 11 11 7 7 4 4, we found that the variable saved_resultRelInfo first set on line 2689
saved_resultRelInfo = resultRelInfo;
is always the parent partition rather than a sibling. Perhaps saved_resultRelInfo could be named something like parent_resultRelInfo, if this is a correct understanding of the code.
Or, if this is an incorrect understanding of the code, perhaps a comment could clarify this.
In fact, on line 2744, it appears we are "saving" the information for the leaf partition for which the current tuple is destined into proute->partitions[]. It would be helpful to clarify this, perhaps with a comment to the effect of "if the destined partition has not already had its information initialized and saved into the proute->partitions list, do so now".
Yeah, that's existing code and I agree it's a bit confusing and
actually a bit convoluted. I've simplified it a bit and renamed the
saved_resultRelInfo to target_resultRelInfo, where target means the
target of the COPY command. This stays fixed and the "resultRelInfo"
variable changes to point to the various partitions. This saves
having to perform the restore from the saved_resultRelInfo at the end
of the loop. This means I could also get rid of the goto nexttuple
code. The goto can just continue to the start of the loop, since
there's nothing to do at the end of the loop now.
This also meant that I had to change the if test above the
ExecPartitionCheck() call. Previously the saved_resultRelInfo there
was testing if we're doing tuple routing into a partition. Using
proute instead seems to be more clear, so I changed it to that. I also
changed the test for the trigger there to use the bool that I'd set
above.
So, overall, we feel that the code from lines 2689 until 2691 and from 2740 to 2766 could use further clarification with regard to switching from parent to child partition and sibling to sibling partition as well as regarding saving relinfo for partitions that have not been seen before in the proute->partitions[]
I hope it's more clear now that I've got rid of saved_resultRelInfo.
Exercising the code:
-- DDL to create a partition table
CREATE TABLE foo(a int) partition by range (a);
CREATE TABLE foo_prt1 partition of foo for values from (1) to (5);
CREATE TABLE foo_prt2 partition of foo for values from (5) to (10);
CREATE TABLE foo_prt_default partition of foo DEFAULT;

-- Case 1
-- a simple partition table
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the insertMethod value is correctly set to CIM_MULTI_PARTITION
-- the value of part_can_multi_insert is true for each of the leaf partitions
-- heap_multi_insert is called to insert data into these leaf partitions

-- Case 2
-- a partition table with a statement level insert trigger
-- DDL to create the statement level insert trigger
CREATE OR REPLACE FUNCTION parent_stmt_insert_function() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER parent_stmt_trig
AFTER INSERT ON foo
REFERENCING NEW TABLE AS newtable
FOR EACH STATEMENT execute procedure parent_stmt_insert_function();

-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the value of insertMethod is correctly set to CIM_SINGLE due to the statement level insert trigger
-- heap_insert is called to insert data into these leaf partitions

-- Case 3
-- a partition table with a row-level before insert trigger on exactly one child partition out of 3
-- DDL to create the row level before insert trigger on one child partition which executes a function that inserts data into default partition
DROP FUNCTION IF EXISTS parent_stmt_insert_function CASCADE;
CREATE OR REPLACE FUNCTION child_row_insert_function() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER child_row_trig
BEFORE INSERT ON foo_prt1
FOR EACH ROW EXECUTE PROCEDURE child_row_insert_function();
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the insertMethod value is set to CIM_MULTI_PARTITION
-- the value of part_can_multi_insert is false for foo_prt1, true for foo_prt2, and true for foo_prt_default
-- heap_multi_insert is called for foo_prt2 and foo_prt_default (for the rows inserted by COPY)
-- heap_insert is called for foo_prt1
-- heap_insert is called for the rows inserted by our row-level trigger into foo_prt_default

-- Case 4
-- a partition table with a row-level after insert trigger on the root partition
-- DDL to create the row-level after insert trigger on the root
DROP FUNCTION IF EXISTS parent_stmt_insert_function CASCADE;
DROP FUNCTION IF EXISTS child_row_insert_function CASCADE;
CREATE OR REPLACE FUNCTION parent_row_insert_function() RETURNS TRIGGER AS $$
BEGIN
RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER parent_row_trig
AFTER INSERT ON foo
FOR EACH ROW execute procedure parent_row_insert_function();
-- copying a file containing the following values for foo(a) : 4 4 7 7 11 11 7 7 4 4 in file 'f.csv'
COPY foo FROM 'f.csv';
-- we see that the insertMethod is CIM_MULTI_PARTITION
-- foo_prt1, foo_prt2, and foo_prt_default have part_can_multi_insert as true

Thanks,
Melanie & Karen

The new status of this patch is: Waiting on Author
AWS m5d.large (times in milliseconds)
master v1 master vs v1
Control 26265 25201 96.0%
All 42667 30473 71.4%
Avg 4 42393 36658 86.5%
Avg 2.0 43775 39844 91.0%
Avg 2.0 43493 40305 92.7%
Avg 1.33 43219 43717 101.2%
Avg 1.0 44993 47616 105.8%
master v2 master vs v2
Control 26265 25421 96.8%
All 42667 30164 70.7%
Avg 4 42393 34288 80.9%
Avg 2.0 43775 40441 92.4%
Avg 2.0 43493 40991 94.2%
Avg 1.33 43219 43424 100.5%
Avg 1.0 44993 45204 100.5%
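The "master vs vN" column is simply the patched time expressed as a percentage of the master time, so figures below 100% are speedups. A quick check of the arithmetic (our illustration, not part of the thread):

```python
def pct(master_ms, patched_ms):
    """Patched runtime as a percentage of master (lower is faster)."""
    return round(patched_ms / master_ms * 100, 1)

# "All tuples in same part" row, master vs v1: 42667 ms -> 30473 ms
print(pct(42667, 30473))
# "Avg 1.0" row, master vs v1: the worst case regresses slightly
print(pct(44993, 47616))
```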
To test this I expanded the two Perl files into 6 files which are
formatted like:
for (my $i=0; $i < 2228222; $i++) {
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
print "2018-04-25 15:00:00|1|2|3|4|5\n";
print "2018-04-26 15:00:00|1|2|3|4|5\n";
}
The 2228222 is just a number I set so a 1GB table is produced.
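As a rough sanity check on that number (our estimate, not from the email): each loop iteration prints 8 rows, and a heap tuple for a (timestamp, 5 ints) row costs very roughly 60 bytes on disk once the 24-byte tuple header, alignment, and line pointer are counted. The 60-byte figure is an assumption:

```python
ROWS_PER_ITERATION = 8      # each Perl loop iteration prints 8 rows
EST_BYTES_PER_ROW = 60      # assumed on-disk cost per heap tuple, not measured

iterations = 2228222
total_rows = iterations * ROWS_PER_ITERATION
est_table_bytes = total_rows * EST_BYTES_PER_ROW
print(total_rows, round(est_table_bytes / 2**30, 2))  # row count, ~GiB
```

which lands almost exactly on 1GiB, consistent with the stated table size.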
To test I did:
\timing on
truncate table partbench_;
copy partbench_ from program $$perl ~/partbench_11111111.pl$$
delimiter '|'; -- control test
truncate table partbench;
copy partbench from program $$perl ~/partbench_11111111.pl$$ delimiter
'|'; -- all tuples in same part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12222222.pl$$ delimiter
'|'; -- avg 4 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12221222.pl$$ delimiter
'|'; -- avg 2.0 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12211221.pl$$ delimiter
'|'; -- avg 2.0 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12212112.pl$$ delimiter
'|'; -- avg 1.33 tuples per part
truncate table partbench;
copy partbench from program $$perl ~/partbench_12121212.pl$$ delimiter
'|'; -- avg 1.0 tuples per part
The files are named according to what each of the 8 print statements
does. Anything with a "1" uses the 2018-04-25 timestamp and anything
with a "2" uses the 26th. So the one above is partbench_12121212.pl
Many thanks to both of you for the thorough review. I hope the
attached addresses all the concerns.
One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code; it's likely going to do much better than
a GUC since it can change itself during the copy, which will be useful
when the stream contains a mix of small and large sets of consecutive
tuples which belong to the same partition.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v2-0001-Allow-multi-inserts-during-COPY-into-a-partitione.patch
From 30a73fc6fc3f75748351a63d35f19202a47c2756 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Thu, 21 Jun 2018 20:59:35 +1200
Subject: [PATCH v2] Allow multi-inserts during COPY into a partitioned table
CopyFrom allows multi-inserts to be used for non-partitioned tables, but
this was disabled for partitioned tables. The reason for this appeared to
be due to the fact that the tuple may not belong to the same partition as
the previous tuple did. Not allowing multi-inserts here greatly slowed
down imports into partitioned tables. These could take twice as long as a
copy to an equivalent non-partitioned table. It seems wise to do something
about this, so this commit allows the multi-inserts by flushing the so-far
inserted tuples to the partition when the next tuple does not belong to the
same partition, or when the buffer fills. This improves performance when
the next tuple in the stream commonly belongs to the same partition as the
previous tuple.
In cases where the target partition changes on every tuple using
multi-inserts slightly slows the performance. To get around this we
track the average size of the batches that have been inserted and
adaptively enable or disable multi-inserts based on the size of the
batch. Some testing was done and the regression only seems to exist
when the average size of the insert batch is close to 1, so let's just
enable multi-inserts when the average size is at least 1.3. More
performance testing might reveal a better number for this, but since the
slowdown was only 1-2% it does not seem critical enough to spend too much
time calculating it. In any case it may depend on other factors rather
than just the size of the batch.
Allowing multi-inserts for partitions required a bit of work around the
per-tuple memory contexts, as we must flush the tuples when the next tuple
does not belong to the same partition, in which case there is no good time to
reset the per-tuple context, as we've already built the new tuple by this
time. In order to work around this we maintain two per-tuple contexts and
just switch between them every time the partition changes and reset the old
one. This does mean that the first of each batch of tuples is not
allocated in the same memory context as the others, but that does not
matter since we only reset the context once the previous batch has been
inserted.
---
src/backend/commands/copy.c | 334 ++++++++++++++++++++++++++++++++++----------
1 file changed, 260 insertions(+), 74 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..6dda559300 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -47,6 +47,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/partcache.h"
#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/rls.h"
@@ -79,6 +80,16 @@ typedef enum EolType
EOL_CRNL
} EolType;
+/*
+ * Represents the heap insert method to be used during COPY to.
+ */
+typedef enum CopyInsertMethod
+{
+ CIM_SINGLE, /* use heap_insert or fdw routine */
+ CIM_MULTI, /* always use heap_multi_insert */
+ CIM_MULTI_CONDITIONAL /* use heap_multi_insert only if valid */
+} CopyInsertMethod;
+
/*
* This struct contains all the state variables used throughout a COPY
* operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2305,26 +2316,35 @@ CopyFrom(CopyState cstate)
Datum *values;
bool *nulls;
ResultRelInfo *resultRelInfo;
- ResultRelInfo *saved_resultRelInfo = NULL;
+ ResultRelInfo *target_resultRelInfo;
EState *estate = CreateExecutorState(); /* for ExecConstraints() */
ModifyTableState *mtstate;
ExprContext *econtext;
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
+ PartitionTupleRouting *proute = NULL;
+ ExprContext *secondaryExprContext = NULL;
ErrorContextCallback errcallback;
CommandId mycid = GetCurrentCommandId(true);
int hi_options = 0; /* start with default heap_insert options */
BulkInsertState bistate;
+ CopyInsertMethod insertMethod;
uint64 processed = 0;
- bool useHeapMultiInsert;
int nBufferedTuples = 0;
int prev_leaf_part_index = -1;
+ bool has_before_insert_row_trig;
+ bool has_instead_insert_row_trig;
+ bool leafpart_use_multi_insert = false;
#define MAX_BUFFERED_TUPLES 1000
+#define RECHECK_MULTI_INSERT_THRESHOLD 1000
HeapTuple *bufferedTuples = NULL; /* initialize to silence warning */
Size bufferedTuplesSize = 0;
uint64 firstBufferedLineNo = 0;
+ uint64 lastPartitionSampleLineNo = 0;
+ uint64 nPartitionChanges = 0;
+ double avgTuplesPerPartChange = 0;
Assert(cstate->rel);
@@ -2455,6 +2475,7 @@ CopyFrom(CopyState cstate)
1, /* dummy rangetable index */
NULL,
0);
+ target_resultRelInfo = resultRelInfo;
/* Verify the named relation is a valid target for INSERT */
CheckValidResultRel(resultRelInfo, CMD_INSERT);
@@ -2504,8 +2525,6 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionTupleRouting *proute;
-
proute = cstate->partition_tuple_routing =
ExecSetupPartitionTupleRouting(NULL, cstate->rel);
@@ -2522,28 +2541,92 @@ CopyFrom(CopyState cstate)
/*
* It's more efficient to prepare a bunch of tuples for insertion, and
* insert them in one heap_multi_insert() call, than call heap_insert()
- * separately for every tuple. However, we can't do that if there are
- * BEFORE/INSTEAD OF triggers, or we need to evaluate volatile default
- * expressions. Such triggers or expressions might query the table we're
- * inserting to, and act differently if the tuples that have already been
- * processed and prepared for insertion are not there. We also can't do
- * it if the table is foreign or partitioned.
+ * separately for every tuple. However, there are a number of reasons
+ * why we might not be able to do this. These are explained below.
*/
- if ((resultRelInfo->ri_TrigDesc != NULL &&
+ if (resultRelInfo->ri_TrigDesc != NULL &&
(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- resultRelInfo->ri_FdwRoutine != NULL ||
- cstate->partition_tuple_routing != NULL ||
- cstate->volatile_defexprs)
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
{
- useHeapMultiInsert = false;
+ /*
+ * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
+ * triggers on the table. Such triggers might query the table we're
+ * inserting into and act differently if the tuples that have already
+ * been processed and prepared for insertion are not there.
+ */
+ insertMethod = CIM_SINGLE;
+ }
+ else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
+ resultRelInfo->ri_TrigDesc->trig_insert_new_table)
+ {
+ /*
+ * For partitioned tables we can't support multi-inserts when there
+ * are any statement level insert triggers. (It might be possible to
+ * allow partitioned tables with such triggers in the future, but for
+ * now CopyFromInsertBatch expects that any BR insert and statement
+ * level insert triggers are on the same relation.)
+ */
+ insertMethod = CIM_SINGLE;
+ }
+ else if (resultRelInfo->ri_FdwRoutine != NULL ||
+ cstate->volatile_defexprs)
+ {
+ /*
+ * Can't support multi-inserts to foreign tables or if there are
+ * any volatile default expressions in the table. Similarly to
+ * the trigger case above, such expressions may query the table
+ * we're inserting into.
+ *
+ * Note: It does not matter if any partitions have any volatile
+ * default expressions as we use the defaults from the target of the
+ * COPY command.
+ */
+ insertMethod = CIM_SINGLE;
}
else
{
- useHeapMultiInsert = true;
+ /*
+ * For partitioned tables, we may still be able to perform bulk
+ * inserts for sets of consecutive tuples which belong to the same
+ * partition. However, the possibility of this depends on which
+ * types of triggers exist on the partition. We must disable bulk
+ * inserts if the partition is a foreign table or it has any BR insert
+ * or insert instead triggers (same as we checked above for the parent
+ * table). Since the partition's resultRelInfos are initialized only
+ * when we actually need to insert the first tuple into them, we must
+ * have the intermediate insert method of CIM_MULTI_CONDITIONAL to
+ * flag that we must later determine if we can use bulk-inserts for
+ * the partition being inserted into.
+ *
+ * Normally, when performing bulk inserts we just flush the insert
+ * buffer whenever it becomes full, but for the partitioned table
+ * case, we flush it whenever the current tuple does not belong to
+ * the same partition as the previous tuple, and since we flush the
+ * previous partition's buffer once the new tuple has already been
+ * built, we're unable to reset the estate since we'd free the memory
+ * in which the new tuple is stored. To work around this we maintain
+ * a secondary expression context and alternate between these when the
+ * partition changes. This does mean we do store the first new tuple
+ * in a different context than subsequent tuples, but that does not
+ * matter, providing we don't free anything while it's still needed.
+ */
+ if (proute)
+ {
+ insertMethod = CIM_MULTI_CONDITIONAL;
+ secondaryExprContext = CreateExprContext(estate);
+ }
+ else
+ insertMethod = CIM_MULTI;
+
bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
}
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
/*
* Check BEFORE STATEMENT insertion triggers. It's debatable whether we
* should do this for COPY, since it's not really an "INSERT" statement as
@@ -2598,7 +2681,7 @@ CopyFrom(CopyState cstate)
* Constraints might reference the tableoid column, so initialize
* t_tableOid before evaluating them.
*/
- tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+ tuple->t_tableOid = RelationGetRelid(target_resultRelInfo->ri_RelationDesc);
/* Triggers and stuff need to be invoked in query context. */
MemoryContextSwitchTo(oldcontext);
@@ -2608,11 +2691,9 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_tuple_routing)
+ if (proute)
{
int leaf_part_index;
- PartitionTupleRouting *proute = cstate->partition_tuple_routing;
-
/*
* Away we go ... If we end up not finding a partition after all,
* ExecFindPartition() does not return and errors out instead.
@@ -2621,38 +2702,135 @@ CopyFrom(CopyState cstate)
* will get us the ResultRelInfo and TupleConversionMap for the
* partition, respectively.
*/
- leaf_part_index = ExecFindPartition(resultRelInfo,
+ leaf_part_index = ExecFindPartition(target_resultRelInfo,
proute->partition_dispatch_info,
slot,
estate);
Assert(leaf_part_index >= 0 &&
leaf_part_index < proute->num_partitions);
- /*
- * If this tuple is mapped to a partition that is not same as the
- * previous one, we'd better make the bulk insert mechanism gets a
- * new buffer.
- */
if (prev_leaf_part_index != leaf_part_index)
{
+ /*
+ * When performing bulk-inserts into partitioned tables we
+ * must insert the tuples seen so far to the heap whenever the
+ * partition changes. This might seem wasteful in cases where
+ * the partition changes on each tuple, but in cases where
+ * we're bulk loading into a single partition or the data
+ * being loaded is ordered in partition order then performance
+ * gains can easily be seen.
+ */
+ if (nBufferedTuples > 0)
+ {
+ ExprContext *swapcontext;
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ nBufferedTuples = 0;
+ bufferedTuplesSize = 0;
+
+ Assert(secondaryExprContext);
+
+ /*
+ * Normally we reset the per-tuple context whenever the
+ * bufferedTuples array is empty at the beginning of the
+ * loop, however, it is possible since we flush the buffer
+ * here that the buffer is never empty at the start of the
+ * loop. To prevent the per-tuple context from never
+ * being reset we maintain a second context and alternate
+ * between them when the partition changes. We can now
+ * reset secondaryExprContext as this is no longer needed,
+ * since we just flushed any tuples stored in it. We also
+ * now switch over to the other context. This does mean
+ * that the first tuple in the buffer won't be in the same
+ * context as the others, but that does not matter since
+ * we only reset it after the flush.
+ */
+ ReScanExprContext(secondaryExprContext);
+
+ swapcontext = secondaryExprContext;
+ secondaryExprContext = estate->es_per_tuple_exprcontext;
+ estate->es_per_tuple_exprcontext = swapcontext;
+ }
+
+ /*
+ * Overwrite resultRelInfo with the corresponding partition's
+ * one.
+ */
+ resultRelInfo = proute->partitions[leaf_part_index];
+ if (resultRelInfo == NULL)
+ {
+ resultRelInfo = ExecInitPartitionInfo(mtstate,
+ target_resultRelInfo,
+ proute, estate,
+ leaf_part_index);
+ proute->partitions[leaf_part_index] = resultRelInfo;
+ Assert(resultRelInfo != NULL);
+ }
+
+ /* Determine which triggers exist on this partition */
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
+ /* Check if we can multi-insert into this partition */
+ if (insertMethod == CIM_MULTI_CONDITIONAL)
+ {
+ nPartitionChanges++;
+
+ /*
+ * Here we adaptively enable multi-inserts based on the
+ * average number of tuples per recent multi-insert batch.
+ * We recalculate the average every
+ * RECHECK_MULTI_INSERT_THRESHOLD instead of taking the
+ * average over the whole copy. This allows us to enable
+ * multi-inserts when we get periods in the copy stream
+ * that have tuples commonly belonging to the same
+ * partition, and disable when the partition is changing
+ * too often.
+ */
+ if (lastPartitionSampleLineNo <= (cstate->cur_lineno -
+ RECHECK_MULTI_INSERT_THRESHOLD))
+ {
+ avgTuplesPerPartChange =
+ (cstate->cur_lineno - lastPartitionSampleLineNo) /
+ (double) nPartitionChanges;
+
+ lastPartitionSampleLineNo = cstate->cur_lineno;
+ nPartitionChanges = 0;
+ }
+
+ /*
+ * Tests have shown that using multi-inserts when the
+ * partition changes on every tuple slightly decreases
+ * the performance, however, there are benefits even when
+ * only some batches have just 2 tuples, so let's enable
+ * multi-inserts even when the average is quite low.
+ */
+ leafpart_use_multi_insert = !has_before_insert_row_trig &&
+ !has_instead_insert_row_trig &&
+ resultRelInfo->ri_FdwRoutine == NULL &&
+ avgTuplesPerPartChange >= 1.3;
+ }
+ else
+ leafpart_use_multi_insert = false;
+
+ /*
+ * We'd better make the bulk insert mechanism gets a new
+ * buffer when the partition being inserted into changes.
+ */
ReleaseBulkInsertStatePin(bistate);
prev_leaf_part_index = leaf_part_index;
}
-
- /*
- * Save the old ResultRelInfo and switch to the one corresponding
- * to the selected partition.
- */
- saved_resultRelInfo = resultRelInfo;
- resultRelInfo = proute->partitions[leaf_part_index];
- if (resultRelInfo == NULL)
- {
- resultRelInfo = ExecInitPartitionInfo(mtstate,
- saved_resultRelInfo,
- proute, estate,
- leaf_part_index);
- Assert(resultRelInfo != NULL);
- }
+ else
+ resultRelInfo = proute->partitions[leaf_part_index];
/*
* For ExecInsertIndexTuples() to work on the partition's indexes
@@ -2665,8 +2843,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->transition_capture != NULL)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
/*
* If there are any BEFORE triggers on the partition,
@@ -2675,7 +2852,7 @@ CopyFrom(CopyState cstate)
*/
cstate->transition_capture->tcs_original_insert_tuple = NULL;
cstate->transition_capture->tcs_map =
- TupConvMapForLeaf(proute, saved_resultRelInfo,
+ TupConvMapForLeaf(proute, target_resultRelInfo,
leaf_part_index);
}
else
@@ -2704,8 +2881,7 @@ CopyFrom(CopyState cstate)
skip_tuple = false;
/* BEFORE ROW INSERT Triggers */
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
@@ -2717,8 +2893,7 @@ CopyFrom(CopyState cstate)
if (!skip_tuple)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)
+ if (has_instead_insert_row_trig)
{
/* Pass the data to the INSTEAD ROW INSERT trigger */
ExecIRInsertTriggers(estate, resultRelInfo, slot);
@@ -2740,12 +2915,15 @@ CopyFrom(CopyState cstate)
* partition.
*/
if (resultRelInfo->ri_PartitionCheck &&
- (saved_resultRelInfo == NULL ||
- (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
+ (proute == NULL || has_before_insert_row_trig))
ExecPartitionCheck(resultRelInfo, slot, estate, true);
- if (useHeapMultiInsert)
+ /*
+ * Perform multi-inserts when enabled, or when loading a
+ * partitioned table that can support multi-inserts as
+ * determined above.
+ */
+ if (insertMethod == CIM_MULTI || leafpart_use_multi_insert)
{
/* Add this tuple to the tuple buffer */
if (nBufferedTuples == 0)
@@ -2783,7 +2961,7 @@ CopyFrom(CopyState cstate)
NULL);
if (slot == NULL) /* "do nothing" */
- goto next_tuple;
+ continue; /* next tuple please */
/* FDW might have changed tuple */
tuple = ExecMaterializeSlot(slot);
@@ -2823,22 +3001,28 @@ CopyFrom(CopyState cstate)
*/
processed++;
}
-
-next_tuple:
- /* Restore the saved ResultRelInfo */
- if (saved_resultRelInfo)
- {
- resultRelInfo = saved_resultRelInfo;
- estate->es_result_relation_info = resultRelInfo;
- }
}
/* Flush any remaining buffered tuples */
if (nBufferedTuples > 0)
- CopyFromInsertBatch(cstate, estate, mycid, hi_options,
- resultRelInfo, myslot, bistate,
- nBufferedTuples, bufferedTuples,
- firstBufferedLineNo);
+ {
+ if (insertMethod == CIM_MULTI_CONDITIONAL)
+ {
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
+ else
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ resultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
/* Done, clean up */
error_context_stack = errcallback.previous;
@@ -2855,7 +3039,7 @@ next_tuple:
pq_endmsgread();
/* Execute AFTER STATEMENT insertion triggers */
- ExecASInsertTriggers(estate, resultRelInfo, cstate->transition_capture);
+ ExecASInsertTriggers(estate, target_resultRelInfo, cstate->transition_capture);
/* Handle queued AFTER triggers */
AfterTriggerEndQuery(estate);
@@ -2866,12 +3050,12 @@ next_tuple:
ExecResetTupleTable(estate->es_tupleTable, false);
/* Allow the FDW to shut down */
- if (resultRelInfo->ri_FdwRoutine != NULL &&
- resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
- resultRelInfo->ri_FdwRoutine->EndForeignInsert(estate,
- resultRelInfo);
+ if (target_resultRelInfo->ri_FdwRoutine != NULL &&
+ target_resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+ target_resultRelInfo->ri_FdwRoutine->EndForeignInsert(estate,
+ target_resultRelInfo);
- ExecCloseIndices(resultRelInfo);
+ ExecCloseIndices(target_resultRelInfo);
/* Close all the partitioned tables, leaf partitions, and their indices */
if (cstate->partition_tuple_routing)
@@ -2907,6 +3091,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
MemoryContext oldcontext;
int i;
uint64 save_cur_lineno;
+ bool line_buf_valid = cstate->line_buf_valid;
/*
* Print error context information correctly, if one of the operations
@@ -2920,7 +3105,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
* before calling it.
*/
oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
- heap_multi_insert(cstate->rel,
+ heap_multi_insert(resultRelInfo->ri_RelationDesc,
bufferedTuples,
nBufferedTuples,
mycid,
@@ -2967,7 +3152,8 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
}
}
- /* reset cur_lineno to where we were */
+ /* reset cur_lineno and line_buf_valid to what they were */
+ cstate->line_buf_valid = line_buf_valid;
cstate->cur_lineno = save_cur_lineno;
}
--
2.17.1
On 20.07.18 16:57, David Rowley wrote:
One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code, it's likely going to do much better than
a GUC since it can change itself during the copy which will be useful
when the stream contains a mix of small and large sets of consecutive
tuples which belong to the same partition.
I think some kind of way to switch between the two code paths would be
desirable. For example, with hash partitioning, it's likely that in
many cases you won't find any adjacent candidates in batches of
significant size. So then you've just made everything 5% slower.
Unless we can make the multi-insert path itself faster.
The particular heuristic you have chosen seems sensible to me, but we'll
have to see how it holds up in practice.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 24 July 2018 at 06:38, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 20.07.18 16:57, David Rowley wrote:
One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code, it's likely going to do much better than
a GUC since it can change itself during the copy which will be useful
when the stream contains a mix of small and large sets of consecutive
tuples which belong to the same partition.
I think some kind of way to switch between the two code paths would be
desirable. For example, with hash partitioning, it's likely that in
many cases you won't find any adjacent candidates in batches of
significant size. So then you've just made everything 5% slower.
Unless we can make the multi-insert path itself faster.
The particular heuristic you have chosen seems sensible to me, but we'll
have to see how it holds up in practice.
Thanks for having a look at this. You're probably right here,
although, the slowdown I measured was 2.2%, not 5%. I had it in my
head that it was about 1%, but for 2.2% it seems worth the extra code.
I've attached a small delta against the v2 patch that fixes up a few
comments and gets rid of a needless assignment.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v2_copy_speedup_fixup_delta.patch
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 6dda559300..71a4fa8ebe 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2714,11 +2714,7 @@ CopyFrom(CopyState cstate)
/*
* When performing bulk-inserts into partitioned tables we
* must insert the tuples seen so far to the heap whenever the
- * partition changes. This might seem wasteful in cases where
- * the partition changes on each tuple, but in cases where
- * we're bulk loading into a single partition or the data
- * being loaded is ordered in partition order then performance
- * gains can easily be seen.
+ * partition changes.
*/
if (nBufferedTuples > 0)
{
@@ -2787,12 +2783,12 @@ CopyFrom(CopyState cstate)
/*
* Here we adaptively enable multi-inserts based on the
- * average number of tuples per recent multi-insert batch.
- * We recalculate the average every
- * RECHECK_MULTI_INSERT_THRESHOLD instead of taking the
- * average over the whole copy. This allows us to enable
- * multi-inserts when we get periods in the copy stream
- * that have tuples commonly belonging to the same
+ * average number of tuples from recent multi-insert
+ * batches. We recalculate the average every
+ * RECHECK_MULTI_INSERT_THRESHOLD tuples instead of taking
+ * the average over the whole copy. This allows us to
+ * enable multi-inserts when we get periods in the copy
+ * stream that have tuples commonly belonging to the same
* partition, and disable when the partition is changing
* too often.
*/
@@ -2829,8 +2825,6 @@ CopyFrom(CopyState cstate)
ReleaseBulkInsertStatePin(bistate);
prev_leaf_part_index = leaf_part_index;
}
- else
- resultRelInfo = proute->partitions[leaf_part_index];
/*
* For ExecInsertIndexTuples() to work on the partition's indexes
On Fri, Jul 20, 2018 at 7:57 AM, David Rowley <david.rowley@2ndquadrant.com>
wrote:
So, overall, we feel that the code from lines 2689 until 2691 and from
2740 to 2766 could use further clarification with regard to switching from
parent to child partition and sibling to sibling partition as well as
regarding saving relinfo for partitions that have not been seen before in
the proute->partitions[].
I hope it's more clear now that I've got rid of saved_resultRelInfo.
I think all of the refactoring and clarifications look good to me and are
much more clear.
One small additional typo I noticed:
In the patched code on line 2555, there appears to be a typo:
/* ...
* inserting into and act differently if the tuples that have already
* been processed any prepared for insertion are not there.
*/
The word "any" here is probably wrong, though I am actually not sure what
was meant.
Though it is from the existing comments, it would be nice to spell out once
what is meant by "BR trigger". I'm sure it is mentioned elsewhere in the
code, but, since it is not easily googled, it might be helpful to specify.
Many thanks to both of you for the thorough review. I hope the
attached addresses all the concerns.
I feel that the code itself is clear and feel our concerns are addressed.
One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code, it's likely going to do much better than
a GUC since it can change itself during the copy which will be useful
when the stream contains a mix of small and large sets of consecutive
tuples which belong to the same partition.
Though the v2 numbers do look better, it doesn't completely alleviate the
slow-down in the worst case. Perhaps the GUC and the adaptive code are not
alternatives and could instead be used together. You could even make the
actual RECHECK_MULTI_INSERT_THRESHOLD the GUC.
However, I think that the decision as to whether or not it makes sense to
do the adaptive code without a GUC is really based on the average use case,
to which I can't really speak.
The counter-argument I see, however, is that it is not acceptable to have
completely un-optimized insertion of data into partition tables and that an
overwhelming flood of GUCs is undesirable.
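If the threshold were exposed as a GUC as suggested, it would presumably be a plain integer setting. A hypothetical guc.c-style entry might look like the sketch below; the variable name, group, and description are all invented for illustration, and the thread does not settle on any of this:

```c
/*
 * Hypothetical entry for guc.c's ConfigureNamesInt[]; everything here
 * is illustrative -- no such GUC exists in the patch or in PostgreSQL.
 */
{
	{"copy_multi_insert_recheck_threshold", PGC_USERSET, CLIENT_CONN_STATEMENT,
		gettext_noop("Sets how many rows COPY processes between "
					 "re-evaluations of the multi-insert heuristic."),
		NULL
	},
	&copy_multi_insert_recheck_threshold,
	1000, 1, INT_MAX,
	NULL, NULL, NULL
},
```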
On 24 July 2018 at 16:32, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Fri, Jul 20, 2018 at 7:57 AM, David Rowley <david.rowley@2ndquadrant.com>
wrote:
One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code, it's likely going to do much better than
a GUC since it can change itself during the copy which will be useful
when the stream contains a mix of small and large sets of consecutive
tuples which belong to the same partition.
Though the v2 numbers do look better, it doesn't completely alleviate the
slow-down in the worst case. Perhaps the GUC and the adaptive code are not
alternatives and could instead be used together. You could even make the
actual RECHECK_MULTI_INSERT_THRESHOLD the GUC.
However, I think that the decision as to whether or not it makes sense to do
the adaptive code without a GUC is really based on the average use case, to
which I can't really speak.
The counter-argument I see, however, is that it is not acceptable to have
completely un-optimized insertion of data into partition tables and that an
overwhelming flood of GUCs is undesirable.
I don't see any need here for another GUC, nor even a command option.
The user has already indicated their use case to us:
We know that the common case for RANGE partitioned tables is to load
into the one current partition. We also know that the partition might
change as we load data, when the data loaded crosses the partition
boundary, so we need this optimization to be adaptive. Not all data
loads follow that rule, so we also need the adaptive algorithm to shut
off for those cases.
We also know that the common case for HASH partitions is to load into
all partitions at once, since hash is specifically designed to spread
data out across partitions. It is almost never true that we would want
to load one partition at a time, so it seems easy to turn the
optimization off if we use this type of partitioning. Or better, we
need work done to improve that case also, but that is outside the
current scope of this patch.
If we have multi-level partitions, if any of the levels includes a
Hash, then turn it off.
LIST partitions are less likely to have a clear pattern, so I would
treat them like HASH and assume the data is not sorted by partition.
So for this patch, just add an "if (RANGE)" test.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi Melanie,
Many thanks for looking over this again.
On 25 July 2018 at 03:32, Melanie Plageman <melanieplageman@gmail.com> wrote:
One small additional typo I noticed:
In the patched code on line 2555, there appears to be a typo:
/* ...
* inserting into and act differently if the tuples that have already
* been processed any prepared for insertion are not there.
*/
The word "any" here is probably wrong, though I am actually not sure what
was meant.
It was meant to be "and"
Though it is from the existing comments, it would be nice to spell out once
what is meant by "BR trigger". I'm sure it is mentioned elsewhere in the
code, but, since it is not easily googled, it might be helpful to specify.
Many thanks to both of you for the thorough review. I hope the
attached addresses all the concerns.
I feel that the code itself is clear and feel our concerns are addressed.
Great!
One final note: I'm not entirely convinced we need this adaptive
code, but it seems easy enough to rip it back out if it's more trouble
than it's worth. But if the other option is a GUC, then I'd rather
stick with the adaptive code, it's likely going to do much better than
a GUC since it can change itself during the copy which will be useful
when the stream contains a mix of small and large sets of consecutive
tuples which belong to the same partition.
Though the v2 numbers do look better, it doesn't completely alleviate the
slow-down in the worst case. Perhaps the GUC and the adaptive code are not
alternatives and could instead be used together. You could even make the
actual RECHECK_MULTI_INSERT_THRESHOLD the GUC.
However, I think that the decision as to whether or not it makes sense to do
the adaptive code without a GUC is really based on the average use case, to
which I can't really speak.
The counter-argument I see, however, is that it is not acceptable to have
completely un-optimized insertion of data into partition tables and that an
overwhelming flood of GUCs is undesirable.
I'm pretty much against a GUC, because:
1) Most people won't use it or know about it, and;
2) Copy streams might contain mixes of long runs of tuples belonging
to the same partition and periods where the partition changes frequently.
A GUC cannot be set to account for that, and;
3) Because of the following:
I looked at this patch again and I saw that there are a few more
things we can do to improve the performance further.
I've made the following changes:
a) Made the avgTuplesPerPartChange >= 1.3 the first condition in the
test that sets leafpart_use_multi_insert.
b) Added an unlikely() to the test of whether the partition's resultRelInfo
has yet to be initialised. This is only true the first time a
partition is inserted into during the COPY.
c) Added unlikely() around the lastPartitionSampleLineNo test. This will
only be true about 1 time in 1000, so pretty unlikely.
d) Moved the nBufferedTuples > 0 test under the insertMethod ==
CIM_MULTI_CONDITIONAL test. It's never going to be > 0 when we're
doing CIM_SINGLE.
I spun up another AWS m5d.large instance to test this. This time I
only tested the case where the partition changes on every tuple. The
new patch takes about 96.5% of the time that master takes. I performed
10 runs of each and tested both with fsync=on and fsync=off. I've
attached the results in spreadsheet form.
I think given the new performance results there's no need for any GUC
or conditionally enabling this optimisation based on partitioning
strategy.
v3 patch is attached.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v3-0001-Allow-multi-inserts-during-COPY-into-a-partitione.patch
From ced8b4ee585dacbfc7c996b9c6d5b789a94a2cbc Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Tue, 24 Jul 2018 12:30:56 +1200
Subject: [PATCH v3] Allow multi-inserts during COPY into a partitioned table
CopyFrom allows multi-inserts to be used for non-partitioned tables, but
this was disabled for partitioned tables. The reason for this appeared to
be due to the fact that the tuple may not belong to the same partition as
the previous tuple did. Not allowing multi-inserts here greatly slowed
down imports into partitioned tables. These could take twice as long as a
copy to an equivalent non-partitioned table. It seems wise to do something
about this, so this commit allows the multi-inserts by flushing the so-far
inserted tuples to the partition when the next tuple does not belong to the
same partition, or when the buffer fills. This improves performance when
the next tuple in the stream commonly belongs to the same partition as the
previous tuple.
In cases where the target partition changes on every tuple using
multi-inserts slightly slows the performance. To get around this we
track the average size of the batches that have been inserted and
adaptively enable or disable multi-inserts based on the size of the
batch. Some testing was done and the regression only seems to exist
when the average size of the insert batch is close to 1, so let's just
enable multi-inserts when the average size is at least 1.3. More
performance testing might reveal a better number for this, but since the
slowdown was only 1-2% it does not seem critical enough to spend too much
time calculating it. In any case it may depend on other factors rather
than just the size of the batch.
Allowing multi-inserts for partitions required a bit of work around the
per-tuple memory contexts as we must flush the tuples when the next tuple
does not belong to the same partition, in which case there is no good time to
reset the per-tuple context, as we've already built the new tuple by this
time. In order to work around this we maintain two per-tuple contexts and
just switch between them every time the partition changes and reset the old
one. This does mean that the first tuple of each batch is not
allocated in the same memory context as the others, but that does not
matter since we only reset the context once the previous batch has been
inserted.
---
src/backend/commands/copy.c | 330 ++++++++++++++++++++++++++++++++++----------
1 file changed, 256 insertions(+), 74 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..ec4c651b11 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -47,6 +47,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/partcache.h"
#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/rls.h"
@@ -79,6 +80,16 @@ typedef enum EolType
EOL_CRNL
} EolType;
+/*
+ * Represents the heap insert method to be used during COPY to.
+ */
+typedef enum CopyInsertMethod
+{
+ CIM_SINGLE, /* use heap_insert or fdw routine */
+ CIM_MULTI, /* always use heap_multi_insert */
+ CIM_MULTI_CONDITIONAL /* use heap_multi_insert only if valid */
+} CopyInsertMethod;
+
/*
* This struct contains all the state variables used throughout a COPY
* operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2305,26 +2316,35 @@ CopyFrom(CopyState cstate)
Datum *values;
bool *nulls;
ResultRelInfo *resultRelInfo;
- ResultRelInfo *saved_resultRelInfo = NULL;
+ ResultRelInfo *target_resultRelInfo;
EState *estate = CreateExecutorState(); /* for ExecConstraints() */
ModifyTableState *mtstate;
ExprContext *econtext;
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
+ PartitionTupleRouting *proute = NULL;
+ ExprContext *secondaryExprContext = NULL;
ErrorContextCallback errcallback;
CommandId mycid = GetCurrentCommandId(true);
int hi_options = 0; /* start with default heap_insert options */
BulkInsertState bistate;
+ CopyInsertMethod insertMethod;
uint64 processed = 0;
- bool useHeapMultiInsert;
int nBufferedTuples = 0;
int prev_leaf_part_index = -1;
+ bool has_before_insert_row_trig;
+ bool has_instead_insert_row_trig;
+ bool leafpart_use_multi_insert = false;
#define MAX_BUFFERED_TUPLES 1000
+#define RECHECK_MULTI_INSERT_THRESHOLD 1000
HeapTuple *bufferedTuples = NULL; /* initialize to silence warning */
Size bufferedTuplesSize = 0;
uint64 firstBufferedLineNo = 0;
+ uint64 lastPartitionSampleLineNo = 0;
+ uint64 nPartitionChanges = 0;
+ double avgTuplesPerPartChange = 0;
Assert(cstate->rel);
@@ -2455,6 +2475,7 @@ CopyFrom(CopyState cstate)
1, /* dummy rangetable index */
NULL,
0);
+ target_resultRelInfo = resultRelInfo;
/* Verify the named relation is a valid target for INSERT */
CheckValidResultRel(resultRelInfo, CMD_INSERT);
@@ -2504,8 +2525,6 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionTupleRouting *proute;
-
proute = cstate->partition_tuple_routing =
ExecSetupPartitionTupleRouting(NULL, cstate->rel);
@@ -2522,28 +2541,92 @@ CopyFrom(CopyState cstate)
/*
* It's more efficient to prepare a bunch of tuples for insertion, and
* insert them in one heap_multi_insert() call, than call heap_insert()
- * separately for every tuple. However, we can't do that if there are
- * BEFORE/INSTEAD OF triggers, or we need to evaluate volatile default
- * expressions. Such triggers or expressions might query the table we're
- * inserting to, and act differently if the tuples that have already been
- * processed and prepared for insertion are not there. We also can't do
- * it if the table is foreign or partitioned.
+ * separately for every tuple. However, there are a number of reasons why
+ * we might not be able to do this. These are explained below.
*/
- if ((resultRelInfo->ri_TrigDesc != NULL &&
- (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- resultRelInfo->ri_FdwRoutine != NULL ||
- cstate->partition_tuple_routing != NULL ||
- cstate->volatile_defexprs)
+ if (resultRelInfo->ri_TrigDesc != NULL &&
+ (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
+ {
+ /*
+ * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
+ * triggers on the table. Such triggers might query the table we're
+ * inserting into and act differently if the tuples that have already
+ * been processed and prepared for insertion are not there.
+ */
+ insertMethod = CIM_SINGLE;
+ }
+ else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
+ resultRelInfo->ri_TrigDesc->trig_insert_new_table)
+ {
+ /*
+ * For partitioned tables we can't support multi-inserts when there
+ * are any statement level insert triggers. (It might be possible to
+ * allow partitioned tables with such triggers in the future, but for
+ * now CopyFromInsertBatch expects that any before row insert and
+ * statement level insert triggers are on the same relation.)
+ */
+ insertMethod = CIM_SINGLE;
+ }
+ else if (resultRelInfo->ri_FdwRoutine != NULL ||
+ cstate->volatile_defexprs)
{
- useHeapMultiInsert = false;
+ /*
+ * Can't support multi-inserts to foreign tables or if there are any
+ * volatile default expressions in the table. Similarly to the
+ * trigger case above, such expressions may query the table we're
+ * inserting into.
+ *
+ * Note: It does not matter if any partitions have any volatile
+ * default expressions as we use the defaults from the target of the
+ * COPY command.
+ */
+ insertMethod = CIM_SINGLE;
}
else
{
- useHeapMultiInsert = true;
+ /*
+ * For partitioned tables, we may still be able to perform bulk
+ * inserts for sets of consecutive tuples which belong to the same
+ * partition. However, the possibility of this depends on which types
+ * of triggers exist on the partition. We must disable bulk inserts
+ * if the partition is a foreign table or it has any before row insert
+ * or insert instead triggers (same as we checked above for the parent
+ * table). Since the partition's resultRelInfos are initialized only
+ * when we actually need to insert the first tuple into them, we must
+ * have the intermediate insert method of CIM_MULTI_CONDITIONAL to
+ * flag that we must later determine if we can use bulk-inserts for
+ * the partition being inserted into.
+ *
+ * Normally, when performing bulk inserts we just flush the insert
+ * buffer whenever it becomes full, but for the partitioned table
+ * case, we flush it whenever the current tuple does not belong to the
+ * same partition as the previous tuple, and since we flush the
+ * previous partition's buffer once the new tuple has already been
+ * built, we're unable to reset the estate since we'd free the memory
+ * in which the new tuple is stored. To work around this we maintain
+ * a secondary expression context and alternate between these when the
+ * partition changes. This does mean we do store the first new tuple
+ * in a different context than subsequent tuples, but that does not
+ * matter, providing we don't free anything while it's still needed.
+ */
+ if (proute)
+ {
+ insertMethod = CIM_MULTI_CONDITIONAL;
+ secondaryExprContext = CreateExprContext(estate);
+ }
+ else
+ insertMethod = CIM_MULTI;
+
bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
}
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
/*
* Check BEFORE STATEMENT insertion triggers. It's debatable whether we
* should do this for COPY, since it's not really an "INSERT" statement as
@@ -2598,7 +2681,7 @@ CopyFrom(CopyState cstate)
* Constraints might reference the tableoid column, so initialize
* t_tableOid before evaluating them.
*/
- tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+ tuple->t_tableOid = RelationGetRelid(target_resultRelInfo->ri_RelationDesc);
/* Triggers and stuff need to be invoked in query context. */
MemoryContextSwitchTo(oldcontext);
@@ -2608,10 +2691,9 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_tuple_routing)
+ if (proute)
{
int leaf_part_index;
- PartitionTupleRouting *proute = cstate->partition_tuple_routing;
/*
* Away we go ... If we end up not finding a partition after all,
@@ -2621,39 +2703,131 @@ CopyFrom(CopyState cstate)
* will get us the ResultRelInfo and TupleConversionMap for the
* partition, respectively.
*/
- leaf_part_index = ExecFindPartition(resultRelInfo,
+ leaf_part_index = ExecFindPartition(target_resultRelInfo,
proute->partition_dispatch_info,
slot,
estate);
Assert(leaf_part_index >= 0 &&
leaf_part_index < proute->num_partitions);
- /*
- * If this tuple is mapped to a partition that is not same as the
- * previous one, we'd better make the bulk insert mechanism gets a
- * new buffer.
- */
if (prev_leaf_part_index != leaf_part_index)
{
+ /* Check if we can multi-insert into this partition */
+ if (insertMethod == CIM_MULTI_CONDITIONAL)
+ {
+ /*
+ * When performing bulk-inserts into partitioned tables we
+ * must insert the tuples seen so far to the heap whenever
+ * the partition changes.
+ */
+ if (nBufferedTuples > 0)
+ {
+ ExprContext *swapcontext;
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ nBufferedTuples = 0;
+ bufferedTuplesSize = 0;
+
+ Assert(secondaryExprContext);
+
+ /*
+ * Normally we reset the per-tuple context whenever
+ * the bufferedTuples array is empty at the beginning
+ * of the loop, however, it is possible since we flush
+ * the buffer here that the buffer is never empty at
+ * the start of the loop. To prevent the per-tuple
+ * context from never being reset we maintain a second
+ * context and alternate between them when the
+ * partition changes. We can now reset
+ * secondaryExprContext as this is no longer needed,
+ * since we just flushed any tuples stored in it. We
+ * also now switch over to the other context. This
+ * does mean that the first tuple in the buffer won't
+ * be in the same context as the others, but that does
+ * not matter since we only reset it after the flush.
+ */
+ ReScanExprContext(secondaryExprContext);
+
+ swapcontext = secondaryExprContext;
+ secondaryExprContext = estate->es_per_tuple_exprcontext;
+ estate->es_per_tuple_exprcontext = swapcontext;
+ }
+
+ nPartitionChanges++;
+
+ /*
+ * Here we adaptively enable multi-inserts based on the
+ * average number of tuples from recent multi-insert
+ * batches. We recalculate the average every
+ * RECHECK_MULTI_INSERT_THRESHOLD tuples instead of taking
+ * the average over the whole copy. This allows us to
+ * enable multi-inserts when we get periods in the copy
+ * stream that have tuples commonly belonging to the same
+ * partition, and disable when the partition is changing
+ * too often.
+ */
+ if (unlikely(lastPartitionSampleLineNo <= (cstate->cur_lineno -
+ RECHECK_MULTI_INSERT_THRESHOLD)))
+ {
+ avgTuplesPerPartChange =
+ (cstate->cur_lineno - lastPartitionSampleLineNo) /
+ (double) nPartitionChanges;
+
+ lastPartitionSampleLineNo = cstate->cur_lineno;
+ nPartitionChanges = 0;
+ }
+
+ /*
+ * Tests have shown that using multi-inserts when the
+ * partition changes on every tuple slightly decreases the
+ * performance, however, there are benefits even when only
+ * some batches have just 2 tuples, so let's enable
+ * multi-inserts even when the average is quite low.
+ */
+ leafpart_use_multi_insert = avgTuplesPerPartChange >= 1.3 &&
+ !has_before_insert_row_trig &&
+ !has_instead_insert_row_trig &&
+ resultRelInfo->ri_FdwRoutine == NULL;
+ }
+ else
+ leafpart_use_multi_insert = false;
+
+ /*
+ * Overwrite resultRelInfo with the corresponding partition's
+ * one.
+ */
+ resultRelInfo = proute->partitions[leaf_part_index];
+ if (unlikely(resultRelInfo == NULL))
+ {
+ resultRelInfo = ExecInitPartitionInfo(mtstate,
+ target_resultRelInfo,
+ proute, estate,
+ leaf_part_index);
+ proute->partitions[leaf_part_index] = resultRelInfo;
+ Assert(resultRelInfo != NULL);
+ }
+
+ /* Determine which triggers exist on this partition */
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
+ /*
+ * We'd better make sure the bulk insert mechanism gets a new
+ * buffer when the partition being inserted into changes.
+ */
ReleaseBulkInsertStatePin(bistate);
prev_leaf_part_index = leaf_part_index;
}
- /*
- * Save the old ResultRelInfo and switch to the one corresponding
- * to the selected partition.
- */
- saved_resultRelInfo = resultRelInfo;
- resultRelInfo = proute->partitions[leaf_part_index];
- if (resultRelInfo == NULL)
- {
- resultRelInfo = ExecInitPartitionInfo(mtstate,
- saved_resultRelInfo,
- proute, estate,
- leaf_part_index);
- Assert(resultRelInfo != NULL);
- }
-
/*
* For ExecInsertIndexTuples() to work on the partition's indexes
*/
@@ -2665,8 +2839,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->transition_capture != NULL)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
/*
* If there are any BEFORE triggers on the partition,
@@ -2675,7 +2848,7 @@ CopyFrom(CopyState cstate)
*/
cstate->transition_capture->tcs_original_insert_tuple = NULL;
cstate->transition_capture->tcs_map =
- TupConvMapForLeaf(proute, saved_resultRelInfo,
+ TupConvMapForLeaf(proute, target_resultRelInfo,
leaf_part_index);
}
else
@@ -2704,8 +2877,7 @@ CopyFrom(CopyState cstate)
skip_tuple = false;
/* BEFORE ROW INSERT Triggers */
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
@@ -2717,8 +2889,7 @@ CopyFrom(CopyState cstate)
if (!skip_tuple)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)
+ if (has_instead_insert_row_trig)
{
/* Pass the data to the INSTEAD ROW INSERT trigger */
ExecIRInsertTriggers(estate, resultRelInfo, slot);
@@ -2740,12 +2911,15 @@ CopyFrom(CopyState cstate)
* partition.
*/
if (resultRelInfo->ri_PartitionCheck &&
- (saved_resultRelInfo == NULL ||
- (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
+ (proute == NULL || has_before_insert_row_trig))
ExecPartitionCheck(resultRelInfo, slot, estate, true);
- if (useHeapMultiInsert)
+ /*
+ * Perform multi-inserts when enabled, or when loading a
+ * partitioned table that can support multi-inserts as
+ * determined above.
+ */
+ if (insertMethod == CIM_MULTI || leafpart_use_multi_insert)
{
/* Add this tuple to the tuple buffer */
if (nBufferedTuples == 0)
@@ -2783,7 +2957,7 @@ CopyFrom(CopyState cstate)
NULL);
if (slot == NULL) /* "do nothing" */
- goto next_tuple;
+ continue; /* next tuple please */
/* FDW might have changed tuple */
tuple = ExecMaterializeSlot(slot);
@@ -2823,22 +2997,28 @@ CopyFrom(CopyState cstate)
*/
processed++;
}
-
-next_tuple:
- /* Restore the saved ResultRelInfo */
- if (saved_resultRelInfo)
- {
- resultRelInfo = saved_resultRelInfo;
- estate->es_result_relation_info = resultRelInfo;
- }
}
/* Flush any remaining buffered tuples */
if (nBufferedTuples > 0)
- CopyFromInsertBatch(cstate, estate, mycid, hi_options,
- resultRelInfo, myslot, bistate,
- nBufferedTuples, bufferedTuples,
- firstBufferedLineNo);
+ {
+ if (insertMethod == CIM_MULTI_CONDITIONAL)
+ {
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
+ else
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ resultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
/* Done, clean up */
error_context_stack = errcallback.previous;
@@ -2855,7 +3035,7 @@ next_tuple:
pq_endmsgread();
/* Execute AFTER STATEMENT insertion triggers */
- ExecASInsertTriggers(estate, resultRelInfo, cstate->transition_capture);
+ ExecASInsertTriggers(estate, target_resultRelInfo, cstate->transition_capture);
/* Handle queued AFTER triggers */
AfterTriggerEndQuery(estate);
@@ -2866,12 +3046,12 @@ next_tuple:
ExecResetTupleTable(estate->es_tupleTable, false);
/* Allow the FDW to shut down */
- if (resultRelInfo->ri_FdwRoutine != NULL &&
- resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
- resultRelInfo->ri_FdwRoutine->EndForeignInsert(estate,
- resultRelInfo);
+ if (target_resultRelInfo->ri_FdwRoutine != NULL &&
+ target_resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+ target_resultRelInfo->ri_FdwRoutine->EndForeignInsert(estate,
+ target_resultRelInfo);
- ExecCloseIndices(resultRelInfo);
+ ExecCloseIndices(target_resultRelInfo);
/* Close all the partitioned tables, leaf partitions, and their indices */
if (cstate->partition_tuple_routing)
@@ -2907,6 +3087,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
MemoryContext oldcontext;
int i;
uint64 save_cur_lineno;
+ bool line_buf_valid = cstate->line_buf_valid;
/*
* Print error context information correctly, if one of the operations
@@ -2920,7 +3101,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
* before calling it.
*/
oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
- heap_multi_insert(cstate->rel,
+ heap_multi_insert(resultRelInfo->ri_RelationDesc,
bufferedTuples,
nBufferedTuples,
mycid,
@@ -2967,7 +3148,8 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
}
}
- /* reset cur_lineno to where we were */
+ /* reset cur_lineno and line_buf_valid to what they were */
+ cstate->line_buf_valid = line_buf_valid;
cstate->cur_lineno = save_cur_lineno;
}
--
2.17.1
On 25 July 2018 at 04:37, Simon Riggs <simon@2ndquadrant.com> wrote:
I don't see any need here for another GUC, nor even a command option.
The user has already indicated their use case to us:
I agree.
We know that the common case for RANGE partitioned tables is to load
into the one current partition. We also know that the partition might
change as we load data, when the data loaded crosses the partition
boundary, so we need this optimization to be adaptive. Not all data
loads follow that rule, so we also need the adaptive algorithm to shut
off for those cases.

We also know that the common case for HASH partitions is to load into
all partitions at once, since hash is specifically designed to spread
data out across partitions. It is almost never true that we would want
to load one partition at a time, so it seems easy to turn the
optimization off if we use this type of partitioning. Or better, we
need work done to improve that case also, but that is outside the
current scope of this patch.

If we have multi-level partitions, and any of the levels includes a
Hash, then turn it off.

LIST partitions are less likely to have a clear pattern, so I would
treat them like HASH and assume the data is not sorted by partition.

So for this patch, just add an "if (RANGE)" test.
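The suggested heuristic boils down to a small standalone check. This is a hedged sketch with hypothetical names (the real strategy codes live in the partitioning catalog, and the patch ultimately took the adaptive route instead):

```c
#include <assert.h>

/*
 * Hypothetical reduction of the suggested heuristic.  PART_RANGE etc.
 * stand in for the real partition strategy codes; "levels" holds the
 * strategy at each level of a multi-level partition tree.
 */
typedef enum { PART_RANGE, PART_LIST, PART_HASH } PartStrategy;

/* Allow multi-inserts only when every level is RANGE partitioned */
static int
multi_insert_ok(const PartStrategy *levels, int nlevels)
{
	int			i;

	for (i = 0; i < nlevels; i++)
		if (levels[i] != PART_RANGE)
			return 0;
	return 1;
}
```

Any HASH level (or, per the suggestion, LIST) disables the optimisation; a pure RANGE tree keeps it on.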
I agree RANGE partition is probably the most likely case to benefit
from this optimisation, but I just don't think that HASH could never
benefit and LIST probably sits somewhere in the middle.
HASH partitioning might be useful in cases like partitioning by
"sensor_id". It does not seem that unreasonable that someone might
want to load all the data for an entire sensor at once.
The v3 version of the patch also fixes the very small performance
regression for the (probably quite likely) worst-case situation. New
performance is about 3.5% faster instead of 0.5-1% slower. So likely
there's no longer any need to consider this.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 26 July 2018 at 03:30, David Rowley <david.rowley@2ndquadrant.com> wrote:
The v3 version of the patch also fixes the very small performance
regression for the (probably quite likely) worst-case situation. New
performance is about 3.5% faster instead of 0.5-1% slower. So likely
there's no longer any need to consider this.
Works for me!
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, Jul 25, 2018 at 7:24 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:
On patched code line 2564, there is a missing parenthesis. It might be
better to remove the parentheses entirely and, while you're at it, there is
a missing comma.
/*
* For partitioned tables we can't support multi-inserts when there
* are any statement level insert triggers. (It might be possible to
* allow partitioned tables with such triggers in the future, but for
* now CopyFromInsertBatch expects that any before row insert and
* statement level insert triggers are on the same relation.
*/
Should be
/*
* For partitioned tables we can't support multi-inserts when there
* are any statement level insert triggers. It might be possible to
* allow partitioned tables with such triggers in the future, but for
* now, CopyFromInsertBatch expects that any before row insert and
* statement level insert triggers are on the same relation.
*/
Regarding the performance improvement and code diff from v2 to v3:
The rearranging of the code in v3 makes sense and improves the flow of the
code, however, I stared at it for a while and couldn't quite work out in my
head which part caused the significant improvement from v2.
I would have thought that a speedup from patch v2 to v3 would have come
from doing multi-inserts less often when the target partition is switching
a lot, however, looking at the v2 to v3 diff, I can't see how any of the
changes would have caused a decrease in the number of multi-inserts given
the same data and target table ddl.
So, I think I'm missing something. Which of the changes would cause the
performance improvement from patch v2 to v3? My understanding is:
a) Made the avgTuplesPerPartChange >= 1.3 the first condition in the
test that sets leafpart_use_multi_insert.
This would short-circuit checking the other conditions when deciding how to
set leafpart_use_multi_insert.
b) Added an unlikely() when testing if the partition's resultRelInfo
has yet to be initialised. This is only true the first time a
partition is inserted into during the COPY.
The addition of "unlikely" here seems like a good idea because there is a
function call (ExecInitPartitionInfo) inside the if statement. However,
would that cause such a difference in performance from v2 to v3? What would
be the reason? Cache misses? Some kind of pre-evaluation of the expression?
c) Added unlikely() around lastPartitionSampleLineNo test. This will
only be true 1 in 1000, so pretty unlikely.
Even though this makes sense based on the frequency with which this
condition will evaluate to true, I don't understand what the performance
benefit would be for adding a compiler hint here around such a trivial
calculation.
d) Moved the nBufferedTuples > 0 test under the insertMethod ==
CIM_MULTI_CONDITIONAL test. It's never going to be > 0 when we're
doing CIM_SINGLE.
This is a good catch. It makes this part of the code more clear, as well.
However, it doesn't seem like it would affect performance at all.
Let me know what I'm missing.
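For reference, the sampling logic discussed in (a) and (c) reduces to roughly the following. This is a standalone sketch with hypothetical names; the real code operates on cstate->cur_lineno and runs only when the target partition changes:

```c
#include <assert.h>
#include <stdint.h>

#define RECHECK_MULTI_INSERT_THRESHOLD 1000

/* Hypothetical standalone state; mirrors the patch's local variables */
typedef struct AdaptState
{
	int64_t		lastPartitionSampleLineNo;
	int64_t		nPartitionChanges;
	double		avgTuplesPerPartChange;
} AdaptState;

/*
 * Called whenever the target partition changes while reading input line
 * cur_lineno.  Returns 1 when multi-inserts look worthwhile.
 */
static int
partition_changed(AdaptState *st, int64_t cur_lineno)
{
	st->nPartitionChanges++;

	/* Recompute the average only every RECHECK_MULTI_INSERT_THRESHOLD lines */
	if (st->lastPartitionSampleLineNo <=
		cur_lineno - RECHECK_MULTI_INSERT_THRESHOLD)
	{
		st->avgTuplesPerPartChange =
			(cur_lineno - st->lastPartitionSampleLineNo) /
			(double) st->nPartitionChanges;
		st->lastPartitionSampleLineNo = cur_lineno;
		st->nPartitionChanges = 0;
	}

	/* Batches averaging at least 1.3 tuples make multi-inserts pay off */
	return st->avgTuplesPerPartChange >= 1.3;
}
```

With a partition change on every input line the average settles at 1.0 and multi-inserts stay disabled; once batches grow past the 1.3 threshold they switch back on at the next sample point.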
On 27 July 2018 at 05:30, Melanie Plageman <melanieplageman@gmail.com> wrote:
On patched code line 2564, there is a missing parenthesis. It might be
better to remove the parentheses entirely and, while you're at it, there is
a missing comma.
Thanks for noticing that.
Fixed in the attached v4. That's the only change.
Regarding the performance improvement and code diff from v2 to v3:
The rearranging of the code in v3 makes sense and improves the flow of the
code, however, I stared at it for a while and couldn't quite work out in my
head which part caused the significant improvement from v2.
I would have thought that a speedup from patch v2 to v3 would have come from
doing multi-inserts less often when the target partition is switching a lot,
however, looking at the v2 to v3 diff, I can't see how any of the changes
would have caused a decrease in the number of multi-inserts given the same
data and target table ddl.
So, I think I'm missing something. Which of the changes would cause the
performance improvement from patch v2 to v3? My understanding is:
My guess is that the compiler rearranged the code to move some of the
unlikely code out of line, perhaps to the end of the function, making
the common working set of code fit the instruction cache, where maybe
before there were some cache misses. I've no evidence of that as I
didn't look at the assembly code or do any profiling.
You could probably narrow it down by reverting those changes
one-by-one and seeing if it was just one in particular that caused the
performance improvement or if it was some combination of the changes.
Sounds like that would take quite a bit of time and it would only be
for a learning experience, but an interesting one!
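As an aside, the out-of-line placement described here is exactly what the unlikely() hint asks of the compiler. A minimal sketch of the idea (PostgreSQL's own macro wraps __builtin_expect; the function here is a hypothetical stand-in for the partition-init branch):

```c
#include <assert.h>

/* Minimal stand-in for PostgreSQL's unlikely() macro */
#if defined(__GNUC__)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define unlikely(x) ((x) != 0)
#endif

/*
 * Hypothetical example: the branch body runs once per partition (its
 * first insert), so hinting it cold lets the compiler move that body
 * out of the hot instruction stream.
 */
static int
init_if_needed(int *initialized)
{
	if (unlikely(!*initialized))
	{
		*initialized = 1;		/* stands in for ExecInitPartitionInfo() */
		return 1;				/* cold path taken */
	}
	return 0;					/* hot path */
}
```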
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jul 25, 2018 at 10:30 PM, David Rowley
<david.rowley@2ndquadrant.com> wrote:
I agree RANGE partition is probably the most likely case to benefit
from this optimisation, but I just don't think that HASH could never
benefit and LIST probably sits somewhere in the middle.

HASH partitioning might be useful in cases like partitioning by
"sensor_id". It does not seem that unreasonable that someone might
want to load all the data for an entire sensor at once.
Another case might be restoring a dump with --load-via-partition-root.
Granted, you probably shouldn't specify that option unless you expect
rows to end up in different partition than they were in the original
cluster, but it's possible somebody might do it out of an abundance of
caution if, say, they are doing an automated dump restore on a machine
that may or may not have different endian-ness than the original.
I think making it adaptive, as you've done, is definitely better than
a heuristic based on the partitioning type.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Jul 26, 2018 at 7:26 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:
Fixed in the attached v4. That's the only change.
I don't see an attachment.
So, I think I'm missing something. Which of the changes would cause the
performance improvement from patch v2 to v3? My understanding is:
You could probably narrow it down by reverting those changes
one-by-one and seeing if it was just one in particular that caused the
performance improvement or if it was some combination of the changes.
Sounds like that would take quite a bit of time and it would only be
for a learning experience, but an interesting one!
We are probably okay without doing that in this case. Assuming v4 changed
that typo, I feel ready to change the status of the patch to "ready for
committer".
On 29 July 2018 at 05:24, Melanie Plageman <melanieplageman@gmail.com> wrote:
On Thu, Jul 26, 2018 at 7:26 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:
Fixed in the attached v4. That's the only change.
I don't see an attachment.
Oops. Must've fallen off in transit :) Hopefully, it's more firmly
attached this time.
We are probably okay without doing that in this case. Assuming v4 changed
that typo, I feel ready to change the status of the patch to "ready for
committer".
Great. Many thanks to you and Karen for reviewing this and for pushing
me into the adaptive code. I think the patch is far better for it.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v4-0001-Allow-multi-inserts-during-COPY-into-a-partitione.patch
From 653a79963d3e7625c746bd6b95ff7e0009d422c9 Mon Sep 17 00:00:00 2001
From: "dgrowley@gmail.com" <dgrowley@gmail.com>
Date: Tue, 24 Jul 2018 12:30:56 +1200
Subject: [PATCH v4] Allow multi-inserts during COPY into a partitioned table
CopyFrom allows multi-inserts to be used for non-partitioned tables, but
this was disabled for partitioned tables. The reason for this appeared to
be due to the fact that the tuple may not belong to the same partition as
the previous tuple did. Not allowing multi-inserts here greatly slowed
down imports into partitioned tables. These could take twice as long as a
copy to an equivalent non-partitioned table. It seems wise to do something
about this, so this commit allows the multi-inserts by flushing the so-far
inserted tuples to the partition when the next tuple does not belong to the
same partition, or when the buffer fills. This improves performance when
the next tuple in the stream commonly belongs to the same partition as the
previous tuple.
In cases where the target partition changes on every tuple, using
multi-inserts slightly slows the performance. To get around this we
track the average size of the batches that have been inserted and
adaptively enable or disable multi-inserts based on the size of the
batch. Some testing was done and the regression only seems to exist
when the average size of the insert batch is close to 1, so let's just
enable multi-inserts when the average size is at least 1.3. More
performance testing might reveal a better number for this, but since the
slowdown was only 1-2% it does not seem critical enough to spend too much
time calculating it. In any case it may depend on other factors rather
than just the size of the batch.
Allowing multi-inserts for partitions required a bit of work around the
per-tuple memory contexts, as we must flush the tuples when the next tuple
does not belong to the same partition, in which case there is no good time to
reset the per-tuple context, as we've already built the new tuple by this
time. In order to work around this we maintain two per-tuple contexts and
just switch between them every time the partition changes and reset the old
one. This does mean that the first of each batch of tuples is not
allocated in the same memory context as the others, but that does not
matter since we only reset the context once the previous batch has been
inserted.
---
src/backend/commands/copy.c | 330 ++++++++++++++++++++++++++++++++++----------
1 file changed, 256 insertions(+), 74 deletions(-)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3a66cb5025..dacbe751c6 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -47,6 +47,7 @@
#include "utils/builtins.h"
#include "utils/lsyscache.h"
#include "utils/memutils.h"
+#include "utils/partcache.h"
#include "utils/portal.h"
#include "utils/rel.h"
#include "utils/rls.h"
@@ -79,6 +80,16 @@ typedef enum EolType
EOL_CRNL
} EolType;
+/*
+ * Represents the heap insert method to be used during COPY to.
+ */
+typedef enum CopyInsertMethod
+{
+ CIM_SINGLE, /* use heap_insert or fdw routine */
+ CIM_MULTI, /* always use heap_multi_insert */
+ CIM_MULTI_CONDITIONAL /* use heap_multi_insert only if valid */
+} CopyInsertMethod;
+
/*
* This struct contains all the state variables used throughout a COPY
* operation. For simplicity, we use the same struct for all variants of COPY,
@@ -2305,26 +2316,35 @@ CopyFrom(CopyState cstate)
Datum *values;
bool *nulls;
ResultRelInfo *resultRelInfo;
- ResultRelInfo *saved_resultRelInfo = NULL;
+ ResultRelInfo *target_resultRelInfo;
EState *estate = CreateExecutorState(); /* for ExecConstraints() */
ModifyTableState *mtstate;
ExprContext *econtext;
TupleTableSlot *myslot;
MemoryContext oldcontext = CurrentMemoryContext;
+ PartitionTupleRouting *proute = NULL;
+ ExprContext *secondaryExprContext = NULL;
ErrorContextCallback errcallback;
CommandId mycid = GetCurrentCommandId(true);
int hi_options = 0; /* start with default heap_insert options */
BulkInsertState bistate;
+ CopyInsertMethod insertMethod;
uint64 processed = 0;
- bool useHeapMultiInsert;
int nBufferedTuples = 0;
int prev_leaf_part_index = -1;
+ bool has_before_insert_row_trig;
+ bool has_instead_insert_row_trig;
+ bool leafpart_use_multi_insert = false;
#define MAX_BUFFERED_TUPLES 1000
+#define RECHECK_MULTI_INSERT_THRESHOLD 1000
HeapTuple *bufferedTuples = NULL; /* initialize to silence warning */
Size bufferedTuplesSize = 0;
uint64 firstBufferedLineNo = 0;
+ uint64 lastPartitionSampleLineNo = 0;
+ uint64 nPartitionChanges = 0;
+ double avgTuplesPerPartChange = 0;
Assert(cstate->rel);
@@ -2455,6 +2475,7 @@ CopyFrom(CopyState cstate)
1, /* dummy rangetable index */
NULL,
0);
+ target_resultRelInfo = resultRelInfo;
/* Verify the named relation is a valid target for INSERT */
CheckValidResultRel(resultRelInfo, CMD_INSERT);
@@ -2504,8 +2525,6 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- PartitionTupleRouting *proute;
-
proute = cstate->partition_tuple_routing =
ExecSetupPartitionTupleRouting(NULL, cstate->rel);
@@ -2522,28 +2541,92 @@ CopyFrom(CopyState cstate)
/*
* It's more efficient to prepare a bunch of tuples for insertion, and
* insert them in one heap_multi_insert() call, than call heap_insert()
- * separately for every tuple. However, we can't do that if there are
- * BEFORE/INSTEAD OF triggers, or we need to evaluate volatile default
- * expressions. Such triggers or expressions might query the table we're
- * inserting to, and act differently if the tuples that have already been
- * processed and prepared for insertion are not there. We also can't do
- * it if the table is foreign or partitioned.
+ * separately for every tuple. However, there are a number of reasons why
+ * we might not be able to do this. These are explained below.
*/
- if ((resultRelInfo->ri_TrigDesc != NULL &&
- (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
- resultRelInfo->ri_FdwRoutine != NULL ||
- cstate->partition_tuple_routing != NULL ||
- cstate->volatile_defexprs)
+ if (resultRelInfo->ri_TrigDesc != NULL &&
+ (resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row))
+ {
+ /*
+ * Can't support multi-inserts when there are any BEFORE/INSTEAD OF
+ * triggers on the table. Such triggers might query the table we're
+ * inserting into and act differently if the tuples that have already
+ * been processed and prepared for insertion are not there.
+ */
+ insertMethod = CIM_SINGLE;
+ }
+ else if (proute != NULL && resultRelInfo->ri_TrigDesc != NULL &&
+ resultRelInfo->ri_TrigDesc->trig_insert_new_table)
+ {
+ /*
+ * For partitioned tables we can't support multi-inserts when there
+ * are any statement level insert triggers. It might be possible to
+ * allow partitioned tables with such triggers in the future, but for
+ * now, CopyFromInsertBatch expects that any before row insert and
+ * statement level insert triggers are on the same relation.
+ */
+ insertMethod = CIM_SINGLE;
+ }
+ else if (resultRelInfo->ri_FdwRoutine != NULL ||
+ cstate->volatile_defexprs)
{
- useHeapMultiInsert = false;
+ /*
+ * Can't support multi-inserts to foreign tables or if there are any
+ * volatile default expressions in the table. Similarly to the
+ * trigger case above, such expressions may query the table we're
+ * inserting into.
+ *
+ * Note: It does not matter if any partitions have any volatile
+ * default expressions as we use the defaults from the target of the
+ * COPY command.
+ */
+ insertMethod = CIM_SINGLE;
}
else
{
- useHeapMultiInsert = true;
+ /*
+ * For partitioned tables, we may still be able to perform bulk
+ * inserts for sets of consecutive tuples which belong to the same
+ * partition. However, the possibility of this depends on which types
+ * of triggers exist on the partition. We must disable bulk inserts
+ * if the partition is a foreign table or it has any before row insert
+ * or insert instead triggers (same as we checked above for the parent
+ * table). Since the partition's resultRelInfos are initialized only
+ * when we actually need to insert the first tuple into them, we must
+ * have the intermediate insert method of CIM_MULTI_CONDITIONAL to
+ * flag that we must later determine if we can use bulk-inserts for
+ * the partition being inserted into.
+ *
+ * Normally, when performing bulk inserts we just flush the insert
+ * buffer whenever it becomes full, but for the partitioned table
+ * case, we flush it whenever the current tuple does not belong to the
+ * same partition as the previous tuple, and since we flush the
+ * previous partition's buffer once the new tuple has already been
+ * built, we're unable to reset the estate since we'd free the memory
+ * in which the new tuple is stored. To work around this we maintain
+ * a secondary expression context and alternate between these when the
+ * partition changes. This does mean we do store the first new tuple
+ * in a different context than subsequent tuples, but that does not
+ * matter, providing we don't free anything while it's still needed.
+ */
+ if (proute)
+ {
+ insertMethod = CIM_MULTI_CONDITIONAL;
+ secondaryExprContext = CreateExprContext(estate);
+ }
+ else
+ insertMethod = CIM_MULTI;
+
bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
}
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
/*
* Check BEFORE STATEMENT insertion triggers. It's debatable whether we
* should do this for COPY, since it's not really an "INSERT" statement as
@@ -2598,7 +2681,7 @@ CopyFrom(CopyState cstate)
* Constraints might reference the tableoid column, so initialize
* t_tableOid before evaluating them.
*/
- tuple->t_tableOid = RelationGetRelid(resultRelInfo->ri_RelationDesc);
+ tuple->t_tableOid = RelationGetRelid(target_resultRelInfo->ri_RelationDesc);
/* Triggers and stuff need to be invoked in query context. */
MemoryContextSwitchTo(oldcontext);
@@ -2608,10 +2691,9 @@ CopyFrom(CopyState cstate)
ExecStoreTuple(tuple, slot, InvalidBuffer, false);
/* Determine the partition to heap_insert the tuple into */
- if (cstate->partition_tuple_routing)
+ if (proute)
{
int leaf_part_index;
- PartitionTupleRouting *proute = cstate->partition_tuple_routing;
/*
* Away we go ... If we end up not finding a partition after all,
@@ -2621,39 +2703,131 @@ CopyFrom(CopyState cstate)
* will get us the ResultRelInfo and TupleConversionMap for the
* partition, respectively.
*/
- leaf_part_index = ExecFindPartition(resultRelInfo,
+ leaf_part_index = ExecFindPartition(target_resultRelInfo,
proute->partition_dispatch_info,
slot,
estate);
Assert(leaf_part_index >= 0 &&
leaf_part_index < proute->num_partitions);
- /*
- * If this tuple is mapped to a partition that is not same as the
- * previous one, we'd better make the bulk insert mechanism gets a
- * new buffer.
- */
if (prev_leaf_part_index != leaf_part_index)
{
+ /* Check if we can multi-insert into this partition */
+ if (insertMethod == CIM_MULTI_CONDITIONAL)
+ {
+ /*
+ * When performing bulk-inserts into partitioned tables we
+ * must insert the tuples seen so far to the heap whenever
+ * the partition changes.
+ */
+ if (nBufferedTuples > 0)
+ {
+ ExprContext *swapcontext;
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ nBufferedTuples = 0;
+ bufferedTuplesSize = 0;
+
+ Assert(secondaryExprContext);
+
+ /*
+ * Normally we reset the per-tuple context whenever
+ * the bufferedTuples array is empty at the beginning
+ * of the loop, however, it is possible since we flush
+ * the buffer here that the buffer is never empty at
+ * the start of the loop. To prevent the per-tuple
+ * context from never being reset we maintain a second
+ * context and alternate between them when the
+ * partition changes. We can now reset
+ * secondaryExprContext as this is no longer needed,
+ * since we just flushed any tuples stored in it. We
+ * also now switch over to the other context. This
+ * does mean that the first tuple in the buffer won't
+ * be in the same context as the others, but that does
+ * not matter since we only reset it after the flush.
+ */
+ ReScanExprContext(secondaryExprContext);
+
+ swapcontext = secondaryExprContext;
+ secondaryExprContext = estate->es_per_tuple_exprcontext;
+ estate->es_per_tuple_exprcontext = swapcontext;
+ }
+
+ nPartitionChanges++;
+
+ /*
+ * Here we adaptively enable multi-inserts based on the
+ * average number of tuples from recent multi-insert
+ * batches. We recalculate the average every
+ * RECHECK_MULTI_INSERT_THRESHOLD tuples instead of taking
+ * the average over the whole copy. This allows us to
+ * enable multi-inserts when we get periods in the copy
+ * stream that have tuples commonly belonging to the same
+ * partition, and disable when the partition is changing
+ * too often.
+ */
+ if (unlikely(lastPartitionSampleLineNo <= (cstate->cur_lineno -
+ RECHECK_MULTI_INSERT_THRESHOLD)))
+ {
+ avgTuplesPerPartChange =
+ (cstate->cur_lineno - lastPartitionSampleLineNo) /
+ (double) nPartitionChanges;
+
+ lastPartitionSampleLineNo = cstate->cur_lineno;
+ nPartitionChanges = 0;
+ }
+
+ /*
+ * Tests have shown that using multi-inserts when the
+ * partition changes on every tuple slightly decreases the
+ * performance, however, there are benefits even when only
+ * some batches have just 2 tuples, so let's enable
+ * multi-inserts even when the average is quite low.
+ */
+ leafpart_use_multi_insert = avgTuplesPerPartChange >= 1.3 &&
+ !has_before_insert_row_trig &&
+ !has_instead_insert_row_trig &&
+ resultRelInfo->ri_FdwRoutine == NULL;
+ }
+ else
+ leafpart_use_multi_insert = false;
+
+ /*
+ * Overwrite resultRelInfo with the corresponding partition's
+ * one.
+ */
+ resultRelInfo = proute->partitions[leaf_part_index];
+ if (unlikely(resultRelInfo == NULL))
+ {
+ resultRelInfo = ExecInitPartitionInfo(mtstate,
+ target_resultRelInfo,
+ proute, estate,
+ leaf_part_index);
+ proute->partitions[leaf_part_index] = resultRelInfo;
+ Assert(resultRelInfo != NULL);
+ }
+
+ /* Determine which triggers exist on this partition */
+ has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_before_row);
+
+ has_instead_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
+ resultRelInfo->ri_TrigDesc->trig_insert_instead_row);
+
+ /*
+ * We'd better make the bulk insert mechanism gets a new
+ * buffer when the partition being inserted into changes.
+ */
ReleaseBulkInsertStatePin(bistate);
prev_leaf_part_index = leaf_part_index;
}
- /*
- * Save the old ResultRelInfo and switch to the one corresponding
- * to the selected partition.
- */
- saved_resultRelInfo = resultRelInfo;
- resultRelInfo = proute->partitions[leaf_part_index];
- if (resultRelInfo == NULL)
- {
- resultRelInfo = ExecInitPartitionInfo(mtstate,
- saved_resultRelInfo,
- proute, estate,
- leaf_part_index);
- Assert(resultRelInfo != NULL);
- }
-
/*
* For ExecInsertIndexTuples() to work on the partition's indexes
*/
@@ -2665,8 +2839,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->transition_capture != NULL)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
/*
* If there are any BEFORE triggers on the partition,
@@ -2675,7 +2848,7 @@ CopyFrom(CopyState cstate)
*/
cstate->transition_capture->tcs_original_insert_tuple = NULL;
cstate->transition_capture->tcs_map =
- TupConvMapForLeaf(proute, saved_resultRelInfo,
+ TupConvMapForLeaf(proute, target_resultRelInfo,
leaf_part_index);
}
else
@@ -2704,8 +2877,7 @@ CopyFrom(CopyState cstate)
skip_tuple = false;
/* BEFORE ROW INSERT Triggers */
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)
+ if (has_before_insert_row_trig)
{
slot = ExecBRInsertTriggers(estate, resultRelInfo, slot);
@@ -2717,8 +2889,7 @@ CopyFrom(CopyState cstate)
if (!skip_tuple)
{
- if (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_instead_row)
+ if (has_instead_insert_row_trig)
{
/* Pass the data to the INSTEAD ROW INSERT trigger */
ExecIRInsertTriggers(estate, resultRelInfo, slot);
@@ -2740,12 +2911,15 @@ CopyFrom(CopyState cstate)
* partition.
*/
if (resultRelInfo->ri_PartitionCheck &&
- (saved_resultRelInfo == NULL ||
- (resultRelInfo->ri_TrigDesc &&
- resultRelInfo->ri_TrigDesc->trig_insert_before_row)))
+ (proute == NULL || has_before_insert_row_trig))
ExecPartitionCheck(resultRelInfo, slot, estate, true);
- if (useHeapMultiInsert)
+ /*
+ * Perform multi-inserts when enabled, or when loading a
+ * partitioned table that can support multi-inserts as
+ * determined above.
+ */
+ if (insertMethod == CIM_MULTI || leafpart_use_multi_insert)
{
/* Add this tuple to the tuple buffer */
if (nBufferedTuples == 0)
@@ -2783,7 +2957,7 @@ CopyFrom(CopyState cstate)
NULL);
if (slot == NULL) /* "do nothing" */
- goto next_tuple;
+ continue; /* next tuple please */
/* FDW might have changed tuple */
tuple = ExecMaterializeSlot(slot);
@@ -2823,22 +2997,28 @@ CopyFrom(CopyState cstate)
*/
processed++;
}
-
-next_tuple:
- /* Restore the saved ResultRelInfo */
- if (saved_resultRelInfo)
- {
- resultRelInfo = saved_resultRelInfo;
- estate->es_result_relation_info = resultRelInfo;
- }
}
/* Flush any remaining buffered tuples */
if (nBufferedTuples > 0)
- CopyFromInsertBatch(cstate, estate, mycid, hi_options,
- resultRelInfo, myslot, bistate,
- nBufferedTuples, bufferedTuples,
- firstBufferedLineNo);
+ {
+ if (insertMethod == CIM_MULTI_CONDITIONAL)
+ {
+ ResultRelInfo *presultRelInfo;
+
+ presultRelInfo = proute->partitions[prev_leaf_part_index];
+
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ presultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
+ else
+ CopyFromInsertBatch(cstate, estate, mycid, hi_options,
+ resultRelInfo, myslot, bistate,
+ nBufferedTuples, bufferedTuples,
+ firstBufferedLineNo);
+ }
/* Done, clean up */
error_context_stack = errcallback.previous;
@@ -2855,7 +3035,7 @@ next_tuple:
pq_endmsgread();
/* Execute AFTER STATEMENT insertion triggers */
- ExecASInsertTriggers(estate, resultRelInfo, cstate->transition_capture);
+ ExecASInsertTriggers(estate, target_resultRelInfo, cstate->transition_capture);
/* Handle queued AFTER triggers */
AfterTriggerEndQuery(estate);
@@ -2866,12 +3046,12 @@ next_tuple:
ExecResetTupleTable(estate->es_tupleTable, false);
/* Allow the FDW to shut down */
- if (resultRelInfo->ri_FdwRoutine != NULL &&
- resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
- resultRelInfo->ri_FdwRoutine->EndForeignInsert(estate,
- resultRelInfo);
+ if (target_resultRelInfo->ri_FdwRoutine != NULL &&
+ target_resultRelInfo->ri_FdwRoutine->EndForeignInsert != NULL)
+ target_resultRelInfo->ri_FdwRoutine->EndForeignInsert(estate,
+ target_resultRelInfo);
- ExecCloseIndices(resultRelInfo);
+ ExecCloseIndices(target_resultRelInfo);
/* Close all the partitioned tables, leaf partitions, and their indices */
if (cstate->partition_tuple_routing)
@@ -2907,6 +3087,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
MemoryContext oldcontext;
int i;
uint64 save_cur_lineno;
+ bool line_buf_valid = cstate->line_buf_valid;
/*
* Print error context information correctly, if one of the operations
@@ -2920,7 +3101,7 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
* before calling it.
*/
oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
- heap_multi_insert(cstate->rel,
+ heap_multi_insert(resultRelInfo->ri_RelationDesc,
bufferedTuples,
nBufferedTuples,
mycid,
@@ -2967,7 +3148,8 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
}
}
- /* reset cur_lineno to where we were */
+ /* reset cur_lineno and line_buf_valid to what they were */
+ cstate->line_buf_valid = line_buf_valid;
cstate->cur_lineno = save_cur_lineno;
}
--
2.17.1
On Sat, Jul 28, 2018 at 6:00 PM, David Rowley <david.rowley@2ndquadrant.com>
wrote:
Oops. Must've fallen off in transit :) Hopefully, it's more firmly
attached this time.
LGTM. Status changed to "ready for committer"
Two more thoughts:
- Add some tests. The if (nBufferedTuples > 0) that flushes the tuples
when the partition changes is not currently exercised.
- With proute becoming a function-level variable,
cstate->partition_tuple_routing is obsolete and could be removed. (No
point in saving this in cstate if it's only used in one function anyway.)
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018/07/30 17:33, Peter Eisentraut wrote:
- With proute becoming a function-level variable,
cstate->partition_tuple_routing is obsolete and could be removed. (No
point in saving this in cstate if it's only used in one function anyway.)
+1. Also seems to apply to transition_capture, which afaict, was added in
cstate only because partition_tuple_routing is there.
Thanks,
Amit
On 30 July 2018 at 20:33, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
Two more thoughts:
- Add some tests. The if (nBufferedTuples > 0) that flushes the tuples
when the partition changes is not currently exercised.
That seems like a good idea. In fact, it uncovered a bug around
ConvertPartitionTupleSlot() freeing the previously stored tuple out
of the slot, which resulted in a crash. I didn't notice before because
my test had previously not required any tuple conversions.
While ensuring my test was working correctly I noticed that I had a
thinko in the logic that decided if another avgTuplesPerPartChange
calculation was due. The problem was that (cstate->cur_lineno -
RECHECK_MULTI_INSERT_THRESHOLD) is unsigned, and when cur_lineno is
below RECHECK_MULTI_INSERT_THRESHOLD it results in an underflow, which
makes the if test always true until cur_lineno gets up to 1000. I
considered just making all those counters signed, but chickened out
when it came to changing "processed". I thought it was quite strange
to have "processed" unsigned and the other counters that do similar
things signed. Of course, signed would have enough range, but it
would mean changing the return type of CopyFrom() which I wasn't
willing to do.
In the end, I just protected the if test so that it only calculates
the average again if cur_lineno is at least
RECHECK_MULTI_INSERT_THRESHOLD. This slightly changes when
avgTuplesPerPartChange is first set, but it'll still be at least 1000
tuples before we start doing multi-inserts. Another option would be to
cast the expression as int64... I'm a bit undecided what's best here.
- With proute becoming a function-level variable,
cstate->partition_tuple_routing is obsolete and could be removed. (No
point in saving this in cstate if it's only used in one function anyway.)
Good idea. I've removed that.
I've attached a delta patch based on the v4 patch.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v5_multiinsert_copy_delta.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dacbe751c6..abe70097c5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -81,7 +81,7 @@ typedef enum EolType
} EolType;
/*
- * Represents the heap insert method to be used during COPY to.
+ * Represents the heap insert method to be used during COPY FROM.
*/
typedef enum CopyInsertMethod
{
@@ -179,8 +179,6 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- /* Tuple-routing support info */
- PartitionTupleRouting *partition_tuple_routing;
TransitionCaptureState *transition_capture;
@@ -2525,8 +2523,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- proute = cstate->partition_tuple_routing =
- ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
/*
* If we are capturing transition tuples, they may need to be
@@ -2773,7 +2770,8 @@ CopyFrom(CopyState cstate)
* too often.
*/
if (unlikely(lastPartitionSampleLineNo <= (cstate->cur_lineno -
- RECHECK_MULTI_INSERT_THRESHOLD)))
+ RECHECK_MULTI_INSERT_THRESHOLD)
+ && cstate->cur_lineno >= RECHECK_MULTI_INSERT_THRESHOLD))
{
avgTuplesPerPartChange =
(cstate->cur_lineno - lastPartitionSampleLineNo) /
@@ -2864,10 +2862,12 @@ CopyFrom(CopyState cstate)
/*
* We might need to convert from the parent rowtype to the
- * partition rowtype.
+ * partition rowtype. Don't free the already stored tuple as it
+ * may still be required for a multi-insert batch.
*/
tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
tuple,
+ false,
proute->partition_tuple_slot,
&slot);
@@ -3054,8 +3054,8 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(target_resultRelInfo);
/* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_tuple_routing)
- ExecCleanupTupleRouting(mtstate, cstate->partition_tuple_routing);
+ if (proute)
+ ExecCleanupTupleRouting(mtstate, proute);
/* Close any trigger target relations */
ExecCleanUpTriggerState(estate);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index cd0ec08461..8d1146216d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -773,6 +773,7 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
HeapTuple
ConvertPartitionTupleSlot(TupleConversionMap *map,
HeapTuple tuple,
+ bool shouldFree,
TupleTableSlot *new_slot,
TupleTableSlot **p_my_slot)
{
@@ -787,7 +788,7 @@ ConvertPartitionTupleSlot(TupleConversionMap *map,
*p_my_slot = new_slot;
Assert(new_slot != NULL);
ExecSetSlotDescriptor(new_slot, map->outdesc);
- ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+ ExecStoreTuple(tuple, new_slot, InvalidBuffer, shouldFree);
return tuple;
}
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f535762e2d..c5591bacc9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1163,6 +1163,7 @@ lreplace:;
tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
tuple = ConvertPartitionTupleSlot(tupconv_map,
tuple,
+ true,
proute->root_tuple_slot,
&slot);
@@ -1791,6 +1792,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
*/
ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
tuple,
+ true,
proute->partition_tuple_slot,
&slot);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..0a249985bb 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -204,6 +204,7 @@ extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
ResultRelInfo *rootRelInfo, int leaf_index);
extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
HeapTuple tuple,
+ bool shouldFree,
TupleTableSlot *new_slot,
TupleTableSlot **p_my_slot);
extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index cb13606d14..6d05f95e0d 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -133,3 +133,28 @@ this is just a line full of junk that would error out if parsed
\.
copy copytest3 to stdout csv header;
+
+-- test copy from with a partitioned table
+create table parted_copytest (
+ a int,
+ b int,
+ c text
+) partition by list (b);
+
+create table parted_copytest_a1 (c text, b int, a int);
+create table parted_copytest_a2 (a int, c text, b int);
+
+alter table parted_copytest attach partition parted_copytest_a1 for values in(1);
+alter table parted_copytest attach partition parted_copytest_a2 for values in(2);
+
+insert into parted_copytest select x,1,'One' from generate_series(1,1000) x;
+insert into parted_copytest select x,2,'Two' from generate_series(1001,1010) x;
+insert into parted_copytest select x,1,'One' from generate_series(1011,1020) x;
+
+copy (select * from parted_copytest order by a) to '@abs_builddir@/results/parted_copytest.csv';
+
+truncate parted_copytest;
+
+copy parted_copytest from '@abs_builddir@/results/parted_copytest.csv';
+
+drop table parted_copytest;
On 30/07/2018 15:26, David Rowley wrote:
- Add some tests. The if (nBufferedTuples > 0) that flushes the tuples
when the partition changes is not currently exercised.
That seems like a good idea. In fact, it uncovered a bug around
ConvertPartitionTupleSlot() freeing the previously stored tuple out
of the slot, which resulted in a crash. I didn't notice before because
my test had previously not required any tuple conversions.
I think we need to think of a better place to put that temporary file,
and clean it up properly afterwards. I'm not sure whether we have
existing uses like that.
Also, maybe the test should check afterwards that the right count of
rows ended up in each partition?
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Jul 30, 2018 at 11:21 AM, Peter Eisentraut <
peter.eisentraut@2ndquadrant.com> wrote:
On 30/07/2018 15:26, David Rowley wrote:
- Add some tests. The if (nBufferedTuples > 0) that flushes the tuples
when the partition changes is not currently exercised.
That seems like a good idea. In fact, it uncovered a bug around
ConvertPartitionTupleSlot() freeing the previously stored tuple out
of the slot, which resulted in a crash. I didn't notice before because
my test had previously not required any tuple conversions.
I think we need to think of a better place to put that temporary file,
and clean it up properly afterwards. I'm not sure whether we have
existing uses like that.
Also, maybe the test should check afterwards that the right count of
rows ended up in each partition?
Yea, I actually would suggest changing the data inserted in the third
insert statement to have 'Three' in the third column:
insert into parted_copytest select x,1,'One' from
generate_series(1011,1020) x;
And then this check:
select count(*) from parted_copytest group by a, b, c;
On 31 July 2018 at 06:21, Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
On 30/07/2018 15:26, David Rowley wrote:
- Add some tests. The if (nBufferedTuples > 0) that flushes the tuples
when the partition changes is not currently exercised.
That seems like a good idea. In fact, it uncovered a bug around
ConvertPartitionTupleSlot() freeing the previously stored tuple out
of the slot, which resulted in a crash. I didn't notice before because
my test had previously not required any tuple conversions.
I think we need to think of a better place to put that temporary file,
and clean it up properly afterwards. I'm not sure whether we have
existing uses like that.
Ideally, I could have written the test in copy2.sql, but since I had
to insert over 1000 rows to trigger the multi-insert code, copy from
stdin was not practical in the .sql file. Instead, I just followed the
lead in copy.source and used the same temp file style as the previous
test. It also leaves that file lying around; it just happens to be
smaller. I've updated the test to truncate the table again and
perform another COPY TO to empty the file. I didn't see any way to
remove the file, due to the lack of a standard way of removing files
across operating systems. \! rm ... is not going to work on Windows,
for example.
I also hadn't previously realised how the .source files work. I see
there are input and output directories. I'd missed updating the output
one in the v5 delta patch.
Also, maybe the test should check afterwards that the right count of
rows ended up in each partition?
Agreed. I've added that.
The attached v6 delta replaces the v5 delta and should be applied on
top of the full v4 patch. I'm using deltas in the hope that reviewing
the few changes is easier than reviewing the entire patch again.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
v6_multiinsert_copy_delta.patch (application/octet-stream)
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dacbe751c6..abe70097c5 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -81,7 +81,7 @@ typedef enum EolType
} EolType;
/*
- * Represents the heap insert method to be used during COPY to.
+ * Represents the heap insert method to be used during COPY FROM.
*/
typedef enum CopyInsertMethod
{
@@ -179,8 +179,6 @@ typedef struct CopyStateData
bool volatile_defexprs; /* is any of defexprs volatile? */
List *range_table;
- /* Tuple-routing support info */
- PartitionTupleRouting *partition_tuple_routing;
TransitionCaptureState *transition_capture;
@@ -2525,8 +2523,7 @@ CopyFrom(CopyState cstate)
*/
if (cstate->rel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
{
- proute = cstate->partition_tuple_routing =
- ExecSetupPartitionTupleRouting(NULL, cstate->rel);
+ proute = ExecSetupPartitionTupleRouting(NULL, cstate->rel);
/*
* If we are capturing transition tuples, they may need to be
@@ -2773,7 +2770,8 @@ CopyFrom(CopyState cstate)
* too often.
*/
if (unlikely(lastPartitionSampleLineNo <= (cstate->cur_lineno -
- RECHECK_MULTI_INSERT_THRESHOLD)))
+ RECHECK_MULTI_INSERT_THRESHOLD)
+ && cstate->cur_lineno >= RECHECK_MULTI_INSERT_THRESHOLD))
{
avgTuplesPerPartChange =
(cstate->cur_lineno - lastPartitionSampleLineNo) /
@@ -2864,10 +2862,12 @@ CopyFrom(CopyState cstate)
/*
* We might need to convert from the parent rowtype to the
- * partition rowtype.
+ * partition rowtype. Don't free the already stored tuple as it
+ * may still be required for a multi-insert batch.
*/
tuple = ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[leaf_part_index],
tuple,
+ false,
proute->partition_tuple_slot,
&slot);
@@ -3054,8 +3054,8 @@ CopyFrom(CopyState cstate)
ExecCloseIndices(target_resultRelInfo);
/* Close all the partitioned tables, leaf partitions, and their indices */
- if (cstate->partition_tuple_routing)
- ExecCleanupTupleRouting(mtstate, cstate->partition_tuple_routing);
+ if (proute)
+ ExecCleanupTupleRouting(mtstate, proute);
/* Close any trigger target relations */
ExecCleanUpTriggerState(estate);
diff --git a/src/backend/executor/execPartition.c b/src/backend/executor/execPartition.c
index cd0ec08461..8d1146216d 100644
--- a/src/backend/executor/execPartition.c
+++ b/src/backend/executor/execPartition.c
@@ -773,6 +773,7 @@ TupConvMapForLeaf(PartitionTupleRouting *proute,
HeapTuple
ConvertPartitionTupleSlot(TupleConversionMap *map,
HeapTuple tuple,
+ bool shouldFree,
TupleTableSlot *new_slot,
TupleTableSlot **p_my_slot)
{
@@ -787,7 +788,7 @@ ConvertPartitionTupleSlot(TupleConversionMap *map,
*p_my_slot = new_slot;
Assert(new_slot != NULL);
ExecSetSlotDescriptor(new_slot, map->outdesc);
- ExecStoreTuple(tuple, new_slot, InvalidBuffer, true);
+ ExecStoreTuple(tuple, new_slot, InvalidBuffer, shouldFree);
return tuple;
}
diff --git a/src/backend/executor/nodeModifyTable.c b/src/backend/executor/nodeModifyTable.c
index f535762e2d..c5591bacc9 100644
--- a/src/backend/executor/nodeModifyTable.c
+++ b/src/backend/executor/nodeModifyTable.c
@@ -1163,6 +1163,7 @@ lreplace:;
tupconv_map = tupconv_map_for_subplan(mtstate, map_index);
tuple = ConvertPartitionTupleSlot(tupconv_map,
tuple,
+ true,
proute->root_tuple_slot,
&slot);
@@ -1791,6 +1792,7 @@ ExecPrepareTupleRouting(ModifyTableState *mtstate,
*/
ConvertPartitionTupleSlot(proute->parent_child_tupconv_maps[partidx],
tuple,
+ true,
proute->partition_tuple_slot,
&slot);
diff --git a/src/include/executor/execPartition.h b/src/include/executor/execPartition.h
index 862bf65060..0a249985bb 100644
--- a/src/include/executor/execPartition.h
+++ b/src/include/executor/execPartition.h
@@ -204,6 +204,7 @@ extern TupleConversionMap *TupConvMapForLeaf(PartitionTupleRouting *proute,
ResultRelInfo *rootRelInfo, int leaf_index);
extern HeapTuple ConvertPartitionTupleSlot(TupleConversionMap *map,
HeapTuple tuple,
+ bool shouldFree,
TupleTableSlot *new_slot,
TupleTableSlot **p_my_slot);
extern void ExecCleanupTupleRouting(ModifyTableState *mtstate,
diff --git a/src/test/regress/input/copy.source b/src/test/regress/input/copy.source
index cb13606d14..cd98e83dcd 100644
--- a/src/test/regress/input/copy.source
+++ b/src/test/regress/input/copy.source
@@ -133,3 +133,37 @@ this is just a line full of junk that would error out if parsed
\.
copy copytest3 to stdout csv header;
+
+-- test copy from with a partitioned table
+create table parted_copytest (
+ a int,
+ b int,
+ c text
+) partition by list (b);
+
+create table parted_copytest_a1 (c text, b int, a int);
+create table parted_copytest_a2 (a int, c text, b int);
+
+alter table parted_copytest attach partition parted_copytest_a1 for values in(1);
+alter table parted_copytest attach partition parted_copytest_a2 for values in(2);
+
+-- We must insert enough rows to trigger multi-inserts. These are only
+-- enabled adaptively when there are few enough partition changes.
+insert into parted_copytest select x,1,'One' from generate_series(1,1000) x;
+insert into parted_copytest select x,2,'Two' from generate_series(1001,1010) x;
+insert into parted_copytest select x,1,'One' from generate_series(1011,1020) x;
+
+copy (select * from parted_copytest order by a) to '@abs_builddir@/results/parted_copytest.csv';
+
+truncate parted_copytest;
+
+copy parted_copytest from '@abs_builddir@/results/parted_copytest.csv';
+
+select tableoid::regclass,count(*),sum(a) from parted_copytest
+group by tableoid order by tableoid::regclass::name;
+
+-- clear out the parted_copytest.csv file to avoid taking up space
+truncate parted_copytest;
+copy (select * from parted_copytest order by a) to '@abs_builddir@/results/parted_copytest.csv';
+
+drop table parted_copytest;
diff --git a/src/test/regress/output/copy.source b/src/test/regress/output/copy.source
index b7e372d61b..edeceafce3 100644
--- a/src/test/regress/output/copy.source
+++ b/src/test/regress/output/copy.source
@@ -95,3 +95,33 @@ copy copytest3 to stdout csv header;
c1,"col with , comma","col with "" quote"
1,a,1
2,b,2
+-- test copy from with a partitioned table
+create table parted_copytest (
+ a int,
+ b int,
+ c text
+) partition by list (b);
+create table parted_copytest_a1 (c text, b int, a int);
+create table parted_copytest_a2 (a int, c text, b int);
+alter table parted_copytest attach partition parted_copytest_a1 for values in(1);
+alter table parted_copytest attach partition parted_copytest_a2 for values in(2);
+-- We must insert enough rows to trigger multi-inserts. These are only
+-- enabled adaptively when there are few enough partition changes.
+insert into parted_copytest select x,1,'One' from generate_series(1,1000) x;
+insert into parted_copytest select x,2,'Two' from generate_series(1001,1010) x;
+insert into parted_copytest select x,1,'One' from generate_series(1011,1020) x;
+copy (select * from parted_copytest order by a) to '@abs_builddir@/results/parted_copytest.csv';
+truncate parted_copytest;
+copy parted_copytest from '@abs_builddir@/results/parted_copytest.csv';
+select tableoid::regclass,count(*),sum(a) from parted_copytest
+group by tableoid order by tableoid::regclass::name;
+ tableoid | count | sum
+--------------------+-------+--------
+ parted_copytest_a1 | 1010 | 510655
+ parted_copytest_a2 | 10 | 10055
+(2 rows)
+
+-- clear out the parted_copytest.csv file to avoid taking up space
+truncate parted_copytest;
+copy (select * from parted_copytest order by a) to '@abs_builddir@/results/parted_copytest.csv';
+drop table parted_copytest;
On 31 July 2018 at 11:51, David Rowley <david.rowley@2ndquadrant.com> wrote:
The attached v6 delta replaces the v5 delta and should be applied on
top of the full v4 patch.
(now committed)
Many thanks for committing this Peter and many thanks to Melanie and
Karen for reviewing it!
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 02/08/2018 01:36, David Rowley wrote:
On 31 July 2018 at 11:51, David Rowley <david.rowley@2ndquadrant.com> wrote:
The attached v6 delta replaces the v5 delta and should be applied on
top of the full v4 patch.
(now committed)
Many thanks for committing this Peter and many thanks to Melanie and
Karen for reviewing it!
I had problems with my email as I was committing this, which is why I
didn't let you know. ;-)
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services