Batch insert in CTAS/MatView code

Started by Paul Guo almost 7 years ago, 27 messages
#1 Paul Guo
pguo@pivotal.io
1 attachment(s)

Hello, Postgres hackers,

The COPY code uses batch insert, via the function heap_multi_insert(), to
speed things up. It seems that CREATE TABLE AS and materialized views could
leverage that code to boost performance as well. Attached is a patch to
implement that. It was done by Taylor (cc-ed) and me.

The patch also modifies heap_multi_insert() slightly as a further code-level
optimization, using static memory instead of a memory context and dynamic
allocation. For ModifyTable inserts there are more constraints on batch
insert (triggers, etc.), but it should still be possible to batch insert in
the cases where those constraints do not apply.

By the way, while looking at the code I noticed that toast_insert_or_update()
declares 9 large local arrays, which looks like a stack overflow risk. Maybe
we should make them static or global.
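For illustration, the pattern of concern is roughly this (a contrived,
self-contained sketch; the real toast_insert_or_update() arrays and types
differ):

#include <stdbool.h>

#define EXAMPLE_MAX_ATTRS 1600	/* stands in for MaxHeapAttributeNumber */

/* Several arrays sized like these put tens of kilobytes on the stack per call. */
void
arrays_on_stack(void)
{
	long	values[EXAMPLE_MAX_ATTRS];
	bool	isnull[EXAMPLE_MAX_ATTRS];

	(void) values;
	(void) isnull;
}

/*
 * Static storage is allocated once and never touches the stack, at the cost
 * of the function no longer being reentrant.
 */
void
arrays_off_stack(void)
{
	static long values[EXAMPLE_MAX_ATTRS];
	static bool isnull[EXAMPLE_MAX_ATTRS];

	(void) values;
	(void) isnull;
}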

Here is some quick, simple performance testing on a mirrorless Postgres
instance with the SQL below. The tests cover tables with small columns,
large columns, and toasted data.

-- tuples with small size.
drop table if exists t1;
create table t1 (a int);

insert into t1 select * from generate_series(1, 10000000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are untoasted and data that is 1664 bytes wide
drop table if exists t1;
create table t1 (a name, b name, c name, d name, e name, f name, g name, h
name, i name, j name, k name, l name, m name, n name, o name, p name, q
name, r name, s name, t name, u name, v name, w name, x name, y name, z
name);

insert into t1 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
'z' from generate_series(1, 500000);
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

-- tuples that are toastable.
drop table if exists t1;
create table t1 (a text, b text, c text, d text, e text, f text, g text, h
text, i text, j text, k text, l text, m text, n text, o text, p text, q
text, r text, s text, t text, u text, v text, w text, x text, y text, z
text);

insert into t1 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from
generate_series(1,2000)) i;
drop table if exists t2;
\timing
create table t2 as select * from t1;
\timing

Here are the timing results:

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

For toast and large column sizes there is a small (< 10%) regression, but
for small column sizes the improvement is very good. Actually, if I hardcode
the batch count as 4, all test cases are better, though the improvement for
small column sizes is smaller than with the current patch. The number 4 is
quite case-specific, so I cannot hardcode it in the patch. Of course we could
tune it further, but the current value seems to be a good trade-off?

Thanks.

Attachments:

0001-Heap-batch-insert-for-CTAS.patch (application/octet-stream)
From b213809f03053efdba9370d89f70f90be827733d Mon Sep 17 00:00:00 2001
From: Paul Guo <pguo@pivotal.io>
Date: Thu, 28 Feb 2019 15:43:34 +0800
Subject: [PATCH] Heap batch insert for CTAS.

---
 src/backend/access/heap/heapam.c |  9 ++++-----
 src/backend/commands/copy.c      | 12 ++----------
 src/backend/commands/createas.c  | 30 +++++++++++++++++++++++++-----
 src/include/access/heapam.h      |  2 ++
 4 files changed, 33 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..2e56e85463 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2399,16 +2399,12 @@ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
  * That's faster than calling heap_insert() in a loop, because when multiple
  * tuples can be inserted on a single page, we can write just a single WAL
  * record covering all of them, and only need to lock/unlock the page once.
- *
- * Note: this leaks memory into the current memory context. You can create a
- * temporary context before calling this, if that's a problem.
  */
 void
 heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
 	PGAlignedBlock scratch;
@@ -2417,6 +2413,10 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+	/* Declared static so that this memory is not on the stack. */
+	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
+
+	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
 
 	/* currently not needed (thus unsupported) for heap_multi_insert() */
 	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2426,7 +2426,6 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the tuples */
-	heaptuples = palloc(ntuples * sizeof(HeapTuple));
 	for (i = 0; i < ntuples; i++)
 		heaptuples[i] = heap_prepare_insert(relation, tuples[i],
 											xid, cid, options);
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index dbb06397e6..a80a246a48 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -2335,7 +2335,6 @@ CopyFrom(CopyState cstate)
 	bool		has_instead_insert_row_trig;
 	bool		leafpart_use_multi_insert = false;
 
-#define MAX_BUFFERED_TUPLES 1000
 #define RECHECK_MULTI_INSERT_THRESHOLD 1000
 	HeapTuple  *bufferedTuples = NULL;	/* initialize to silence warning */
 	Size		bufferedTuplesSize = 0;
@@ -2644,7 +2643,7 @@ CopyFrom(CopyState cstate)
 		else
 			insertMethod = CIM_MULTI;
 
-		bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
+		bufferedTuples = palloc(MAX_MULTI_INSERT_TUPLES * sizeof(HeapTuple));
 	}
 
 	has_before_insert_row_trig = (resultRelInfo->ri_TrigDesc &&
@@ -2961,7 +2960,7 @@ CopyFrom(CopyState cstate)
 					 * large, to avoid using large amounts of memory for the
 					 * buffer when the tuples are exceptionally wide.
 					 */
-					if (nBufferedTuples == MAX_BUFFERED_TUPLES ||
+					if (nBufferedTuples == MAX_MULTI_INSERT_TUPLES ||
 						bufferedTuplesSize > 65535)
 					{
 						CopyFromInsertBatch(cstate, estate, mycid, hi_options,
@@ -3113,7 +3112,6 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
 					int nBufferedTuples, HeapTuple *bufferedTuples,
 					uint64 firstBufferedLineNo)
 {
-	MemoryContext oldcontext;
 	int			i;
 	uint64		save_cur_lineno;
 	bool		line_buf_valid = cstate->line_buf_valid;
@@ -3125,18 +3123,12 @@ CopyFromInsertBatch(CopyState cstate, EState *estate, CommandId mycid,
 	cstate->line_buf_valid = false;
 	save_cur_lineno = cstate->cur_lineno;
 
-	/*
-	 * heap_multi_insert leaks memory, so switch to short-lived memory context
-	 * before calling it.
-	 */
-	oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
 	heap_multi_insert(resultRelInfo->ri_RelationDesc,
 					  bufferedTuples,
 					  nBufferedTuples,
 					  mycid,
 					  hi_options,
 					  bistate);
-	MemoryContextSwitchTo(oldcontext);
 
 	/*
 	 * If there are any indexes, update them for all the inserted tuples, and
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 2bc8f928ea..837ec65920 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -61,6 +61,9 @@ typedef struct
 	CommandId	output_cid;		/* cmin to insert in output tuples */
 	int			hi_options;		/* heap_insert performance options */
 	BulkInsertState bistate;	/* bulk insert state */
+	HeapTuple   bufferedTuples[MAX_MULTI_INSERT_TUPLES];
+	int         nBufferedTuples;
+	int bufferedTuplesSize;
 } DR_intorel;
 
 /* utility functions for CTAS definition creation */
@@ -559,6 +562,7 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->hi_options = HEAP_INSERT_SKIP_FSM |
 		(XLogIsNeeded() ? 0 : HEAP_INSERT_SKIP_WAL);
 	myState->bistate = GetBulkInsertState();
+	myState->nBufferedTuples = 0;
 
 	/* Not using WAL requires smgr_targblock be initially invalid */
 	Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
@@ -579,11 +583,18 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 	 */
 	tuple = ExecCopySlotHeapTuple(slot);
 
-	heap_insert(myState->rel,
-				tuple,
-				myState->output_cid,
-				myState->hi_options,
-				myState->bistate);
+	myState->bufferedTuples[myState->nBufferedTuples++] = tuple;
+	myState->bufferedTuplesSize += tuple->t_len;
+
+	if (myState->nBufferedTuples == MAX_MULTI_INSERT_TUPLES ||
+		myState->bufferedTuplesSize >= 65535)
+	{
+		heap_multi_insert(myState->rel, myState->bufferedTuples,
+						  myState->nBufferedTuples, myState->output_cid,
+						  myState->hi_options, myState->bistate);
+		myState->nBufferedTuples = 0;
+		myState->bufferedTuplesSize = 0;
+	}
 
 	/* We know this is a newly created relation, so there are no indexes */
 
@@ -598,6 +609,15 @@ intorel_shutdown(DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
 
+	if (myState->nBufferedTuples != 0)
+	{
+		heap_multi_insert(myState->rel, myState->bufferedTuples,
+						  myState->nBufferedTuples, myState->output_cid,
+						  myState->hi_options, myState->bistate);
+		myState->nBufferedTuples = 0;
+		myState->bufferedTuplesSize = 0;
+	}
+
 	FreeBulkInsertState(myState->bistate);
 
 	/* If we skipped using WAL, must heap_sync before commit */
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index ab0879138f..927babd4e6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -37,6 +37,8 @@ typedef struct BulkInsertStateData *BulkInsertState;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
+#define MAX_MULTI_INSERT_TUPLES	1000
+
 /*
  * When heap_update, heap_delete, or heap_lock_tuple fail because the target
  * tuple is already outdated, they fill in this struct to provide information
-- 
2.17.2

#2 Michael Paquier
michael@paquier.xyz
In reply to: Paul Guo (#1)
Re: Batch insert in CTAS/MatView code

Hi,

On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:

The copy code has used batch insert with function heap_multi_insert() to
speed up. It seems that Create Table As or Materialized View could leverage
that code also to boost the performance also. Attached is a patch to
implement that. That was done by Taylor (cc-ed) and me.

Please note that we are currently in the last commit fest of Postgres
12, and it is too late to propose new features. Please feel free to
add an entry to the commit fest happening afterwards.
--
Michael

#3 Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Paul Guo (#1)
Re: Batch insert in CTAS/MatView code

On 06/03/2019 22:06, Paul Guo wrote:

The patch also modifies heap_multi_insert() a bit to do a bit further
code-level optimization by using static memory, instead of using memory
context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.

Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.
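For illustration, the caller-side pattern could look roughly like this (a
minimal sketch with made-up names, not the actual copy.c code;
CopyFromInsertBatch() with GetPerTupleMemoryContext() is the in-tree example
of the same idea):

#include "postgres.h"
#include "access/heapam.h"
#include "utils/memutils.h"

/*
 * Minimal sketch: run the leak-prone heap_multi_insert() call inside a
 * short-lived memory context and reset that context afterwards, so anything
 * palloc'd by heap_prepare_insert() (e.g. toasted tuple copies) is reclaimed
 * wholesale instead of being pfree'd piecemeal.  mi_context would be created
 * once with AllocSetContextCreate(..., ALLOCSET_DEFAULT_SIZES).
 */
static void
flush_buffered_tuples(Relation rel, HeapTuple *tuples, int ntuples,
					  CommandId cid, int options, BulkInsertState bistate,
					  MemoryContext mi_context)
{
	MemoryContext oldcontext = MemoryContextSwitchTo(mi_context);

	heap_multi_insert(rel, tuples, ntuples, cid, options, bistate);

	MemoryContextSwitchTo(oldcontext);
	MemoryContextReset(mi_context);		/* frees the leaked copies */
}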

By the way, while looking at the code, I noticed that there are 9 local
arrays with large length in toast_insert_or_update() which seems to be a
risk of stack overflow. Maybe we should put it as static or global.

Hmm. We currently reserve 512 kB between the kernel's limit and the
limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right. So there's still a lot of headroom. I agree that it nevertheless
seems a bit excessive, though.
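(For what it's worth, backend code that does worry about deep recursion
usually guards itself with check_stack_depth(); a minimal sketch, unrelated
to toast_insert_or_update() itself:)

#include "postgres.h"
#include "miscadmin.h"

/*
 * Minimal sketch of the usual guard: error out before recursion eats into
 * the STACK_DEPTH_SLOP headroom mentioned above.
 */
static void
recursive_worker(int depth)
{
	check_stack_depth();	/* ereport(ERROR)s if the stack is too deep */

	if (depth > 0)
		recursive_worker(depth - 1);
}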

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

Nice speedup!

While for toast and large column size there is < 10% decrease but for
small column size the improvement is super good. Actually if I hardcode
the batch count as 4 all test cases are better but the improvement for
small column size is smaller than that with current patch. Pretty much
the number 4 is quite case specific so I can not hardcode that in the
patch. Of course we could further tune that but the current value seems
to be a good trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

- Heikki

#4 Paul Guo
pguo@pivotal.io
In reply to: Heikki Linnakangas (#3)
Re: Batch insert in CTAS/MatView code

Sorry for the late reply.

To Michael: thank you. I know this commitfest is ongoing and I'm not
targeting it.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/03/2019 22:06, Paul Guo wrote:

The patch also modifies heap_multi_insert() a bit to do a bit further
code-level optimization by using static memory, instead of using memory
context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.

Thanks. I checked the code for that but apparently missed that one. I'll
see what context is appropriate for CTAS; for the copy code, maybe just
revert my change.

Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.

By the way, while looking at the code, I noticed that there are 9 local
arrays with large length in toast_insert_or_update() which seems to be a
risk of stack overflow. Maybe we should put it as static or global.

Hmm. We currently reserve 512 kB between the kernel's limit, and the
limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right. So there's still a lot of headroom. I agree that it nevertheless
seems a bit excessive, though.

I was worried about it being called recursively, but there is probably no
such concern for toast_insert_or_update().

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

Nice speedup!

While for toast and large column size there is < 10% decrease but for
small column size the improvement is super good. Actually if I hardcode
the batch count as 4 all test cases are better but the improvement for
small column size is smaller than that with current patch. Pretty much
the number 4 is quite case specific so I can not hardcode that in the
patch. Of course we could further tune that but the current value seems
to be a good trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

Thanks for the suggestion. I'll explore this a bit more.


- Heikki

#5 David Fetter
david@fetter.org
In reply to: Paul Guo (#1)
Re: Batch insert in CTAS/MatView code

On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:

Hello, Postgres hackers,

The copy code has used batch insert with function heap_multi_insert() to
speed up. It seems that Create Table As or Materialized View could leverage
that code also to boost the performance also. Attached is a patch to
implement that.

This is great!

Is this optimization doable for multi-row INSERTs, either with tuples
spelled out in the body of the query or in constructs like INSERT ...
SELECT ...?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#6 Paul Guo
pguo@pivotal.io
In reply to: David Fetter (#5)
Re: Batch insert in CTAS/MatView code

On Mon, Mar 11, 2019 at 2:58 AM David Fetter <david@fetter.org> wrote:

On Wed, Mar 06, 2019 at 10:06:27PM +0800, Paul Guo wrote:

Hello, Postgres hackers,

The copy code has used batch insert with function heap_multi_insert() to
speed up. It seems that Create Table As or Materialized View could

leverage

that code also to boost the performance also. Attached is a patch to
implement that.

This is great!

Is this optimization doable for multi-row INSERTs, either with tuples
spelled out in the body of the query or in constructs like INSERT ...
SELECT ...?

Yes. That's the "batch insert" in the ModifyTable nodes that I mentioned in
the first email. By the way, batching is a common optimization for
iterator-style models (like the Postgres executor), so in theory it should
benefit many other executor nodes as well.


Best,
David.
--
David Fetter <david(at)fetter(dot)org>
http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#7 Paul Guo
pguo@pivotal.io
In reply to: Heikki Linnakangas (#3)
1 attachment(s)
Re: Batch insert in CTAS/MatView code

Hi all,

I had been working on other things until recently, when I restarted this
work, profiling and refactoring the code. It's been a long time since the
last patch was proposed. The new patch has first been rebased for
4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible,
finish split of existing slot type).

intorel_receive() now takes a TupleTableSlot instead of a HeapTuple as its
argument, so we cannot get the tuple length directly. The patch computes the
tuple length exactly when all columns have fixed widths; otherwise it
calculates an average tuple length from the first MAX_MULTI_INSERT_SAMPLES
(defined as 1000) tuples and uses that to track the total length of the
tuples in a batch.

I noticed that batch insert sometimes needs an additional memory copy
compared with "single insert" (which should be the reason we previously saw
small regressions), so a good solution seems to be falling back to "single
insert" when the tuple length exceeds a threshold. I set this to 2000 after
quick testing.
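The decision logic is roughly the following (an illustrative sketch; the real
code lives in intorel_startup()/intorel_receive() in the attached patch):

#include "postgres.h"
#include "access/tupdesc.h"

/*
 * Sketch only: decide whether a batch (multi) insert is worthwhile based on
 * an exact or estimated tuple length.  avg_sampled_len is the average over
 * the first MAX_MULTI_INSERT_SAMPLES tuples, computed elsewhere.
 */
static bool
use_multi_insert(TupleDesc typeinfo, int avg_sampled_len)
{
	int			tup_len = 0;
	bool		fixed_width = true;
	int			attnum;

	/* If every attribute has a fixed width, the length is known up front. */
	for (attnum = 0; attnum < typeinfo->natts; attnum++)
	{
		Form_pg_attribute attr = TupleDescAttr(typeinfo, attnum);

		if (attr->attlen > 0)
			tup_len += attr->attlen;
		else
			fixed_width = false;	/* variable-width column: sample instead */
	}

	if (!fixed_width)
		tup_len = avg_sampled_len;

	/* Wide rows fall back to plain table_tuple_insert(). */
	return tup_len < MAX_TUP_LEN_FOR_MULTI_INSERT;
}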

To make the test stable and strict, I run a checkpoint before each CTAS; the
test script looks like this:

checkpoint;
\timing
create table tt as select a,b,c from t11;
\timing
drop table tt;

Previously I only tested BufferHeapTupleTableSlot (i.e. create table tt as
select * from t11); this time I additionally test VirtualTupleTableSlot
(i.e. create table tt as select a,b,c from t11), since VirtualTupleTableSlot
seems very common in real cases.

I tested four kinds of tables; see the SQL below.

-- tuples with small size.
create table t11 (a int, b int, c int, d int);
insert into t11 select s,s,s,s from generate_series(1, 10000000) s;
analyze t11;

-- tuples that are untoasted and tuple size is 1984 bytes.
create table t12 (a name, b name, c name, d name, e name, f name, g name, h
name, i name, j name, k name, l name, m name, n name, o name, p name, q
name, r name, s name, t name, u name, v name, w name, x name, y name, z
name, a1 name, a2 name, a3 name, a4 name, a5 name);
insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 500000);
analyze t12;

-- tuples that are untoasted and tuple size is 2112 bytes.
create table t13 (a name, b name, c name, d name, e name, f name, g name, h
name, i name, j name, k name, l name, m name, n name, o name, p name, q
name, r name, s name, t name, u name, v name, w name, x name, y name, z
name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 500000);
analyze t13;

-- tuples that are toastable and tuple compressed size is 1084.
create table t14 (a text, b text, c text, d text, e text, f text, g text, h
text, i text, j text, k text, l text, m text, n text, o text, p text, q
text, r text, s text, t text, u text, v text, w text, x text, y text, z
text);
insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from
generate_series(1,5000)) i;
analyze t14;

I also tested two scenarios for each case.

The first cleans up all kernel caches (page, inode and dentry caches on
Linux) with the command below and then runs the test:
sync; echo 3 > /proc/sys/vm/drop_caches
After all tests have run, every relation file is in the kernel cache (my
test system has enough memory to hold them all), so I then run the tests
again as the second scenario. I run it this way because in a real scenario
the result should fall somewhere between the two. I also rerun each test and
report the averages as the experiment results. Below are some results:

scenario 1: all related kernel caches are cleaned up (the first two columns
are times in seconds).

baseline  patch  diff%    SQL
10.1      5.57   44.85%   create table tt as select * from t11;
10.7      5.52   48.41%   create table tt as select a,b,c from t11;
9.57      10.2   -6.58%   create table tt as select * from t12;
9.64      8.63   10.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
14.2      14.46  -1.83%   create table tt as select * from t13;
11.88     12.05  -1.43%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.17      3.25   -2.52%   create table tt as select * from t14;
2.93      3.12   -6.48%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

scenario 2: all related kernel caches are populated after the previous testing.

baseline  patch  diff%    SQL
9.6       4.97   48.23%   create table tt as select * from t11;
10.41     5.32   48.90%   create table tt as select a,b,c from t11;
9.12      9.52   -4.38%   create table tt as select * from t12;
9.66      8.6    10.97%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;
13.56     13.6   -0.30%   create table tt as select * from t13;
11.36     11.7   -2.99%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from t13;
3.08      3.13   -1.62%   create table tt as select * from t14;
2.95      3.03   -2.71%   create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

From the above we can draw some tentative conclusions:

1. t11: for small tuples, batch insert helps a lot (40%+).

2. t12: for BufferHeapTupleTableSlot the patch slows things down by 4-6%, but
for VirtualTupleTableSlot it improves them by about 10%. Looking at
execTuples.c, this appears to be largely due to the additional memory copy.
VirtualTupleTableSlot seems more common than BufferHeapTupleTableSlot, so the
current code should probably be fine for most real cases. Alternatively, we
could decide whether to multi-insert based on the input slot type, but that
seems ugly in the code; or we could lower the threshold a bit further so that
"create table tt as select * from t12;" also improves, although that hurts
the VirtualTupleTableSlot case.

3. t13: the new code still uses single insert, so the difference should be
small. I just wanted to see the regression even when we use single insert.

4. t14 (toast): the degradation is small, not a big deal.

By the way, have we tried or thought about allowing better prefetch (on
Linux) for seqscans, i.e. POSIX_FADV_SEQUENTIAL in posix_fadvise() to enlarge
the kernel readahead window? Presumably this would help if sequential tuple
handling is faster than the default kernel readahead setting.
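A minimal sketch of what that might look like at the file-descriptor level
(hypothetical; PostgreSQL does not currently do this for sequential scans):

#include <fcntl.h>

/*
 * Hypothetical sketch: tell the kernel that a relation segment will be read
 * sequentially so it can enlarge its readahead window.  'fd' would be the
 * open segment's file descriptor.
 */
static void
advise_sequential_read(int fd)
{
#ifdef POSIX_FADV_SEQUENTIAL
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
}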

v2 patch is attached.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:


On 06/03/2019 22:06, Paul Guo wrote:

The patch also modifies heap_multi_insert() a bit to do a bit further
code-level optimization by using static memory, instead of using memory
context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.

Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.

By the way, while looking at the code, I noticed that there are 9 local
arrays with large length in toast_insert_or_update() which seems to be a
risk of stack overflow. Maybe we should put it as static or global.

Hmm. We currently reserve 512 kB between the kernel's limit, and the
limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right. So there's still a lot of headroom. I agree that it nevertheless
seems a bit excessive, though.

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

Nice speedup!

While for toast and large column size there is < 10% decrease but for
small column size the improvement is super good. Actually if I hardcode
the batch count as 4 all test cases are better but the improvement for
small column size is smaller than that with current patch. Pretty much
the number 4 is quite case specific so I can not hardcode that in the
patch. Of course we could further tune that but the current value seems
to be a good trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

- Heikki

Attachments:

v2-0001-Heap-batch-insert-for-CTAS-MatView.patch (application/octet-stream)
From e4ae67f650a46c6127f55b2efbf54b2314a61945 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Thu, 28 Feb 2019 15:43:34 +0800
Subject: [PATCH v2] Heap batch insert for CTAS/MatView.

---
 src/backend/access/heap/heapam.c |   6 +-
 src/backend/commands/copy.c      |  24 ++---
 src/backend/commands/createas.c  | 153 ++++++++++++++++++++++++++++++-
 src/include/access/heapam.h      |  11 +++
 4 files changed, 169 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8ac0f8a513..5f5ed06e2d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2111,7 +2111,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
 	PGAlignedBlock scratch;
@@ -2120,6 +2119,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+	/* Declared static so that this memory is not on the stack. */
+	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
+
+	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
 
 	/* currently not needed (thus unsupported) for heap_multi_insert() */
 	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2129,7 +2132,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the slots */
-	heaptuples = palloc(ntuples * sizeof(HeapTuple));
 	for (i = 0; i < ntuples; i++)
 	{
 		HeapTuple	tuple;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 84c54fbc70..5e0e929034 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -234,18 +234,6 @@ typedef struct
 	uint64		processed;		/* # of tuples processed */
 } DR_copy;
 
-
-/*
- * No more than this many tuples per CopyMultiInsertBuffer
- *
- * Caution: Don't make this too big, as we could end up with this many
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
- * multiInsertBuffers list.  Increasing this can cause quadratic growth in
- * memory requirements during copies into partitioned tables with a large
- * number of partitions.
- */
-#define MAX_BUFFERED_TUPLES		1000
-
 /*
  * Flush buffers if there are >= this many bytes, as counted by the input
  * size, of tuples stored.
@@ -258,11 +246,11 @@ typedef struct
 /* Stores multi-insert data related to a single relation in CopyFrom. */
 typedef struct CopyMultiInsertBuffer
 {
-	TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
+	TupleTableSlot *slots[MAX_MULTI_INSERT_TUPLES]; /* Array to store tuples */
 	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for 'relid' */
 	BulkInsertState bistate;	/* BulkInsertState for this rel */
 	int			nused;			/* number of 'slots' containing tuples */
-	uint64		linenos[MAX_BUFFERED_TUPLES];	/* Line # of tuple in copy
+	uint64		linenos[MAX_MULTI_INSERT_TUPLES];	/* Line # of tuple in copy
 												 * stream */
 } CopyMultiInsertBuffer;
 
@@ -2352,7 +2340,7 @@ CopyMultiInsertBufferInit(ResultRelInfo *rri)
 	CopyMultiInsertBuffer *buffer;
 
 	buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer));
-	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);
+	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
 	buffer->resultRelInfo = rri;
 	buffer->bistate = GetBulkInsertState();
 	buffer->nused = 0;
@@ -2411,7 +2399,7 @@ CopyMultiInsertInfoInit(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
 static inline bool
 CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
 {
-	if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES ||
+	if (miinfo->bufferedTuples >= MAX_MULTI_INSERT_TUPLES ||
 		miinfo->bufferedBytes >= MAX_BUFFERED_BYTES)
 		return true;
 	return false;
@@ -2531,7 +2519,7 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertBuffer *buffer)
 	FreeBulkInsertState(buffer->bistate);
 
 	/* Since we only create slots on demand, just drop the non-null ones. */
-	for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && buffer->slots[i] != NULL; i++)
 		ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
 	pfree(buffer);
@@ -2617,7 +2605,7 @@ CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
 	int			nused = buffer->nused;
 
 	Assert(buffer != NULL);
-	Assert(nused < MAX_BUFFERED_TUPLES);
+	Assert(nused < MAX_MULTI_INSERT_TUPLES);
 
 	if (buffer->slots[nused] == NULL)
 		buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL);
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4c1d909d38..1d5ccf74bd 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,6 +62,15 @@ typedef struct
 	CommandId	output_cid;		/* cmin to insert in output tuples */
 	int			ti_options;		/* table_tuple_insert performance options */
 	BulkInsertState bistate;	/* bulk insert state */
+	MemoryContext	mi_context;	/* Memory context for multi insert */
+	int				tup_len;	/* accurate or average tuple length. */
+	/* Below are buffered slots and related information. */
+	TupleTableSlot	*buffered_slots[MAX_MULTI_INSERT_TUPLES];
+	int				buffered_slots_num;	/* How many buffered slots for multi insert */
+	int				buffered_slots_size;	/* Total tuple size for multi insert */
+	/* Below are variables for sampling (to calculate avg. tup_len if needed). */
+	int				sampled_tuples_num; /* -1 means no sampling is needed. */
+	uint64			sampled_tuples_size; /* Total tuple size of samples. */
 } DR_intorel;
 
 /* utility functions for CTAS definition creation */
@@ -441,6 +450,8 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	RangeTblEntry *rte;
 	ListCell   *lc;
 	int			attnum;
+	int			tup_len;
+	bool		use_sampling;
 
 	Assert(into != NULL);		/* else somebody forgot to set it */
 
@@ -456,12 +467,22 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	 */
 	attrList = NIL;
 	lc = list_head(into->colNames);
+	tup_len = 0;
+	use_sampling = false;
 	for (attnum = 0; attnum < typeinfo->natts; attnum++)
 	{
 		Form_pg_attribute attribute = TupleDescAttr(typeinfo, attnum);
 		ColumnDef  *col;
 		char	   *colname;
 
+		if (attribute->attlen > 0)
+		{
+			if (!use_sampling)
+				tup_len += attribute->attlen;
+		}
+		else
+			use_sampling = true; /* Update tup_len via sampling. */
+
 		if (lc)
 		{
 			colname = strVal(lfirst(lc));
@@ -561,11 +582,59 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->ti_options = TABLE_INSERT_SKIP_FSM |
 		(XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
 	myState->bistate = GetBulkInsertState();
+	memset(myState->buffered_slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
+	myState->buffered_slots_num = 0;
+	myState->buffered_slots_size = 0;
+	myState->tup_len = use_sampling ? 0 : tup_len;
+	myState->sampled_tuples_num = use_sampling ? 0 : -1;
+	myState->sampled_tuples_size = 0;
+
+	/*
+	 * Create a temporary memory context so that we can reset once per
+	 * multi insert.
+	 */
+	myState->mi_context = AllocSetContextCreate(CurrentMemoryContext,
+												"intorel_multi_insert",
+												ALLOCSET_DEFAULT_SIZES);
 
 	/* Not using WAL requires smgr_targblock be initially invalid */
 	Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
+/*
+ * If the tuple length, which is obtained either through sampling on tuples with
+ * variable length attribute(s), or through calculating for tuples with
+ * accurate length attributes, is larger than or equal to this value, we do
+ * not use multi insert since memory copy overhead could decrease the
+ * benefit of multi insert.
+ */
+#define MAX_TUP_LEN_FOR_MULTI_INSERT	2000
+
+/* How many initial tuples are sampled to calculate the average tuple length? */
+#define MAX_MULTI_INSERT_SAMPLES		1000
+
+static void
+intorel_flush_multi_insert(DR_intorel *myState)
+{
+	MemoryContext oldcontext;
+	int i;
+
+	oldcontext = MemoryContextSwitchTo(myState->mi_context);
+
+	table_multi_insert(myState->rel, myState->buffered_slots,
+					   myState->buffered_slots_num, myState->output_cid,
+					   myState->ti_options, myState->bistate);
+
+	MemoryContextReset(myState->mi_context);
+	MemoryContextSwitchTo(oldcontext);
+
+	for (i = 0; i < myState->buffered_slots_num; i++)
+		ExecClearTuple(myState->buffered_slots[i]);
+
+	myState->buffered_slots_num = 0;
+	myState->buffered_slots_size = 0;
+}
+
 /*
  * intorel_receive --- receive one tuple
  */
@@ -573,6 +642,8 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	TupleTableSlot *batchslot;
+	HeapTuple tuple;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -583,11 +654,72 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 	 * tuple's xmin), but since we don't do that here...
 	 */
 
-	table_tuple_insert(myState->rel,
-					   slot,
-					   myState->output_cid,
-					   myState->ti_options,
-					   myState->bistate);
+	/*
+	 * If the accurate/average tuple length is large, do single insert.
+	 * We do not call ExecFetchSlotHeapTuple() for the input slot to get
+	 * accurate tuple length here since sometimes it is wasteful to call
+	 * it again in table_tuple_insert(), e.g. VirtualTupleTableSlot
+	 */
+	if (myState->tup_len >= MAX_TUP_LEN_FOR_MULTI_INSERT)
+	{
+		table_tuple_insert(myState->rel,
+						   slot,
+						   myState->output_cid,
+						   myState->ti_options,
+						   myState->bistate);
+		return true;
+	}
+
+	/* Copy the slot to batchslot lists and materialize them. */
+	if (myState->buffered_slots[myState->buffered_slots_num] == NULL)
+	{
+		batchslot = table_slot_create(myState->rel, NULL);
+		myState->buffered_slots[myState->buffered_slots_num] = batchslot;
+	}
+	else
+		batchslot = myState->buffered_slots[myState->buffered_slots_num];
+
+	ExecCopySlot(batchslot, slot);
+	/*
+	 * In theory we do not need to materialize here, but if both input slot and
+	 * dst slot are BufferHeapTupleTableSlot, there might be hot code in
+	 * ResourceOwnerForgetBuffer() and ResourceOwnerRememberBuffer()
+	 * since we do them in batch. We could easily work around this by doing
+	 * materialize in advance. This is harmless since later when calling
+	 * table_multi_insert(), we need materialize also.
+	 */
+	ExecMaterializeSlot(batchslot);
+	myState->buffered_slots_num++;
+
+	if (myState->sampled_tuples_num < 0 ||
+		myState->sampled_tuples_num == MAX_MULTI_INSERT_SAMPLES)
+		myState->buffered_slots_size += myState->tup_len;
+	else
+	{
+		/*
+		 * Sampling to get the rough average tuple length for later use.
+		 * We do not use plan width since that is inaccurate sometimes.
+		 */
+		tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+		myState->buffered_slots_size += tuple->t_len;
+		myState->sampled_tuples_size += tuple->t_len;
+		myState->sampled_tuples_num++;
+
+		/*
+		 * Just finished sampling. Let's update myState->tup_len and
+		 * flush the tuples since in next call we possibly do single insert.
+		 */
+		if(myState->sampled_tuples_num == MAX_MULTI_INSERT_SAMPLES)
+		{
+			myState->tup_len = myState->sampled_tuples_size / myState->sampled_tuples_num;
+			intorel_flush_multi_insert(myState);
+		}
+	}
+
+	if (myState->buffered_slots_num == MAX_MULTI_INSERT_TUPLES ||
+		myState->buffered_slots_size >= 65535)
+		intorel_flush_multi_insert(myState);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
@@ -601,11 +733,22 @@ static void
 intorel_shutdown(DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	int i;
+
+	if (myState->buffered_slots_num != 0)
+		intorel_flush_multi_insert(myState);
+
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && myState->buffered_slots[i] != NULL; i++)
+		ExecDropSingleTupleTableSlot(myState->buffered_slots[i]);
 
 	FreeBulkInsertState(myState->bistate);
 
 	table_finish_bulk_insert(myState->rel, myState->ti_options);
 
+	if (myState->mi_context)
+		MemoryContextDelete(myState->mi_context);
+	myState->mi_context = NULL;
+
 	/* close rel, but keep lock until commit */
 	table_close(myState->rel, NoLock);
 	myState->rel = NULL;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index dffb57bf11..e90c6a3fc6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -40,6 +40,17 @@ struct TupleTableSlot;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
+/*
+ * No more than this many tuples per MultiInsertBuffer
+ *
+ * Caution: Don't make this too big. For COPY, we could end up with this many
+ * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
+ * multiInsertBuffers list.  Increasing this can cause quadratic growth in
+ * memory requirements during copies into partitioned tables with a large
+ * number of partitions. For CTAS/MatView, the impact is similar.
+ */
+#define MAX_MULTI_INSERT_TUPLES	1000
+
 /*
  * Descriptor for heap table scans.
  */
-- 
2.17.2

#8 Paul Guo
pguo@pivotal.io
In reply to: Paul Guo (#7)
1 attachment(s)
Re: Batch insert in CTAS/MatView code

On Mon, Jun 17, 2019 at 8:53 PM Paul Guo <pguo@pivotal.io> wrote:

Hi all,

I've been working other things until recently I restarted the work,
profiling & refactoring the code.
It's been a long time since the last patch was proposed. The new patch has
now been firstly refactored due to
4da597edf1bae0cf0453b5ed6fc4347b6334dfe1 (Make TupleTableSlots extensible,
finish split of existing slot type).

Now that TupleTableSlot, instead of HeapTuple is one argument of
intorel_receive() so we can not get the
tuple length directly. This patch now gets the tuple length if we know all
columns are with fixed widths, else
we calculate an avg. tuple length using the first MAX_MULTI_INSERT_SAMPLES
(defined as 1000) tuples
and use for the total length of tuples in a batch.

I noticed that to do batch insert, we might need additional memory copy
sometimes comparing with "single insert"
(that should be the reason that we previously saw a bit regressions) so a
good solution seems to fall back
to "single insert" if the tuple length is larger than a threshold. I set
this as 2000 after quick testing.

To make test stable and strict, I run checkpoint before each ctas, the
test script looks like this:

checkpoint;
\timing
create table tt as select a,b,c from t11;
\timing
drop table tt;

Also previously I just tested the BufferHeapTupleTableSlot (i.e. create
table tt as select * from t11),
this time I test VirtualTupleTableSlot (i.e. create table tt as select
a,b,c from t11) additionally.
It seems that VirtualTupleTableSlot is very common in real cases.

I tested four kinds of tables, see below SQLs.

-- tuples with small size.
create table t11 (a int, b int, c int, d int);
insert into t11 select s,s,s,s from generate_series(1, 10000000) s;
analyze t11;

-- tuples that are untoasted and tuple size is 1984 bytes.
create table t12 (a name, b name, c name, d name, e name, f name, g name,
h name, i name, j name, k name, l name, m name, n name, o name, p name, q
name, r name, s name, t name, u name, v name, w name, x name, y name, z
name, a1 name, a2 name, a3 name, a4 name, a5 name);
insert into t12 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
'z', 'a', 'b', 'c', 'd', 'e' from generate_series(1, 500000);
analyze t12;

-- tuples that are untoasted and tuple size is 2112 bytes.
create table t13 (a name, b name, c name, d name, e name, f name, g name,
h name, i name, j name, k name, l name, m name, n name, o name, p name, q
name, r name, s name, t name, u name, v name, w name, x name, y name, z
name, a1 name, a2 name, a3 name, a4 name, a5 name, a6 name, a7 name);
insert into t13 select 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y',
'z', 'a', 'b', 'c', 'd', 'e', 'f', 'g' from generate_series(1, 500000);
analyze t13;

-- tuples that are toastable and tuple compressed size is 1084.
create table t14 (a text, b text, c text, d text, e text, f text, g text,
h text, i text, j text, k text, l text, m text, n text, o text, p text, q
text, r text, s text, t text, u text, v text, w text, x text, y text, z
text);
insert into t14 select i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i, i,
i, i, i, i, i, i, i, i, i from (select repeat('123456789', 10000) from
generate_series(1,5000)) i;
analyze t14;

I also tested two scenarios for each testing.

One is to clean up all kernel caches (page & inode & dentry on Linux)
using the command below and then run the test,
sync; echo 3 > /proc/sys/vm/drop_caches
After running all tests all relation files will be in kernel cache (my
test system memory is large enough to accommodate all relation files),
then I run the tests again. I run like this because in real scenario the
result of the test should be among the two results. Also I rerun
each test and finally I calculate the average results as the experiment
results. Below are some results:

scenario1: All related kernel caches are cleaned up (note the first two
columns are time with second).

baseline patch diff% SQL

10.1 5.57 44.85% create table tt as select * from t11;

10.7 5.52 48.41% create table tt as select a,b,c from t11;

9.57 10.2 -6.58% create table tt as select * from t12;

9.64 8.63 10.48% create table tt as select
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;

14.2 14.46 -1.83% create table tt as select * from t13;

11.88 12.05 -1.43% create table tt as select
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from
t13;

3.17 3.25 -2.52% create table tt as select * from t14;

2.93 3.12 -6.48% create table tt as select
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

scenario2: all related kernel caches are populated after previous testing.

baseline patch diff% SQL

9.6 4.97 48.23% create table tt as select * from t11;

10.41 5.32 48.90% create table tt as select a,b,c from t11;

9.12 9.52 -4.38% create table tt as select * from t12;

9.66 8.6 10.97% create table tt as select
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4 from t12;

13.56 13.6 -0.30% create table tt as select * from t13;

11.36 11.7 -2.99% create table tt as select
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,a1,a2,a3,a4,a5,a6 from
t13;

3.08 3.13 -1.62% create table tt as select * from t14;

2.95 3.03 -2.71% create table tt as select
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y from t14;

From above we can get some tentative conclusions:

1. t11: For short-size tables, batch insert improves much (40%+).

2. t12: For BufferHeapTupleTableSlot, the patch slows down 4.x%-6.x%, but
for VirtualTupleTableSlot it improves 10.x%.
If we look at execTuples.c, it looks like this is quite relevant to
additional memory copy. It seems that VirtualTupleTableSlot is
more common than the BufferHeapTupleTableSlot so probably the current code
should be fine for most real cases. Or it's possible
to determine multi-insert also according to the input slot tuple but this
seems to be ugly in code. Or continue to lower the threshold a bit
so that "create table tt as select * from t12;" also improves although
this hurts the VirtualTupleTableSlot case.

To alleviate this, I tuned MAX_TUP_LEN_FOR_MULTI_INSERT a bit, lowering it
from 2000 to 1600. With a table with 24 name-typed columns (total size 1536),
I tried both
case1: create table tt as select * from t12;
case2: create table tt as select a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w from t12;

The patch improves performance for both. Note that, of course, this change
(MAX_TUP_LEN_FOR_MULTI_INSERT) does not affect the earlier t11, t13 and t14
results in theory, since their code path is not affected.

kernel caches cleaned up:

baseline(s) patch(s) diff%
case1: 7.65 7.30 4.6%
case2: 7.75 6.80 12.2%

relation files are in cache:

case1: 7.09 6.66 6.1%
case2: 7.49 6.83 8.8%

We do not need to find a larger threshold that merely pushes the case1
improvement close to zero, since on other test environments the threshold
might differ a bit; it should be treated as a rough value, and 1600 seems to
benefit most cases.

I attached the v3 patch, which contains just the MAX_TUP_LEN_FOR_MULTI_INSERT
change.

Thanks.


3. for t13, new code still uses single insert so the difference should be
small. I just want to see the regression when even we use "single insert".

4. For toast case t14, the degradation is small, not a big deal.

By the way, did we try or think about allow better prefetch (on Linux) for
seqscan. i.e. POSIX_FADV_SEQUENTIAL in posix_fadvise() to enlarge the
kernel readahead window. Suppose this should help if seq tuple handling is
faster than default kernel readahead setting.

v2 patch is attached.

On Thu, Mar 7, 2019 at 4:54 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 06/03/2019 22:06, Paul Guo wrote:

The patch also modifies heap_multi_insert() a bit to do a bit further
code-level optimization by using static memory, instead of using memory
context and dynamic allocation.

If toasting is required, heap_prepare_insert() creates a palloc'd tuple.
That is still leaked to the current memory context.

Leaking into the current memory context is not a bad thing, because
resetting a memory context is faster than doing a lot of pfree() calls.
The callers just need to be prepared for that, and use a short-lived
memory context.

By the way, while looking at the code, I noticed that there are 9 local
arrays with large length in toast_insert_or_update() which seems to be a
risk of stack overflow. Maybe we should put it as static or global.

Hmm. We currently reserve 512 kB between the kernel's limit, and the
limit we check in check_stack_depth(). See STACK_DEPTH_SLOP. Those
arrays add up to 52800 bytes on a 64-bit machine, if I did my math
right. So there's still a lot of headroom. I agree that it nevertheless
seems a bit excessive, though.

With the patch,

Time: 4728.142 ms (00:04.728)
Time: 14203.983 ms (00:14.204)
Time: 1008.669 ms (00:01.009)

Baseline,
Time: 11096.146 ms (00:11.096)
Time: 13106.741 ms (00:13.107)
Time: 1100.174 ms (00:01.100)

Nice speedup!

While for toast and large column size there is < 10% decrease but for
small column size the improvement is super good. Actually if I hardcode
the batch count as 4 all test cases are better but the improvement for
small column size is smaller than that with current patch. Pretty much
the number 4 is quite case specific so I can not hardcode that in the
patch. Of course we could further tune that but the current value seems
to be a good trade-off?

Have you done any profiling, on why the multi-insert is slower with
large tuples? In principle, I don't see why it should be slower.

- Heikki

Attachments:

v3-0001-Heap-batch-insert-for-CTAS-MatView.patch (application/octet-stream)
From 21b223f0aaa93e6fdf5644e3e6c1a7d7d0269fa9 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Thu, 28 Feb 2019 15:43:34 +0800
Subject: [PATCH v3] Heap batch insert for CTAS/MatView.

---
 src/backend/access/heap/heapam.c |   6 +-
 src/backend/commands/copy.c      |  24 ++---
 src/backend/commands/createas.c  | 153 ++++++++++++++++++++++++++++++-
 src/include/access/heapam.h      |  11 +++
 4 files changed, 169 insertions(+), 25 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 8ac0f8a513..5f5ed06e2d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2111,7 +2111,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
 	PGAlignedBlock scratch;
@@ -2120,6 +2119,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+	/* Declared static so that this memory is not on the stack. */
+	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
+
+	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
 
 	/* currently not needed (thus unsupported) for heap_multi_insert() */
 	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2129,7 +2132,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the slots */
-	heaptuples = palloc(ntuples * sizeof(HeapTuple));
 	for (i = 0; i < ntuples; i++)
 	{
 		HeapTuple	tuple;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 84c54fbc70..5e0e929034 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -234,18 +234,6 @@ typedef struct
 	uint64		processed;		/* # of tuples processed */
 } DR_copy;
 
-
-/*
- * No more than this many tuples per CopyMultiInsertBuffer
- *
- * Caution: Don't make this too big, as we could end up with this many
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
- * multiInsertBuffers list.  Increasing this can cause quadratic growth in
- * memory requirements during copies into partitioned tables with a large
- * number of partitions.
- */
-#define MAX_BUFFERED_TUPLES		1000
-
 /*
  * Flush buffers if there are >= this many bytes, as counted by the input
  * size, of tuples stored.
@@ -258,11 +246,11 @@ typedef struct
 /* Stores multi-insert data related to a single relation in CopyFrom. */
 typedef struct CopyMultiInsertBuffer
 {
-	TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
+	TupleTableSlot *slots[MAX_MULTI_INSERT_TUPLES]; /* Array to store tuples */
 	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for 'relid' */
 	BulkInsertState bistate;	/* BulkInsertState for this rel */
 	int			nused;			/* number of 'slots' containing tuples */
-	uint64		linenos[MAX_BUFFERED_TUPLES];	/* Line # of tuple in copy
+	uint64		linenos[MAX_MULTI_INSERT_TUPLES];	/* Line # of tuple in copy
 												 * stream */
 } CopyMultiInsertBuffer;
 
@@ -2352,7 +2340,7 @@ CopyMultiInsertBufferInit(ResultRelInfo *rri)
 	CopyMultiInsertBuffer *buffer;
 
 	buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer));
-	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);
+	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
 	buffer->resultRelInfo = rri;
 	buffer->bistate = GetBulkInsertState();
 	buffer->nused = 0;
@@ -2411,7 +2399,7 @@ CopyMultiInsertInfoInit(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
 static inline bool
 CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
 {
-	if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES ||
+	if (miinfo->bufferedTuples >= MAX_MULTI_INSERT_TUPLES ||
 		miinfo->bufferedBytes >= MAX_BUFFERED_BYTES)
 		return true;
 	return false;
@@ -2531,7 +2519,7 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertBuffer *buffer)
 	FreeBulkInsertState(buffer->bistate);
 
 	/* Since we only create slots on demand, just drop the non-null ones. */
-	for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && buffer->slots[i] != NULL; i++)
 		ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
 	pfree(buffer);
@@ -2617,7 +2605,7 @@ CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
 	int			nused = buffer->nused;
 
 	Assert(buffer != NULL);
-	Assert(nused < MAX_BUFFERED_TUPLES);
+	Assert(nused < MAX_MULTI_INSERT_TUPLES);
 
 	if (buffer->slots[nused] == NULL)
 		buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL);
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index 4c1d909d38..66aa051c3a 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,6 +62,15 @@ typedef struct
 	CommandId	output_cid;		/* cmin to insert in output tuples */
 	int			ti_options;		/* table_tuple_insert performance options */
 	BulkInsertState bistate;	/* bulk insert state */
+	MemoryContext	mi_context;	/* Memory context for multi insert */
+	int				tup_len;	/* accurate or average tuple length. */
+	/* Below are buffered slots and related information. */
+	TupleTableSlot	*buffered_slots[MAX_MULTI_INSERT_TUPLES];
+	int				buffered_slots_num;	/* How many buffered slots for multi insert */
+	int				buffered_slots_size;	/* Total tuple size for multi insert */
+	/* Below are variables for sampling (to calculate avg. tup_len if needed). */
+	int				sampled_tuples_num; /* -1 means no sampling is needed. */
+	uint64			sampled_tuples_size; /* Total tuple size of samples. */
 } DR_intorel;
 
 /* utility functions for CTAS definition creation */
@@ -441,6 +450,8 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	RangeTblEntry *rte;
 	ListCell   *lc;
 	int			attnum;
+	int			tup_len;
+	bool		use_sampling;
 
 	Assert(into != NULL);		/* else somebody forgot to set it */
 
@@ -456,12 +467,22 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	 */
 	attrList = NIL;
 	lc = list_head(into->colNames);
+	tup_len = 0;
+	use_sampling = false;
 	for (attnum = 0; attnum < typeinfo->natts; attnum++)
 	{
 		Form_pg_attribute attribute = TupleDescAttr(typeinfo, attnum);
 		ColumnDef  *col;
 		char	   *colname;
 
+		if (attribute->attlen > 0)
+		{
+			if (!use_sampling)
+				tup_len += attribute->attlen;
+		}
+		else
+			use_sampling = true; /* Update tup_len via sampling. */
+
 		if (lc)
 		{
 			colname = strVal(lfirst(lc));
@@ -561,11 +582,59 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->ti_options = TABLE_INSERT_SKIP_FSM |
 		(XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
 	myState->bistate = GetBulkInsertState();
+	memset(myState->buffered_slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
+	myState->buffered_slots_num = 0;
+	myState->buffered_slots_size = 0;
+	myState->tup_len = use_sampling ? 0 : tup_len;
+	myState->sampled_tuples_num = use_sampling ? 0 : -1;
+	myState->sampled_tuples_size = 0;
+
+	/*
+	 * Create a temporary memory context so that we can reset once per
+	 * multi insert.
+	 */
+	myState->mi_context = AllocSetContextCreate(CurrentMemoryContext,
+												"intorel_multi_insert",
+												ALLOCSET_DEFAULT_SIZES);
 
 	/* Not using WAL requires smgr_targblock be initially invalid */
 	Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
+/*
+ * If the tuple length, which is obtained either through sampling on tuples with
+ * variable length attribute(s), or through calculating for tuples with
+ * accurate length attributes, is larger than or equal to this value, we do
+ * not use multi insert since memory copy overhead could decrease the
+ * benefit of multi insert.
+ */
+#define MAX_TUP_LEN_FOR_MULTI_INSERT	1600
+
+/* How many initial tuples are sampled to calculate the average tuple length? */
+#define MAX_MULTI_INSERT_SAMPLES		1000
+
+static void
+intorel_flush_multi_insert(DR_intorel *myState)
+{
+	MemoryContext oldcontext;
+	int i;
+
+	oldcontext = MemoryContextSwitchTo(myState->mi_context);
+
+	table_multi_insert(myState->rel, myState->buffered_slots,
+					   myState->buffered_slots_num, myState->output_cid,
+					   myState->ti_options, myState->bistate);
+
+	MemoryContextReset(myState->mi_context);
+	MemoryContextSwitchTo(oldcontext);
+
+	for (i = 0; i < myState->buffered_slots_num; i++)
+		ExecClearTuple(myState->buffered_slots[i]);
+
+	myState->buffered_slots_num = 0;
+	myState->buffered_slots_size = 0;
+}
+
 /*
  * intorel_receive --- receive one tuple
  */
@@ -573,6 +642,8 @@ static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	TupleTableSlot *batchslot;
+	HeapTuple tuple;
 
 	/*
 	 * Note that the input slot might not be of the type of the target
@@ -583,11 +654,72 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 	 * tuple's xmin), but since we don't do that here...
 	 */
 
-	table_tuple_insert(myState->rel,
-					   slot,
-					   myState->output_cid,
-					   myState->ti_options,
-					   myState->bistate);
+	/*
+	 * If the accurate/average tuple length is large, do single insert.
+	 * We do not call ExecFetchSlotHeapTuple() for the input slot to get
+	 * accurate tuple length here since sometimes it is wasteful to call
+	 * it again in table_tuple_insert(), e.g. VirtualTupleTableSlot
+	 */
+	if (myState->tup_len >= MAX_TUP_LEN_FOR_MULTI_INSERT)
+	{
+		table_tuple_insert(myState->rel,
+						   slot,
+						   myState->output_cid,
+						   myState->ti_options,
+						   myState->bistate);
+		return true;
+	}
+
+	/* Copy the slot to batchslot lists and materialize them. */
+	if (myState->buffered_slots[myState->buffered_slots_num] == NULL)
+	{
+		batchslot = table_slot_create(myState->rel, NULL);
+		myState->buffered_slots[myState->buffered_slots_num] = batchslot;
+	}
+	else
+		batchslot = myState->buffered_slots[myState->buffered_slots_num];
+
+	ExecCopySlot(batchslot, slot);
+	/*
+	 * In theory we do not need to materialize here, but if both input slot and
+	 * dst slot are BufferHeapTupleTableSlot, there might be hot code in
+	 * ResourceOwnerForgetBuffer() and ResourceOwnerRememberBuffer()
+	 * since we do them in batch. We could easily work around this by doing
+	 * materialize in advance. This is harmless since later when calling
+	 * table_multi_insert(), we need materialize also.
+	 */
+	ExecMaterializeSlot(batchslot);
+	myState->buffered_slots_num++;
+
+	if (myState->sampled_tuples_num < 0 ||
+		myState->sampled_tuples_num == MAX_MULTI_INSERT_SAMPLES)
+		myState->buffered_slots_size += myState->tup_len;
+	else
+	{
+		/*
+		 * Sampling to get the rough average tuple length for later use.
+		 * We do not use plan width since that is inaccurate sometimes.
+		 */
+		tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+		myState->buffered_slots_size += tuple->t_len;
+		myState->sampled_tuples_size += tuple->t_len;
+		myState->sampled_tuples_num++;
+
+		/*
+		 * Just finished sampling. Let's update myState->tup_len and
+		 * flush the tuples since in next call we possibly do single insert.
+		 */
+		if(myState->sampled_tuples_num == MAX_MULTI_INSERT_SAMPLES)
+		{
+			myState->tup_len = myState->sampled_tuples_size / myState->sampled_tuples_num;
+			intorel_flush_multi_insert(myState);
+		}
+	}
+
+	if (myState->buffered_slots_num == MAX_MULTI_INSERT_TUPLES ||
+		myState->buffered_slots_size >= 65535)
+		intorel_flush_multi_insert(myState);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
@@ -601,11 +733,22 @@ static void
 intorel_shutdown(DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	int i;
+
+	if (myState->buffered_slots_num != 0)
+		intorel_flush_multi_insert(myState);
+
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && myState->buffered_slots[i] != NULL; i++)
+		ExecDropSingleTupleTableSlot(myState->buffered_slots[i]);
 
 	FreeBulkInsertState(myState->bistate);
 
 	table_finish_bulk_insert(myState->rel, myState->ti_options);
 
+	if (myState->mi_context)
+		MemoryContextDelete(myState->mi_context);
+	myState->mi_context = NULL;
+
 	/* close rel, but keep lock until commit */
 	table_close(myState->rel, NoLock);
 	myState->rel = NULL;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index dffb57bf11..e90c6a3fc6 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -40,6 +40,17 @@ struct TupleTableSlot;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
+/*
+ * No more than this many tuples per MultiInsertBuffer
+ *
+ * Caution: Don't make this too big. For COPY, we could end up with this many
+ * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
+ * multiInsertBuffers list.  Increasing this can cause quadratic growth in
+ * memory requirements during copies into partitioned tables with a large
+ * number of partitions. For CTAS/MatView, the impact is similar.
+ */
+#define MAX_MULTI_INSERT_TUPLES	1000
+
 /*
  * Descriptor for heap table scans.
  */
-- 
2.17.2

#9Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Paul Guo (#7)
Re: Batch insert in CTAS/MatView code

On 17/06/2019 15:53, Paul Guo wrote:

I noticed that to do batch insert, we might sometimes need an additional
memory copy compared with "single insert" (that should be the reason we
previously saw slight regressions), so a good solution seems to be to fall
back to "single insert" if the tuple length is larger than a threshold. I
set this to 2000 after quick testing.

Where does the additional memory copy come from? Can we avoid doing it
in the multi-insert case?

- Heikki

#10Paul Guo
pguo@pivotal.io
In reply to: Heikki Linnakangas (#9)
1 attachment(s)
Re: Batch insert in CTAS/MatView code

On Fri, Aug 2, 2019 at 2:55 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:

On 17/06/2019 15:53, Paul Guo wrote:

I noticed that to do batch insert, we might sometimes need an additional
memory copy compared with "single insert" (that should be the reason we
previously saw slight regressions), so a good solution seems to be to fall
back to "single insert" if the tuple length is larger than a threshold. I
set this to 2000 after quick testing.

Where does the additional memory copy come from? Can we avoid doing it
in the multi-insert case?

Hi Heikki,

Sorry for the late reply. I spent some time carefully looking at and
debugging the TupleTableSlotOps code of the various TupleTableSlot types,
especially the BufferHeapTupleTableSlot case, where we seemed to see a
regression when no threshold was set, and also debugged and tested more of
the CTAS case. I found that my earlier wording "additional memory copy"
(mainly tuple content copy compared with single insert) is wrong based on
the latest code (it was probably wrong with the previous code as well). So
in theory we should not worry about additional tuple copy overhead now, and
I therefore tried the patch without setting a multi-insert threshold, as
attached.

To make the test results more stable, this time I ran a simple 'select
count(*) from tbl' before each CTAS to warm up the shared buffers, ran a
checkpoint before each CTAS, disabled autovacuum by setting 'autovacuum =
off', and set larger shared buffers (but < 25% of total memory, as
recommended by the PG docs) so that the CTAS reads all hit shared buffers
when warm buffers exist (double-checked via explain(analyze, buffers)).
These seem to be reasonable settings for performance testing. Each kind of
CTAS test was run three times (note that before each run we do the warm-up
and checkpoint as mentioned).

I mainly tested the t12 case (a normal table with tuple size ~ 2k), since
for the other cases our patch performs either better or similarly.

Patch: 1st_run 2nd_run 3rd_run

t12_BufferHeapTuple 7883.400 7549.966 8090.080
t12_Virtual 8041.637 8191.317 8182.404

Baseline: 1st_run 2nd_run 3rd_run

t12_BufferHeapTuple: 8264.290 7508.410 7681.702
t12_Virtual 8167.792 7970.537 8106.874

I also roughly tested the other tables we mentioned (t11 and t14) - the
test results and conclusions are the same.
t12_BufferHeapTuple means: create table tt as select * from t12;
t12_Virtual means: create table tt as select *partial columns* from t12;

So it looks like for t12 the results between our code and the baseline are
similar, so not setting a threshold seems to be fine, though the
t12_BufferHeapTuple test results vary a lot (by up to 0.5 seconds) for both
our patch and the baseline, whereas the virtual case is quite stable.

This actually confused me a bit, given we've cached the source table in
shared buffers. I suspected checkpoint effects, so I disabled checkpointing
by setting max_wal_size = 3000 during the CTAS runs; the BufferHeapTuple
case (see below) still varies somewhat. I'm not sure what the reason is,
but this does not seem to be a blocker for the patch.
Patch: 1st_run 2nd_run 3rd_run
t12_BufferHeapTuple 7717.304 7413.259 7452.773
t12_Virtual 7445.742 7483.148 7593.583

Baseline: 1st_run 2nd_run 3rd_run
t12_BufferHeapTuple 8186.302 7736.541 7759.056
t12_Virtual 8004.880 8096.712 7961.483

Attachments:

v4-0001-Multi-insert-in-Create-Table-As.patchapplication/octet-stream; name=v4-0001-Multi-insert-in-Create-Table-As.patchDownload
From 714b2a5f70abb85cd45131f0b930c91f0c67ccf7 Mon Sep 17 00:00:00 2001
From: Paul Guo <paulguo@gmail.com>
Date: Thu, 28 Feb 2019 15:43:34 +0800
Subject: [PATCH v4] Multi insert in Create Table As.

This could improve the performance and also benefits 'create materialized view'
since that uses the code of create table as.
---
 src/backend/access/heap/heapam.c |  6 ++-
 src/backend/commands/copy.c      | 24 +++-------
 src/backend/commands/createas.c  | 81 ++++++++++++++++++++++++++------
 src/include/access/heapam.h      | 11 +++++
 4 files changed, 88 insertions(+), 34 deletions(-)

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e9544822bf..8a844b3b5f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
 	PGAlignedBlock scratch;
@@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+	/* Declare it as static to let this memory be not on stack. */
+	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
+
+	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);
 
 	/* currently not needed (thus unsupported) for heap_multi_insert() */
 	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 												   HEAP_DEFAULT_FILLFACTOR);
 
 	/* Toast and set header data in all the slots */
-	heaptuples = palloc(ntuples * sizeof(HeapTuple));
 	for (i = 0; i < ntuples; i++)
 	{
 		HeapTuple	tuple;
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 3aeef30b28..893c2031c2 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -233,18 +233,6 @@ typedef struct
 	uint64		processed;		/* # of tuples processed */
 } DR_copy;
 
-
-/*
- * No more than this many tuples per CopyMultiInsertBuffer
- *
- * Caution: Don't make this too big, as we could end up with this many
- * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
- * multiInsertBuffers list.  Increasing this can cause quadratic growth in
- * memory requirements during copies into partitioned tables with a large
- * number of partitions.
- */
-#define MAX_BUFFERED_TUPLES		1000
-
 /*
  * Flush buffers if there are >= this many bytes, as counted by the input
  * size, of tuples stored.
@@ -257,11 +245,11 @@ typedef struct
 /* Stores multi-insert data related to a single relation in CopyFrom. */
 typedef struct CopyMultiInsertBuffer
 {
-	TupleTableSlot *slots[MAX_BUFFERED_TUPLES]; /* Array to store tuples */
+	TupleTableSlot *slots[MAX_MULTI_INSERT_TUPLES]; /* Array to store tuples */
 	ResultRelInfo *resultRelInfo;	/* ResultRelInfo for 'relid' */
 	BulkInsertState bistate;	/* BulkInsertState for this rel */
 	int			nused;			/* number of 'slots' containing tuples */
-	uint64		linenos[MAX_BUFFERED_TUPLES];	/* Line # of tuple in copy
+	uint64		linenos[MAX_MULTI_INSERT_TUPLES];	/* Line # of tuple in copy
 												 * stream */
 } CopyMultiInsertBuffer;
 
@@ -2351,7 +2339,7 @@ CopyMultiInsertBufferInit(ResultRelInfo *rri)
 	CopyMultiInsertBuffer *buffer;
 
 	buffer = (CopyMultiInsertBuffer *) palloc(sizeof(CopyMultiInsertBuffer));
-	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_BUFFERED_TUPLES);
+	memset(buffer->slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
 	buffer->resultRelInfo = rri;
 	buffer->bistate = GetBulkInsertState();
 	buffer->nused = 0;
@@ -2410,7 +2398,7 @@ CopyMultiInsertInfoInit(CopyMultiInsertInfo *miinfo, ResultRelInfo *rri,
 static inline bool
 CopyMultiInsertInfoIsFull(CopyMultiInsertInfo *miinfo)
 {
-	if (miinfo->bufferedTuples >= MAX_BUFFERED_TUPLES ||
+	if (miinfo->bufferedTuples >= MAX_MULTI_INSERT_TUPLES ||
 		miinfo->bufferedBytes >= MAX_BUFFERED_BYTES)
 		return true;
 	return false;
@@ -2531,7 +2519,7 @@ CopyMultiInsertBufferCleanup(CopyMultiInsertInfo *miinfo,
 	FreeBulkInsertState(buffer->bistate);
 
 	/* Since we only create slots on demand, just drop the non-null ones. */
-	for (i = 0; i < MAX_BUFFERED_TUPLES && buffer->slots[i] != NULL; i++)
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && buffer->slots[i] != NULL; i++)
 		ExecDropSingleTupleTableSlot(buffer->slots[i]);
 
 	table_finish_bulk_insert(buffer->resultRelInfo->ri_RelationDesc,
@@ -2620,7 +2608,7 @@ CopyMultiInsertInfoNextFreeSlot(CopyMultiInsertInfo *miinfo,
 	int			nused = buffer->nused;
 
 	Assert(buffer != NULL);
-	Assert(nused < MAX_BUFFERED_TUPLES);
+	Assert(nused < MAX_MULTI_INSERT_TUPLES);
 
 	if (buffer->slots[nused] == NULL)
 		buffer->slots[nused] = table_slot_create(rri->ri_RelationDesc, NULL);
diff --git a/src/backend/commands/createas.c b/src/backend/commands/createas.c
index b7d220699f..ae3aa383f5 100644
--- a/src/backend/commands/createas.c
+++ b/src/backend/commands/createas.c
@@ -62,6 +62,10 @@ typedef struct
 	CommandId	output_cid;		/* cmin to insert in output tuples */
 	int			ti_options;		/* table_tuple_insert performance options */
 	BulkInsertState bistate;	/* bulk insert state */
+	MemoryContext	mi_context;	/* A temporary memory context for multi insert */
+	TupleTableSlot *mi_slots[MAX_MULTI_INSERT_TUPLES]; /* buffered slots for a multi insert batch. */
+	int				mi_slots_num;	/* How many buffered slots for a multi insert batch. */
+	int				mi_slots_size;	/* Total tuple size for a multi insert batch. */
 } DR_intorel;
 
 /* utility functions for CTAS definition creation */
@@ -561,33 +565,71 @@ intorel_startup(DestReceiver *self, int operation, TupleDesc typeinfo)
 	myState->ti_options = TABLE_INSERT_SKIP_FSM |
 		(XLogIsNeeded() ? 0 : TABLE_INSERT_SKIP_WAL);
 	myState->bistate = GetBulkInsertState();
+	memset(myState->mi_slots, 0, sizeof(TupleTableSlot *) * MAX_MULTI_INSERT_TUPLES);
+	myState->mi_slots_num = 0;
+	myState->mi_slots_size = 0;
+
+	/*
+	 * Create a temporary memory context so that we can reset once per
+	 * multi insert batch.
+	 */
+	myState->mi_context = AllocSetContextCreate(CurrentMemoryContext,
+												"intorel_multi_insert",
+												ALLOCSET_DEFAULT_SIZES);
 
 	/* Not using WAL requires smgr_targblock be initially invalid */
 	Assert(RelationGetTargetBlock(intoRelationDesc) == InvalidBlockNumber);
 }
 
+static void
+intorel_flush_multi_insert(DR_intorel *myState)
+{
+	MemoryContext oldcontext;
+	int			  i;
+
+	oldcontext = MemoryContextSwitchTo(myState->mi_context);
+
+	table_multi_insert(myState->rel, myState->mi_slots,
+					   myState->mi_slots_num, myState->output_cid,
+					   myState->ti_options, myState->bistate);
+
+	MemoryContextReset(myState->mi_context);
+	MemoryContextSwitchTo(oldcontext);
+
+	for (i = 0; i < myState->mi_slots_num; i++)
+		ExecClearTuple(myState->mi_slots[i]);
+
+	myState->mi_slots_num = 0;
+	myState->mi_slots_size = 0;
+}
+
 /*
  * intorel_receive --- receive one tuple
  */
 static bool
 intorel_receive(TupleTableSlot *slot, DestReceiver *self)
 {
-	DR_intorel *myState = (DR_intorel *) self;
+	DR_intorel		*myState = (DR_intorel *) self;
+	TupleTableSlot  *batchslot;
+	HeapTuple		 tuple;
 
-	/*
-	 * Note that the input slot might not be of the type of the target
-	 * relation. That's supported by table_tuple_insert(), but slightly less
-	 * efficient than inserting with the right slot - but the alternative
-	 * would be to copy into a slot of the right type, which would not be
-	 * cheap either. This also doesn't allow accessing per-AM data (say a
-	 * tuple's xmin), but since we don't do that here...
-	 */
+	if (myState->mi_slots[myState->mi_slots_num] == NULL)
+	{
+		batchslot = table_slot_create(myState->rel, NULL);
+		myState->mi_slots[myState->mi_slots_num] = batchslot;
+	}
+	else
+		batchslot = myState->mi_slots[myState->mi_slots_num];
 
-	table_tuple_insert(myState->rel,
-					   slot,
-					   myState->output_cid,
-					   myState->ti_options,
-					   myState->bistate);
+	ExecCopySlot(batchslot, slot);
+	tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+	myState->mi_slots_num++;
+	myState->mi_slots_size += tuple->t_len;
+
+	if (myState->mi_slots_num >= MAX_MULTI_INSERT_TUPLES ||
+		myState->mi_slots_size >= 65535)
+		intorel_flush_multi_insert(myState);
 
 	/* We know this is a newly created relation, so there are no indexes */
 
@@ -601,11 +643,22 @@ static void
 intorel_shutdown(DestReceiver *self)
 {
 	DR_intorel *myState = (DR_intorel *) self;
+	int			i;
+
+	if (myState->mi_slots_num != 0)
+		intorel_flush_multi_insert(myState);
+
+	for (i = 0; i < MAX_MULTI_INSERT_TUPLES && myState->mi_slots[i] != NULL; i++)
+		ExecDropSingleTupleTableSlot(myState->mi_slots[i]);
 
 	FreeBulkInsertState(myState->bistate);
 
 	table_finish_bulk_insert(myState->rel, myState->ti_options);
 
+	if (myState->mi_context)
+		MemoryContextDelete(myState->mi_context);
+	myState->mi_context = NULL;
+
 	/* close rel, but keep lock until commit */
 	table_close(myState->rel, NoLock);
 	myState->rel = NULL;
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 858bcb6bc9..b7c8bd42e7 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -40,6 +40,17 @@ struct TupleTableSlot;
 
 #define MaxLockTupleMode	LockTupleExclusive
 
+/*
+ * No more than this many tuples per MultiInsertBuffer
+ *
+ * Caution: Don't make this too big. For COPY, we could end up with this many
+ * CopyMultiInsertBuffer items stored in CopyMultiInsertInfo's
+ * multiInsertBuffers list.  Increasing this can cause quadratic growth in
+ * memory requirements during copies into partitioned tables with a large
+ * number of partitions. For CTAS/MatView, the impact is similar.
+ */
+#define MAX_MULTI_INSERT_TUPLES	1000
+
 /*
  * Descriptor for heap table scans.
  */
-- 
2.17.2

#11Asim R P
apraveen@pivotal.io
In reply to: Paul Guo (#10)
Re: Batch insert in CTAS/MatView code

On Mon, Sep 9, 2019 at 4:02 PM Paul Guo <pguo@pivotal.io> wrote:

So in theory
we should not worry about additional tuple copy overhead now, and then I

tried the patch without setting

multi-insert threshold as attached.

I reviewed your patch today. It looks good overall. My concern is that
the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
place such as createas.c, we should be using generic tableam API only.
However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?
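
To make that concrete, here is a minimal sketch of what a count-only
variant of intorel_receive() could look like, reusing the
mi_slots/mi_slots_num/MAX_MULTI_INSERT_TUPLES names from the posted patch
(just an illustration, not code from any submitted patch):

	/* Buffer the incoming tuple into a slot of the target relation's type. */
	batchslot = myState->mi_slots[myState->mi_slots_num];
	if (batchslot == NULL)
	{
		batchslot = table_slot_create(myState->rel, NULL);
		myState->mi_slots[myState->mi_slots_num] = batchslot;
	}
	ExecCopySlot(batchslot, slot);

	/* Flush purely on tuple count; no per-tuple length computation needed. */
	if (++myState->mi_slots_num >= MAX_MULTI_INSERT_TUPLES)
		intorel_flush_multi_insert(myState);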

The multi insert copy code deals with index tuples also, which I don't see
in the patch. Don't we need to consider populating indexes?

Asim

#12Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Asim R P (#11)
Re: Batch insert in CTAS/MatView code

On 2019-Sep-25, Asim R P wrote:

I reviewed your patch today. It looks good overall. My concern is that
the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
place such as createas.c, we should be using generic tableam API only.
However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple? That would make this easy. (I'm not
sure however what to do about TOAST considerations -- is it size in
memory that we're worried about?)
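
Just to illustrate, a rough sketch of what such an operation might look
like; the tuple_size callback, the slot_tuple_size() wrapper and the heap
implementation below are all invented here, none of this exists in the
tree:

	/*
	 * Assumed new member in TupleTableSlotOps:
	 *     Size (*tuple_size) (TupleTableSlot *slot);
	 */

	/* Hypothetical dispatcher, analogous to the other slot helpers. */
	static inline Size
	slot_tuple_size(TupleTableSlot *slot)
	{
		return slot->tts_ops->tuple_size(slot);
	}

	/* A possible implementation for heap tuple slots, where t_len is at hand. */
	static Size
	tts_heap_tuple_size(TupleTableSlot *slot)
	{
		HeapTupleTableSlot *hslot = (HeapTupleTableSlot *) slot;

		return hslot->tuple != NULL ? hslot->tuple->t_len : 0;
	}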

Also:

+ myState->mi_slots_size >= 65535)

This magic number should be placed in a define next to the other one,
but I'm not sure that heapam.h is a good location, since surely this
applies to matviews in other table AMs too.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13Paul Guo
pguo@pivotal.io
In reply to: Asim R P (#11)
Re: Batch insert in CTAS/MatView code

Asim, thanks for the review.

On Wed, Sep 25, 2019 at 6:39 PM Asim R P <apraveen@pivotal.io> wrote:

On Mon, Sep 9, 2019 at 4:02 PM Paul Guo <pguo@pivotal.io> wrote:

So in theory
we should not worry about additional tuple copy overhead now, and then I

tried the patch without setting

multi-insert threshold as attached.

I reviewed your patch today. It looks good overall. My concern is that
the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
place such as createas.c, we should be using generic tableam API only.
However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

For heapam, ExecFetchSlotHeapTuple() will be called again in
heap_multi_insert() to prepare the final multi-insert. If we look at
ExecFetchSlotHeapTuple(), we can see that calling it multiple times adds
very little overhead in the BufferHeapTuple case. Note that for the virtual
tuple case the second ExecFetchSlotHeapTuple() call still copies the slot
contents, but we have already called ExecCopySlot(batchslot, slot) to copy
into a BufferHeap slot, so there is no worry for the virtual tuple case (as
a source).

Previously (long ago) I probably understood the code incorrectly, so I had
that concern as well. I used sampling to handle it (for variable-length
tuples), but apparently we do not need that now.

The multi insert copy code deals with index tuples also, which I don't see
in the patch. Don't we need to consider populating indexes?

The create table as / create mat view DDL does not involve index creation
for the table/matview. The code seems usable in RefreshMatView as well; we
would need to consider that if we use multi-insert in that code.

#14Paul Guo
pguo@pivotal.io
In reply to: Alvaro Herrera (#12)
Re: Batch insert in CTAS/MatView code

On Thu, Sep 26, 2019 at 9:43 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2019-Sep-25, Asim R P wrote:

I reviewed your patch today. It looks good overall. My concern is that
the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
place such as createas.c, we should be using generic tableam API only.
However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple? That would make this easy. (I'm not
sure however what to do about TOAST considerations -- is it size in
memory that we're worried about?)

Also:

+ myState->mi_slots_size >= 65535)

This magic number should be placed in a define next to the other one,
but I'm not sure that heapam.h is a good location, since surely this
applies to matviews in other table AMs too.

Yes, putting 65535 in a define seems better. Let's fix this later once we
have more feedback. Thanks.

#15Asim R P
apraveen@pivotal.io
In reply to: Alvaro Herrera (#12)
Re: Batch insert in CTAS/MatView code

On Thu, Sep 26, 2019 at 7:13 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2019-Sep-25, Asim R P wrote:

I reviewed your patch today. It looks good overall. My concern is that
the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
place such as createas.c, we should be using generic tableam API only.
However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple? That would make this easy. (I'm not
sure however what to do about TOAST considerations -- is it size in
memory that we're worried about?)

That will help. For slots containing heap tuples, heap_compute_data_size()
is what we need. Approximate size is better than nothing.
In case of CTAS, we are dealing with slots returned by a scan node.
Wouldn't TOAST datums already be expanded in those slots?
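
As an illustration, a small sketch of how that could be wrapped;
heap_compute_data_size() and slot_getallattrs() exist today, but the helper
itself is made up here and only measures the data area, ignoring header and
alignment overhead:

	/* Hypothetical helper: rough data size of the tuple currently in a slot. */
	static Size
	slot_compute_data_size(TupleTableSlot *slot)
	{
		/* Make sure tts_values/tts_isnull are populated before measuring. */
		slot_getallattrs(slot);

		return heap_compute_data_size(slot->tts_tupleDescriptor,
									  slot->tts_values,
									  slot->tts_isnull);
	}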

Asim

#16Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#12)
Re: Batch insert in CTAS/MatView code

Hi,

On 2019-09-26 10:43:27 -0300, Alvaro Herrera wrote:

On 2019-Sep-25, Asim R P wrote:

I reviewed your patch today. It looks good overall. My concern is that
the ExecFetchSlotHeapTuple call does not seem appropriate. In a generic
place such as createas.c, we should be using generic tableam API
only.

Indeed.

However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple?

Hm, I'm not convinced that it's worth adding that as a dedicated
operation. It's not that clear what it'd exactly mean anyway - what
would it measure? As referenced in the slot? As if it were stored on
disk? etc?

I wonder if the right answer wouldn't be to just measure the size of a
memory context containing the batch slots, or something like that.

That would make this easy. (I'm not sure however what to do about
TOAST considerations -- is it size in memory that we're worried
about?)

The in-memory size is probably fine, because in all likelihood the
toasted cols are just going to point to on-disk datums, no?

Also:

+ myState->mi_slots_size >= 65535)

This magic number should be placed in a define next to the other one,
but I'm not sure that heapam.h is a good location, since surely this
applies to matviews in other table AMs too.

Right. I think it'd be better to move this into an AM independent place.

Greetings,

Andres Freund

#17Andres Freund
andres@anarazel.de
In reply to: Paul Guo (#10)
Re: Batch insert in CTAS/MatView code

Hi,

On 2019-09-09 18:31:54 +0800, Paul Guo wrote:

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e9544822bf..8a844b3b5f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
CommandId cid, int options, BulkInsertState bistate)
{
TransactionId xid = GetCurrentTransactionId();
-	HeapTuple  *heaptuples;
int			i;
int			ndone;
PGAlignedBlock scratch;
@@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
Size		saveFreeSpace;
bool		need_tuple_data = RelationIsLogicallyLogged(relation);
bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+	/* Declare it as static to let this memory be not on stack. */
+	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
+
+	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);

/* currently not needed (thus unsupported) for heap_multi_insert() */
AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
HEAP_DEFAULT_FILLFACTOR);

/* Toast and set header data in all the slots */
- heaptuples = palloc(ntuples * sizeof(HeapTuple));
for (i = 0; i < ntuples; i++)
{
HeapTuple tuple;

I don't think this is a good idea. We shouldn't unnecessarily allocate
8KB on the stack. Is there any actual evidence this is a performance
benefit? To me this just seems like it'll reduce the flexibility of the
API, without any benefit. I'll also note that you've apparently not
updated tableam.h to document this new restriction.

Greetings,

Andres Freund

#18Paul Guo
pguo@pivotal.io
In reply to: Andres Freund (#17)
Re: Batch insert in CTAS/MatView code

On Sat, Sep 28, 2019 at 5:49 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-09-09 18:31:54 +0800, Paul Guo wrote:

diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e9544822bf..8a844b3b5f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2106,7 +2106,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 				  CommandId cid, int options, BulkInsertState bistate)
 {
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple  *heaptuples;
 	int			i;
 	int			ndone;
 	PGAlignedBlock scratch;
@@ -2115,6 +2114,10 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 	Size		saveFreeSpace;
 	bool		need_tuple_data = RelationIsLogicallyLogged(relation);
 	bool		need_cids = RelationIsAccessibleInLogicalDecoding(relation);
+	/* Declare it as static to let this memory be not on stack. */
+	static HeapTuple	heaptuples[MAX_MULTI_INSERT_TUPLES];
+
+	Assert(ntuples <= MAX_MULTI_INSERT_TUPLES);

 	/* currently not needed (thus unsupported) for heap_multi_insert() */
 	AssertArg(!(options & HEAP_INSERT_NO_LOGICAL));
@@ -2124,7 +2127,6 @@ heap_multi_insert(Relation relation, TupleTableSlot **slots, int ntuples,
 												   HEAP_DEFAULT_FILLFACTOR);

 	/* Toast and set header data in all the slots */
-	heaptuples = palloc(ntuples * sizeof(HeapTuple));
 	for (i = 0; i < ntuples; i++)
 	{
 		HeapTuple	tuple;

I don't think this is a good idea. We shouldn't unnecessarily allocate
8KB on the stack. Is there any actual evidence this is a performance
benefit? To me this just seems like it'll reduce the flexibility of the

Previously heaptuples was palloc-ed for each batch, which in theory should
be slower than pre-allocating and reusing the memory.

API, without any benefit. I'll also note that you've apparently not

updated tableam.h to document this new restriction.

Yes it should be moved from heapam.h to that file along with the 65535
definition.

#19Paul Guo
pguo@pivotal.io
In reply to: Andres Freund (#16)
Re: Batch insert in CTAS/MatView code

However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple?

Hm, I'm not convinced that it's worth adding that as a dedicated
operation. It's not that clear what it'd exactly mean anyway - what
would it measure? As referenced in the slot? As if it were stored on
disk? etc?

I wonder if the right answer wouldn't be to just measure the size of a
memory context containing the batch slots, or something like that.

Probably a better way is to move that logic (appending slots to the slot
array, deciding when to flush, flushing, cleaning up the slots) into
table_multi_insert()? Generally the final implementation of
table_multi_insert() should be able to know the sizes easily. One concern
is that currently only COPY in the repo uses multi insert, so I am not sure
whether other callers in the future will want their own logic (or we could
set up a flag to allow customization, but that seems a bit over-designed?).
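
Roughly I am imagining something along these lines (all names below are
invented for illustration only; this is not an existing or proposed API):

	/* Opaque per-relation insert state, owned and sized by the AM. */
	typedef struct TableInsertState TableInsertState;

	/* The AM buffers tuples internally and flushes whenever it sees fit. */
	extern TableInsertState *table_insert_begin(Relation rel, CommandId cid,
												int options);
	extern void table_buffered_insert(TableInsertState *state,
									  TupleTableSlot *slot);
	extern void table_insert_end(TableInsertState *state);	/* flush the rest */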

#20Andres Freund
andres@anarazel.de
In reply to: Paul Guo (#19)
Re: Batch insert in CTAS/MatView code

Hi,

On 2019-09-30 12:12:31 +0800, Paul Guo wrote:

However, I can also see that there is no better alternative. We need to
compute the size of accumulated tuples so far, in order to decide whether
to stop accumulating tuples. There is no convenient way to obtain the
length of the tuple, given a slot. How about making that decision solely
based on number of tuples, so that we can avoid ExecFetchSlotHeapTuple call
altogether?

... maybe we should add a new operation to slots, that returns the
(approximate?) size of a tuple?

Hm, I'm not convinced that it's worth adding that as a dedicated
operation. It's not that clear what it'd exactly mean anyway - what
would it measure? As referenced in the slot? As if it were stored on
disk? etc?

I wonder if the right answer wouldn't be to just measure the size of a
memory context containing the batch slots, or something like that.

Probably a better way is to move that logic (appending slots to the slot
array, deciding when to flush, flushing, cleaning up the slots) into
table_multi_insert()?

That does not strike me as a good idea. The upper layer is going to need
to manage some resources (e.g. it's the only bit that knows about how to
manage lifetime of the incoming data), and by exposing it to each AM
we're going to duplicate the necessary code too.

Generally the final implementation of table_multi_insert() should be able
to know the sizes easily. One concern is that currently only COPY in the
repo uses multi insert, so I am not sure whether other callers in the
future will want their own logic (or we could set up a flag to allow
customization, but that seems a bit over-designed?).

And that is also a concern, it seems unlikely that we'll get the
interface good.

Greetings,

Andres Freund

#21Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#20)
Re: Batch insert in CTAS/MatView code

On Mon, Sep 30, 2019 at 12:38:02AM -0700, Andres Freund wrote:

That does not strike me as a good idea. The upper layer is going to need
to manage some resources (e.g. it's the only bit that knows about how to
manage lifetime of the incoming data), and by exposing it to each AM
we're going to duplicate the necessary code too.

Latest status of the thread maps with a patch which still applies,
still the discussion could go on as more review is needed. So I have
moved it to next CF.
--
Michael

#22Paul Guo
pguo@pivotal.io
In reply to: Michael Paquier (#21)
Re: Batch insert in CTAS/MatView code

I took some time to dig into the issue yesterday; the main concern with the
patch is that it gets the tuple length from ExecFetchSlotHeapTuple().

+   ExecCopySlot(batchslot, slot);
+   tuple = ExecFetchSlotHeapTuple(batchslot, true, NULL);
+
+   myState->mi_slots_num++;
+   myState->mi_slots_size += tuple->t_len;

We definitely should remove ExecFetchSlotHeapTuple() here, but we need to
know the tuple length (at least roughly). One solution might be to use the
memory context stats info as mentioned; the code would look like this.

tup_len = MemoryContextUsedSize(batchslot->tts_mcxt);
ExecCopySlot(batchslot, slot);
ExecMaterializeSlot(batchslot);
tup_len = MemoryContextUsedSize(batchslot->tts_mcxt) - tup_len;

MemoryContextUsedSize() is added to calculate the total used size (a simple
hack on top of the stats interface).

+int
+MemoryContextUsedSize(MemoryContext context)
+{
+   MemoryContextCounters total;
+
+   memset(&total, 0, sizeof(total));
+   context->methods->stats(context, NULL, NULL, &total);
+
+   return total.totalspace - total.freespace;
+}

This basically works, but there are concerns:

The length is not accurate (though we do not need it to be that accurate)
since there are some additional memory allocations. Also, we cannot be sure
the measured size will not be much larger than the real length for some
slot types in the future, and I am not sure whether all slot types will
always allocate at least the tuple length in the memory context after
materialization. Lastly, the code seems a bit ugly.

As a reference, for "create table t1 as select * from t2", the above code
reports a "tuple length" of 88 (the real tuple length is 4).

Another solution might be to return the real size from
ExecMaterializeSlot(), e.g. ExecMaterializeSlot(slot, &tup_len). For this
we probably want to store the length in the slot struct for performance.
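
A very rough sketch of that direction (the tts_tuple_len field and the
function below are assumptions, not existing code):

	/*
	 * Assumed variant of ExecMaterializeSlot() that also reports the tuple
	 * length; tts_tuple_len would be a new slot field filled in by the
	 * materialize callback.
	 */
	static void
	ExecMaterializeSlotWithSize(TupleTableSlot *slot, uint32 *tup_len)
	{
		ExecMaterializeSlot(slot);
		*tup_len = slot->tts_tuple_len;
	}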

For the COPY case the tuple length is known in advance, but I can imagine
more cases which do not know the size yet need it for the multi_insert
interface; at least I'm wondering whether we should use that in 'refresh
matview' and in a fast path for the Insert node (I heard some complaints
about the performance of "insert into tbl from select ..." from some of our
users)? So the concern is not just for the case in this patch.

Besides, my colleagues Ashwin Agrawal and Adam Lee suggested that we could
try raw_heap_insert()-style code for CTAS and compare, since CTAS inserts
into a newly created table. That would involve more discussion, much more
code change, and more testing (stability and performance). So multi insert
seems to be a stable solution in the short term, given that it has been
used in COPY for a long time?

Whatever the solution for CTAS, we need to address the tuple size concern
for the multi insert cases.

#23Daniel Gustafsson
daniel@yesql.se
In reply to: Paul Guo (#22)
Re: Batch insert in CTAS/MatView code

In an off-list discussion with Paul, we decided to withdraw this patch for now
and instead create a new entry when there is a re-worked patch. This has now
been done in the CF app.

cheers ./daniel

#24tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Daniel Gustafsson (#23)
RE: Batch insert in CTAS/MatView code

Hello Paul-san,

From: Daniel Gustafsson <daniel@yesql.se>

In an off-list discussion with Paul, we decided to withdraw this patch for now
and instead create a new entry when there is a re-worked patch. This has
now been done in the CF app.

Would you mind if I take over this patch for PG 15? I find this promising, as Bharath-san demonstrated good performance by combining your patch and the parallel CTAS that Bharath-san has been working on. We'd like to do things that enhance parallelism.

Please allow me to start posting the revised patches next week if I can't get your reply.

Regards
Takayuki Tsunakawa

#25Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#24)
Re: Batch insert in CTAS/MatView code

On Wed, May 26, 2021 at 12:18 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

Hello Paul-san,

From: Daniel Gustafsson <daniel@yesql.se>

In an off-list discussion with Paul, we decided to withdraw this patch for now
and instead create a new entry when there is a re-worked patch. This has
now been done in the CF app.

Would you mind if I take over this patch for PG 15? I find this promising, as Bharath-san demonstrated good performance by combining your patch and the parallel CTAS that Bharath-san has been working on. We'd like to do things that enhance parallelism.

Please allow me to start posting the revised patches next week if I can't get your reply.

Hi,

I think the "New Table Access Methods for Multi and Single Inserts"
patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
View, Refresh Mat View and so on. It also has a patch for multi
inserts in CTAS and Refresh Mat View
(v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
Please see that thread and feel free to review it.

[1]: /messages/by-id/CALj2ACXdrOmB6Na9amHWZHKvRT3Z0nwTRsCwoMT-npOBtmXLXg@mail.gmail.com

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

#26tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Bharath Rupireddy (#25)
RE: Batch insert in CTAS/MatView code

From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>

I think the "New Table Access Methods for Multi and Single Inserts"
patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
View, Refresh Mat View and so on. It also has a patch for multi
inserts in CTAS and Refresh Mat View
(v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
Please see that thread and feel free to review it.

Ouch, I didn't notice that the patch was reborn in that thread. OK.

Could you add it to the CF if you haven't yet?

Regards
Takayuki Tsunakawa

#27Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#26)
Re: Batch insert in CTAS/MatView code

On Wed, May 26, 2021 at 12:49 PM tsunakawa.takay@fujitsu.com
<tsunakawa.takay@fujitsu.com> wrote:

From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>

I think the "New Table Access Methods for Multi and Single Inserts"
patches at [1] make multi insert usage easy for COPY, CTAS/Create Mat
View, Refresh Mat View and so on. It also has a patch for multi
inserts in CTAS and Refresh Mat View
(v6-0002-CTAS-and-REFRESH-Mat-View-With-New-Multi-Insert-T.patch).
Please see that thread and feel free to review it.

Ouch, I didn't notice that the patch was reborn in that thread. OK.

Could you add it to the CF if you haven't yet?

It already has one - https://commitfest.postgresql.org/33/2871/

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com