Inserting heap tuples in bulk in COPY

Started by Heikki Linnakangas over 14 years ago (32 messages)
#1 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
4 attachment(s)

COPY is slow. Let's make it faster. One obvious optimization is to
insert heap tuples in bigger chunks, instead of calling heap_insert()
separately for every tuple. That saves the overhead of pinning and
locking the buffer for every tuple, and you only need to write one WAL
record for all the tuples written to the same page, instead of one for
each tuple.

Attached is a WIP patch to do that. It adds a new function,
heap_multi_insert, which does the same thing as heap_insert, but works
in bulk. It takes an array of tuples as an argument, tries to cram as
many of them into the chosen target page as it can, and writes only a
single WAL record for the operation.
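
To illustrate the intended calling pattern, here is a minimal sketch of
how a caller such as COPY would use the new functions; the helper name
and buffer size below are illustrative only, not part of the patch:

  #include "postgres.h"
  #include "access/heapam.h"

  #define TUPLE_BUFFER_SIZE 100   /* same constant the WIP patch uses in copy.c */

  /*
   * Illustrative helper (not in the patch): flush a buffer of tuples that
   * have already been run through heap_prepare_insert().  With this WIP
   * version, heap_multi_insert() inserts at least one tuple and as many
   * more as fit on the chosen target page, so we loop until all of them
   * have been inserted.
   */
  static void
  flush_tuple_buffer(Relation rel, HeapTuple *tuples, int ntuples,
                     int options, BulkInsertState bistate)
  {
      int     ndone = 0;

      while (ndone < ntuples)
          ndone += heap_multi_insert(rel, tuples + ndone, ntuples - ndone,
                                     options, bistate);
  }

(The later version of the patch in this thread moves that loop inside
heap_multi_insert() itself, so the caller only makes one call per buffer.)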

This gives a significant speedup to COPY, particularly for narrow
tables with small tuples. Grouping multiple tuples into one WAL record
significantly reduces the WAL volume, as well as the time spent writing
that WAL. The reduced overhead of repeatedly locking the buffer is also
most noticeable on narrow tables. On wider tables, the effects are smaller.
See copytest-results.txt, containing test results with three tables of
different widths. The scripts used to get those numbers are also attached.

Triggers complicate this. I believe it is only safe to group tuples
together like this if the table has no triggers. A BEFORE ROW trigger
might run a SELECT on the table being copied to, and check if some of
the tuples we're about to insert exist. If we run BEFORE ROW triggers
for a bunch of tuples first, and only then insert them, none of the
trigger invocations will see the other rows as inserted yet. Similarly,
if we run AFTER ROW triggers after inserting a bunch of tuples, the
trigger for each of the insertions would see all the inserted rows. So
at least for now, the patch simply falls back to inserting one row at a
time if there are any triggers on the table.

The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

copy-heap-multi-insert-1.patch (text/x-diff)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 9f1bcf1..0594332 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -24,6 +24,7 @@
  *		heap_getnext	- retrieve next tuple in scan
  *		heap_fetch		- retrieve tuple with given tid
  *		heap_insert		- insert tuple into a relation
+ *		heap_multi_insert - insert multiple tuples into a relation
  *		heap_delete		- delete a tuple from a relation
  *		heap_update		- replace a tuple in a relation with another tuple
  *		heap_markpos	- mark scan position
@@ -1860,11 +1861,39 @@ Oid
 heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate)
 {
+	HeapTuple heaptup;
+
+	heaptup = heap_prepare_insert(relation, tup, cid, options, bistate);
+
+	heap_multi_insert(relation, &heaptup, 1, options, bistate);
+
+	/*
+	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
+	 * back to the caller's image, too.
+	 */
+	if (heaptup != tup)
+	{
+		tup->t_self = heaptup->t_self;
+		heap_freetuple(heaptup);
+	}
+
+	return HeapTupleGetOid(tup);
+}
+
+/*
+ * Prepare a tuple for insertion with heap_multi_insert. This sets the
+ * tuple header fields, assigns an OID, and toasts the tuple if necessary.
+ * Returns a toasted version of the tuple if it was toasted, or the original
+ * tuple if not.
+ *
+ * This needs to be called for each tuple before calling heap_multi_insert().
+ */
+HeapTuple
+heap_prepare_insert(Relation relation, HeapTuple tup, CommandId cid,
+					int options, BulkInsertState bistate)
+{
 	TransactionId xid = GetCurrentTransactionId();
-	HeapTuple	heaptup;
-	Buffer		buffer;
-	Buffer		vmbuffer = InvalidBuffer;
-	bool		all_visible_cleared = false;
+	HeapTuple heaptup;
 
 	if (relation->rd_rel->relhasoids)
 	{
@@ -1916,6 +1945,39 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	else
 		heaptup = tup;
 
+	return heaptup;
+}
+
+/*
+ * Inserts tuples to relation. This always inserts at least one tuple, and
+ * opportunistically more if the chosen target page happens to have room, up
+ * to ntuples. Returns the number of tuples inserted.
+ */
+int
+heap_multi_insert(Relation relation, HeapTuple *heaptuples, int ntuples,
+				  int options, BulkInsertState bistate)
+{
+	HeapTuple	heaptup = heaptuples[0];
+	Buffer		buffer;
+	Buffer		vmbuffer = InvalidBuffer;
+	bool		all_visible_cleared = false;
+	int			i;
+	int			ndone;
+	char	   *scratch = NULL;
+	int			scratchused = 0;
+	Page		page;
+
+	/*
+	 * Allocate some memory to use for constructing the WAL record. Using
+	 * palloc() within a critical section is not safe, so we allocate this
+	 * beforehand.
+	 */
+	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
+		scratch = palloc(BLCKSZ);
+
+	if (IsSystemRelation(relation))
+		Assert(ntuples == 1);
+
 	/*
 	 * Find buffer to insert this tuple into.  If the page is all visible,
 	 * this will also pin the requisite visibility map page.
@@ -1923,6 +1985,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	buffer = RelationGetBufferForTuple(relation, heaptup->t_len,
 									   InvalidBuffer, options, bistate,
 									   &vmbuffer, NULL);
+	page = BufferGetPage(buffer);
 
 	/*
 	 * We're about to do the actual insert -- check for conflict at the
@@ -1931,20 +1994,25 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CheckForSerializableConflictIn(relation, NULL, buffer);
 
-	/* NO EREPORT(ERROR) from here till changes are logged */
-	START_CRIT_SECTION();
-
-	RelationPutHeapTuple(relation, buffer, heaptup);
-
-	if (PageIsAllVisible(BufferGetPage(buffer)))
+	if (PageIsAllVisible(page))
 	{
 		all_visible_cleared = true;
-		PageClearAllVisible(BufferGetPage(buffer));
+		PageClearAllVisible(page);
 		visibilitymap_clear(relation,
 							ItemPointerGetBlockNumber(&(heaptup->t_self)),
 							vmbuffer);
 	}
 
+	/* NO EREPORT(ERROR) from here till changes are logged */
+	START_CRIT_SECTION();
+
+	ndone = 0;
+	do
+	{
+		heaptup = heaptuples[ndone++];
+		RelationPutHeapTuple(relation, buffer, heaptup);
+	} while (ndone < ntuples && PageGetHeapFreeSpace(page) > MAXALIGN(heaptup->t_len));
+
 	/*
 	 * XXX Should we set PageSetPrunable on this page ?
 	 *
@@ -1961,11 +2029,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	/* XLOG stuff */
 	if (!(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation))
 	{
+		XLogRecPtr	recptr;
+
+		if (ntuples == 1)
+		{
 		xl_heap_insert xlrec;
 		xl_heap_header xlhdr;
-		XLogRecPtr	recptr;
 		XLogRecData rdata[3];
-		Page		page = BufferGetPage(buffer);
 		uint8		info = XLOG_HEAP_INSERT;
 
 		xlrec.all_visible_cleared = all_visible_cleared;
@@ -2007,10 +2077,79 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			PageGetMaxOffsetNumber(page) == FirstOffsetNumber)
 		{
 			info |= XLOG_HEAP_INIT_PAGE;
-			rdata[1].buffer = rdata[2].buffer = InvalidBuffer;
+			rdata[1].buffer = InvalidBuffer;
+		}
+
+		recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
 		}
+		else
+		{
+			xl_heap_multi_insert *xlrec;
+			XLogRecData rdata[2];
+			uint8		info = XLOG_HEAP2_MULTI_INSERT;
+			char	   *tupledata;
+			int			totaldatalen;
+
+			xlrec = (xl_heap_multi_insert *) scratch;
+			scratchused += SizeOfHeapMultiInsert(ndone);
+
+			xlrec->all_visible_cleared = all_visible_cleared;
+			xlrec->node = relation->rd_node;
+			xlrec->blkno = BufferGetBlockNumber(buffer);
+			xlrec->ntuples = ndone;
+
+			tupledata = &scratch[scratchused];
+			totaldatalen = 0;
+
+			for (i = 0; i < ndone; i++)
+			{
+				int tuplen;
+
+				heaptup = heaptuples[i];
+				xlrec->tuphdrs[i].offset = ItemPointerGetOffsetNumber(&heaptup->t_self);
+				xlrec->tuphdrs[i].xlhdr.t_infomask2 = heaptup->t_data->t_infomask2;
+				xlrec->tuphdrs[i].xlhdr.t_infomask = heaptup->t_data->t_infomask;
+				xlrec->tuphdrs[i].xlhdr.t_hoff = heaptup->t_data->t_hoff;
+
+				/* write bitmap [+ padding] [+ oid] + data */
+				tuplen = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+				memcpy(&tupledata[totaldatalen],
+					   (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+					   tuplen);
+				totaldatalen += tuplen;
+			}
+			scratchused += totaldatalen;
+
+			rdata[0].data = (char *) &xlrec;
+			rdata[0].len = SizeOfHeapMultiInsert(ndone);
+			rdata[0].buffer = InvalidBuffer;
+			rdata[0].next = &(rdata[1]);
 
-		recptr = XLogInsert(RM_HEAP_ID, info, rdata);
+			/*
+			 * note we mark rdata[1] as belonging to buffer; if XLogInsert decides
+			 * to write the whole page to the xlog, we don't need to store
+			 * xl_heap_header in the xlog. XXX: we do anyway
+			 */
+			rdata[1].data = tupledata;
+			rdata[1].len = totaldatalen;
+			rdata[1].buffer = buffer;
+			rdata[1].buffer_std = true;
+			rdata[1].next = NULL;
+
+			/*
+			 * If this is the single and first tuple on page, we can reinit the
+			 * page instead of restoring the whole thing.  Set flag, and hide
+			 * buffer references from XLogInsert.
+			 */
+			if (ItemPointerGetOffsetNumber(&(heaptuples[0]->t_self)) == FirstOffsetNumber &&
+				PageGetMaxOffsetNumber(page) == FirstOffsetNumber + ndone - 1)
+			{
+				info |= XLOG_HEAP_INIT_PAGE;
+				rdata[1].buffer = InvalidBuffer;
+			}
+
+			recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
+		}
 
 		PageSetLSN(page, recptr);
 		PageSetTLI(page, ThisTimeLineID);
@@ -2030,19 +2169,13 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CacheInvalidateHeapTuple(relation, heaptup);
 
-	pgstat_count_heap_insert(relation);
+	for (i = 0; i < ndone; i++)
+		pgstat_count_heap_insert(relation);
 
-	/*
-	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
-	 * back to the caller's image, too.
-	 */
-	if (heaptup != tup)
-	{
-		tup->t_self = heaptup->t_self;
-		heap_freetuple(heaptup);
-	}
+	if (scratch)
+		pfree(scratch);
 
-	return HeapTupleGetOid(tup);
+	return ndone;
 }
 
 /*
@@ -4729,6 +4862,12 @@ heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
 		XLogRecordPageWithFreeSpace(xlrec->target.node, blkno, freespace);
 }
 
+static void
+heap_xlog_multi_insert(XLogRecPtr lsn, XLogRecord *record)
+{
+	/* TODO */
+}
+
 /*
  * Handles UPDATE and HOT_UPDATE
  */
@@ -5118,6 +5257,9 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(lsn, record);
 			break;
+		case XLOG_HEAP2_MULTI_INSERT:
+			heap_xlog_multi_insert(lsn, record);
+			break;
 		default:
 			elog(PANIC, "heap2_redo: unknown op code %u", info);
 	}
@@ -5255,6 +5397,10 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
 						 xlrec->node.spcNode, xlrec->node.dbNode,
 						 xlrec->node.relNode, xlrec->block);
 	}
+	else if (info == XLOG_HEAP2_MULTI_INSERT)
+	{
+		/* TODO */
+	}
 	else
 		appendStringInfo(buf, "UNKNOWN");
 }
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index 528a3a1..bc0a2fe 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -1842,11 +1842,16 @@ CopyFrom(CopyState cstate)
 	ExprContext *econtext;
 	TupleTableSlot *myslot;
 	MemoryContext oldcontext = CurrentMemoryContext;
+
 	ErrorContextCallback errcontext;
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			hi_options = 0; /* start with default heap_insert options */
 	BulkInsertState bistate;
 	uint64		processed = 0;
+	bool		useHeapMultiInsert = false;
+	int			nBufferedTuples = 0;
+#define TUPLE_BUFFER_SIZE 100
+	HeapTuple	bufferedTuples[TUPLE_BUFFER_SIZE];
 
 	Assert(cstate->rel);
 
@@ -1941,6 +1946,22 @@ CopyFrom(CopyState cstate)
 	/* Triggers might need a slot as well */
 	estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);
 
+	/*
+	 * If there isn't any triggers on the table, we can buffer the constructed
+	 * tuples and insert them in bigger chunks using heap_multi_insert(). It's
+	 * not clear if this would be safe with triggers. A trigger could look at
+	 * the rows already inserted and act differently based on them, for
+	 * example, and if we insert them in chunks, so an AFTER ROW trigger would
+	 * see the whole chunk as inserted (or as not inserted, for a BEFORE ROW
+	 * trigger), even though the triggers for the other tuples had not been
+	 * run yet.
+	 */
+	if (resultRelInfo->ri_TrigDesc == NULL)
+	{
+		useHeapMultiInsert = true;
+		nBufferedTuples = 0;
+	}
+
 	/* Prepare to catch AFTER triggers. */
 	AfterTriggerBeginQuery();
 
@@ -1972,8 +1993,11 @@ CopyFrom(CopyState cstate)
 
 		CHECK_FOR_INTERRUPTS();
 
-		/* Reset the per-tuple exprcontext */
-		ResetPerTupleExprContext(estate);
+		if (nBufferedTuples == 0)
+		{
+			/* Reset the per-tuple exprcontext */
+			ResetPerTupleExprContext(estate);
+		}
 
 		/* Switch into its memory context */
 		MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -2016,18 +2040,73 @@ CopyFrom(CopyState cstate)
 			if (cstate->rel->rd_att->constr)
 				ExecConstraints(resultRelInfo, slot, estate);
 
-			/* OK, store the tuple and create index entries for it */
-			heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
+			if (useHeapMultiInsert)
+			{
+				/* Insert this tuple to the tuple buffer */
+				bufferedTuples[nBufferedTuples++] =
+					heap_prepare_insert(cstate->rel, tuple, mycid,
+										hi_options, bistate);
+
+				/* If the buffer filled up, flush it */
+				if (nBufferedTuples == TUPLE_BUFFER_SIZE)
+				{
+					int		ninserted;
+					int		nremaining;
+					HeapTuple *remainingTuples;
 
-			if (resultRelInfo->ri_NumIndices > 0)
-				recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
-													   estate);
+					/*
+					 * Call heap_multi_insert() until all the tuples have been
+					 * inserted. We must flush it fully, so that we can reset
+					 * the per-tuple memory context (which is now a bit of a
+					 * misnomer).
+					 */
+					remainingTuples = bufferedTuples;
+					nremaining = nBufferedTuples;
+					while (nremaining > 0)
+					{
+						ninserted = heap_multi_insert(cstate->rel,
+													  remainingTuples,
+													  nremaining,
+													  hi_options,
+													  bistate);
+						nremaining -= ninserted;
+						remainingTuples = remainingTuples + ninserted;
+					}
+
+					/*
+					 * If there are any indexes, update the indexes for all the
+					 * inserted tuples.
+					 */
+					if (resultRelInfo->ri_NumIndices > 0)
+					{
+						int i;
+						for (i = 0; i < nBufferedTuples; i++)
+						{
+							ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
+							recheckIndexes = ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+																   estate);
+							list_free(recheckIndexes);
+						}
+					}
 
-			/* AFTER ROW INSERT Triggers */
-			ExecARInsertTriggers(estate, resultRelInfo, tuple,
-								 recheckIndexes);
+					nBufferedTuples = 0;
+				}
+			}
+			else
+			{
+				/* OK, store the tuple and create index entries for it */
+				heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
+
+				if (resultRelInfo->ri_NumIndices > 0)
+					recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+														   estate);
+
+				/* AFTER ROW INSERT Triggers */
+				ExecARInsertTriggers(estate, resultRelInfo, tuple,
+									 recheckIndexes);
 
-			list_free(recheckIndexes);
+				list_free(recheckIndexes);
+			}
 
 			/*
 			 * We count only tuples not suppressed by a BEFORE INSERT trigger;
@@ -2038,6 +2117,48 @@ CopyFrom(CopyState cstate)
 		}
 	}
 
+	/* Flush any remaining buffered tuples */
+	if (nBufferedTuples > 0)
+	{
+		int		ninserted;
+		int		nremaining;
+		HeapTuple *remainingTuples;
+
+		/*
+		 * Call heap_multi_insert() until all the tuples have been
+		 * inserted.
+		 */
+		remainingTuples = bufferedTuples;
+		nremaining = nBufferedTuples;
+		while(nremaining > 0)
+		{
+			ninserted = heap_multi_insert(cstate->rel,
+										  remainingTuples,
+										  nremaining,
+										  hi_options,
+										  bistate);
+			nremaining -= ninserted;
+			remainingTuples = remainingTuples + ninserted;
+		}
+
+		/*
+		 * If there are any indexes, update the indexes for all the
+		 * inserted tuples.
+		 */
+		if (resultRelInfo->ri_NumIndices > 0)
+		{
+			int i;
+			for (i = 0; i < nBufferedTuples; i++)
+			{
+				List *recheckIndexes;
+				ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
+				recheckIndexes = ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+													   estate);
+				list_free(recheckIndexes);
+			}
+		}
+	}
+
 	/* Done, clean up */
 	error_context_stack = errcontext.previous;
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 56036a8..e2a36f7 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -97,8 +97,12 @@ extern void setLastTid(const ItemPointer tid);
 extern BulkInsertState GetBulkInsertState(void);
 extern void FreeBulkInsertState(BulkInsertState);
 
+extern HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup, CommandId cid,
+					int options, BulkInsertState bistate);
 extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate);
+extern int heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
+			int options, BulkInsertState bistate);
 extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
 			ItemPointer ctid, TransactionId *update_xmax,
 			CommandId cid, Snapshot crosscheck, bool wait);
diff --git a/src/include/access/htup.h b/src/include/access/htup.h
index ba5d9b2..f978dd0 100644
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@@ -607,6 +607,7 @@ typedef HeapTupleData *HeapTuple;
 /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
 #define XLOG_HEAP2_CLEANUP_INFO 0x30
 #define XLOG_HEAP2_VISIBLE		0x40
+#define XLOG_HEAP2_MULTI_INSERT	0x50
 
 /*
  * All what we need to find changed tuple
@@ -660,6 +661,25 @@ typedef struct xl_heap_insert
 
 #define SizeOfHeapInsert	(offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool))
 
+typedef struct xl_multi_header
+{
+	OffsetNumber offset;
+	xl_heap_header xlhdr;
+} xl_multi_header;
+
+/* This is what we need to know about insert */
+typedef struct xl_heap_multi_insert
+{
+	RelFileNode node;
+	BlockNumber	blkno;
+	bool		all_visible_cleared;	/* PD_ALL_VISIBLE was cleared */
+	uint16		ntuples;
+	xl_multi_header tuphdrs[1]; /* var length */
+	/* TUPLE DATA FOLLOW AT END OF STRUCT */
+} xl_heap_multi_insert;
+
+#define SizeOfHeapMultiInsert(n)	(offsetof(xl_heap_multi_insert, tuphdrs) + sizeof(xl_multi_header) * (n))
+
 /* This is what we need to know about update|hot_update */
 typedef struct xl_heap_update
 {
copytest-results.txt (text/plain)
copytests-setup.sql (text/x-sql)
copytests-run.sh (application/x-sh)
#2 Gurjeet Singh
singh.gurjeet@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Inserting heap tuples in bulk in COPY

On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

COPY is slow.

No kidding!

So at least for now, the patch simply falls back to inserting one row at a
time if there are any triggers on the table.

Maybe we want to change that to "fall back to old ways if there are any FOR
EACH ROW triggers", since FOR EACH STATEMENT triggers won't be bothered by
this optimization.

--
Gurjeet Singh
EnterpriseDB Corporation
The Enterprise PostgreSQL Company

#3 Florian Pflug
fgp@phlo.org
In reply to: Heikki Linnakangas (#1)
Re: Inserting heap tuples in bulk in COPY

On Aug 12, 2011, at 21:16, Heikki Linnakangas wrote:

Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.

Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.

best regards,
Florian Pflug

#4 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Florian Pflug (#3)
Re: Inserting heap tuples in bulk in COPY

On 12.08.2011 22:57, Florian Pflug wrote:

On Aug12, 2011, at 21:16 , Heikki Linnakangas wrote:

Triggers complicate this. I believe it is only safe to group tuples together like this if the table has no triggers. A BEFORE ROW trigger might run a SELECT on the table being copied to, and check if some of the tuples we're about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples first, and only then insert them, none of the trigger invocations will see the other rows as inserted yet. Similarly, if we run AFTER ROW triggers after inserting a bunch of tuples, the trigger for each of the insertions would see all the inserted rows.

Don't we run AFTER ROW triggers after inserting *all* the tuples anyway? At least this is what we do in the case of INSERT/UPDATE/DELETE if I'm not mistaken.

Um, yes, you're right. Now I feel silly. The above still applies to
BEFORE ROW triggers, though.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Inserting heap tuples in bulk in COPY

On Fri, Aug 12, 2011 at 3:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

COPY is slow. Let's make it faster. One obvious optimization is to insert
heap tuples in bigger chunks, instead of calling heap_insert() separately
for every tuple. That saves the overhead of pinning and locking the buffer
for every tuple, and you only need to write one WAL record for all the
tuples written to the same page, instead of one for each tuple.

Attached is a WIP patch to do that. It adds a new function,
heap_multi_insert, which does the same thing as heap_insert, but works in
bulk. It takes an array of tuples as argument, and tries to cram as many of
them into the chosen target page as it can, and only writes a single WAL
record of the operation.

This gives a significant speedup to COPY, particularly for narrow tables,
with small tuples. Grouping multiple tuples into one WAL record reduces the
WAL volume significantly, and the time spent in writing that WAL. The
reduced overhead of repeatedly locking the buffer is also most noticeable on
narrow tables. On wider tables, the effects are smaller. See
copytest-results.txt, containing test results with three tables of different
widths. The scripts used to get those numbers are also attached.

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.

The patch is WIP, mainly because I didn't write the WAL replay routines yet,
but please let me know if you see any issues.

I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work. I think the approach
you've taken here is good.

Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6 Andrew Dunstan
andrew@dunslane.net
In reply to: Robert Haas (#5)
Re: Inserting heap tuples in bulk in COPY

On 08/12/2011 04:57 PM, Robert Haas wrote:

I thought about trying to do this at one point in the past, but I
couldn't figure out exactly how to make it work. I think the approach
you've taken here is good.

Aside from the point already raised about needing to worry only about
BEFORE ROW triggers, I don't see any showstoppers.

Yeah, this looks very promising indeed. Well done. In fact, I'm asking
myself how hard it will be to backport for one particular customer of
ours, for whom the WAL load is a major hotspot.

cheers

andrew

#7 Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#1)
Re: Inserting heap tuples in bulk in COPY

On Fri, Aug 12, 2011 at 8:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

COPY is slow. Let's make it faster. One obvious optimization is to insert
heap tuples in bigger chunks, instead of calling heap_insert() separately
for every tuple. That saves the overhead of pinning and locking the buffer
for every tuple, and you only need to write one WAL record for all the
tuples written to the same page, instead of one for each tuple.

We don't pin the buffer for every tuple, that optimisation is already done...

When we discussed this before you said that it wasn't worth trying to
do this additional work - it was certainly a smaller gain than the one
we achieved by removing the pinning overhead.

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.

So I'm a little surprised to see you working on this and I'm guessing
that the COPY improvement with indexes is barely noticeable. This
would be a nice improvement, but not until the bulk index inserts are
done.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

#8 Merlin Moncure
mmoncure@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Inserting heap tuples in bulk in COPY

On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

COPY is slow. Let's make it faster. One obvious optimization is to insert
heap tuples in bigger chunks, instead of calling heap_insert() separately
for every tuple. That saves the overhead of pinning and locking the buffer
for every tuple, and you only need to write one WAL record for all the
tuples written to the same page, instead of one for each tuple.

Attached is a WIP patch to do that. It adds a new function,
heap_multi_insert, which does the same thing as heap_insert, but works in
bulk. It takes an array of tuples as argument, and tries to cram as many of
them into the chosen target page as it can, and only writes a single WAL
record of the operation.

This gives a significant speedup to COPY, particularly for narrow tables,
with small tuples. Grouping multiple tuples into one WAL record reduces the
WAL volume significantly, and the time spent in writing that WAL. The
reduced overhead of repeatedly locking the buffer is also most noticeable on
narrow tables. On wider tables, the effects are smaller. See
copytest-results.txt, containing test results with three tables of different
widths. The scripts used to get those numbers are also attached.

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.

But generic RI triggers would be ok, right?

merlin

#9 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Simon Riggs (#7)
Re: Inserting heap tuples in bulk in COPY

On 13.08.2011 00:17, Simon Riggs wrote:

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.

Sure, if you have indexes on the table already, then the overhead of
updating them is more significant. I am actually working on that too, I
will make a separate post about that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Merlin Moncure (#8)
Re: Inserting heap tuples in bulk in COPY

On 13.08.2011 00:26, Merlin Moncure wrote:

On Fri, Aug 12, 2011 at 2:16 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Triggers complicate this. I believe it is only safe to group tuples together
like this if the table has no triggers. A BEFORE ROW trigger might run a
SELECT on the table being copied to, and check if some of the tuples we're
about to insert exist. If we run BEFORE ROW triggers for a bunch of tuples
first, and only then insert them, none of the trigger invocations will see
the other rows as inserted yet. Similarly, if we run AFTER ROW triggers
after inserting a bunch of tuples, the trigger for each of the insertions
would see all the inserted rows. So at least for now, the patch simply falls
back to inserting one row at a time if there are any triggers on the table.

But generic RI triggers would be ok, right?

RI triggers are AFTER ROW triggers, which we concluded to be OK after
all, so they would be ok.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#11 Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Heikki Linnakangas (#10)
Re: Inserting heap tuples in bulk in COPY

On 12 August 2011 23:19, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Triggers complicate this. I believe it is only safe to group tuples
together like this if the table has no triggers. A BEFORE ROW trigger
might run a SELECT on the table being copied to, and check if some of
the tuples we're about to insert exist. If we run BEFORE ROW triggers
for a bunch of tuples first, and only then insert them, none of the
trigger invocations will see the other rows as inserted yet. Similarly,
if we run AFTER ROW triggers after inserting a bunch of tuples, the
trigger for each of the insertions would see all the inserted rows. So
at least for now, the patch simply falls back to inserting one row at a
time if there are any triggers on the table.

I guess DEFAULT values could also suffer from a similar problem to
BEFORE ROW triggers though (in theory a DEFAULT could be based on a
function that SELECTs from the table being copied to).
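
For what it's worth, here is a minimal sketch of how such volatile
defaults could be detected up front, so that COPY could fall back to
plain heap_insert() in that case. The standalone helper below is
hypothetical; it is along the lines of the contain_volatile_functions()
check that a later version of the patch in this thread adds to
BeginCopyFrom():

  #include "postgres.h"
  #include "optimizer/clauses.h"
  #include "rewrite/rewriteHandler.h"
  #include "utils/rel.h"

  /*
   * Hypothetical helper: return true if any column default of 'rel'
   * contains a volatile function.  Batching inserts could change what
   * such a default expression sees, so the caller would fall back to
   * inserting one row at a time.
   */
  static bool
  relation_has_volatile_defaults(Relation rel)
  {
      TupleDesc   tupDesc = RelationGetDescr(rel);
      int         attnum;

      for (attnum = 1; attnum <= tupDesc->natts; attnum++)
      {
          Node       *defexpr = build_column_default(rel, attnum);

          if (defexpr != NULL && contain_volatile_functions(defexpr))
              return true;
      }
      return false;
  }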

Regards,
Dean

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Inserting heap tuples in bulk in COPY

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.

Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation. That will likely have
bad consequences for the performance of every other operation.

regards, tom lane

#13 Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#12)
1 attachment(s)
Re: Inserting heap tuples in bulk in COPY

On 13.08.2011 17:33, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.

Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL size melts away if you
do. And if the page you're copying to is not empty, logging the whole
page is even more expensive. You'd need to fall back to retail inserts
in that case, which complicates the logic.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation. That will likely have
bad consequences for the performance of every other operation.

Ok. I doubt it makes any noticeable difference to performance, but now
that I've done it, duplicating the code is more readable, too.

Attached is a new version of the patch. It is now complete, including
WAL replay code.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

copy-heap-multi-insert-2.patch (text/x-diff)
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 06db65d..b479613 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -24,6 +24,7 @@
  *		heap_getnext	- retrieve next tuple in scan
  *		heap_fetch		- retrieve tuple with given tid
  *		heap_insert		- insert tuple into a relation
+ *		heap_multi_insert - insert multiple tuples into a relation
  *		heap_delete		- delete a tuple from a relation
  *		heap_update		- replace a tuple in a relation with another tuple
  *		heap_markpos	- mark scan position
@@ -2030,7 +2031,7 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 	 */
 	CacheInvalidateHeapTuple(relation, heaptup, NULL);
 
-	pgstat_count_heap_insert(relation);
+	pgstat_count_heap_insert(relation, 1);
 
 	/*
 	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
@@ -2046,6 +2047,276 @@ heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 }
 
 /*
+ * Subroutine for heap_multi_insert(). Prepares a tuple for insertion. This
+ * sets the tuple header fields, assigns an OID, and toasts the tuple if
+ * necessary. Returns a toasted version of the tuple if it was toasted, or
+ * the original tuple if not.
+ */
+static HeapTuple
+heap_prepare_insert(Relation relation, HeapTuple tup, CommandId cid,
+					int options)
+{
+	TransactionId xid = GetCurrentTransactionId();
+	HeapTuple heaptup;
+
+	if (relation->rd_rel->relhasoids)
+	{
+#ifdef NOT_USED
+		/* this is redundant with an Assert in HeapTupleSetOid */
+		Assert(tup->t_data->t_infomask & HEAP_HASOID);
+#endif
+
+		/*
+		 * If the object id of this tuple has already been assigned, trust the
+		 * caller.	There are a couple of ways this can happen.  At initial db
+		 * creation, the backend program sets oids for tuples. When we define
+		 * an index, we set the oid.  Finally, in the future, we may allow
+		 * users to set their own object ids in order to support a persistent
+		 * object store (objects need to contain pointers to one another).
+		 */
+		if (!OidIsValid(HeapTupleGetOid(tup)))
+			HeapTupleSetOid(tup, GetNewOid(relation));
+	}
+	else
+	{
+		/* check there is not space for an OID */
+		Assert(!(tup->t_data->t_infomask & HEAP_HASOID));
+	}
+
+	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
+	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
+	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
+	HeapTupleHeaderSetXmin(tup->t_data, xid);
+	HeapTupleHeaderSetCmin(tup->t_data, cid);
+	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* for cleanliness */
+	tup->t_tableOid = RelationGetRelid(relation);
+
+	/*
+	 * If the new tuple is too big for storage or contains already toasted
+	 * out-of-line attributes from some other relation, invoke the toaster.
+	 *
+	 * Note: below this point, heaptup is the data we actually intend to store
+	 * into the relation; tup is the caller's original untoasted data.
+	 */
+	if (relation->rd_rel->relkind != RELKIND_RELATION)
+	{
+		/* toast table entries should never be recursively toasted */
+		Assert(!HeapTupleHasExternal(tup));
+		heaptup = tup;
+	}
+	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
+		heaptup = toast_insert_or_update(relation, tup, NULL, options);
+	else
+		heaptup = tup;
+
+	return heaptup;
+}
+
+/*
+ *	heap_multi_insert	- insert multiple tuple into a heap
+ *
+ * This is like heap_insert(), but inserts multiple tuples in one operation.
+ * That's faster than calling heap_insert() in a loop, because when multiple
+ * tuples can be inserted on a single page, we can write just a single WAL
+ * record covering all of them, and only need to lock/unlock the page once.
+ *
+ * Note: this leaks memory into the current memory context. You can create a
+ * temporary context before calling this, if that's a problem.
+ */
+void
+heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
+				  CommandId cid, int options, BulkInsertState bistate)
+{
+	HeapTuple  *heaptuples;
+	Buffer		buffer;
+	Buffer		vmbuffer = InvalidBuffer;
+	bool		all_visible_cleared = false;
+	int			i;
+	int			ndone;
+	char	   *scratch = NULL;
+	Page		page;
+	bool		needwal;
+
+	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+
+	/* Toast and set header data in all the tuples */
+	heaptuples = palloc(ntuples * sizeof(HeapTuple));
+	for (i = 0; i < ntuples; i++)
+		heaptuples[i] = heap_prepare_insert(relation, tuples[i], cid, options);
+
+	/*
+	 * Allocate some memory to use for constructing the WAL record. Using
+	 * palloc() within a critical section is not safe, so we allocate this
+	 * beforehand.
+	 */
+	if (needwal)
+		scratch = palloc(BLCKSZ);
+
+	ndone = 0;
+	while (ndone < ntuples)
+	{
+		int nthispage;
+
+		/*
+		 * Find buffer to insert this tuple into.  If the page is all visible,
+		 * this will also pin the requisite visibility map page.
+		 */
+		buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
+										   InvalidBuffer, options, bistate,
+										   &vmbuffer, NULL);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * We're about to do the actual insert -- check for conflict at the
+		 * relation or buffer level first, to avoid possibly having to roll back
+		 * work we've just done.
+		 */
+		CheckForSerializableConflictIn(relation, NULL, buffer);
+
+		if (PageIsAllVisible(page))
+		{
+			all_visible_cleared = true;
+			PageClearAllVisible(page);
+			visibilitymap_clear(relation,
+								BufferGetBlockNumber(buffer),
+								vmbuffer);
+		}
+
+		/* NO EREPORT(ERROR) from here till changes are logged */
+		START_CRIT_SECTION();
+
+		/* Put as many tuples as fit on this page */
+		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
+		{
+			HeapTuple	heaptup = heaptuples[ndone + nthispage];
+
+			if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len))
+				break;
+
+			RelationPutHeapTuple(relation, buffer, heaptup);
+		}
+
+		/*
+		 * XXX Should we set PageSetPrunable on this page ? See heap_insert()
+		 */
+
+		MarkBufferDirty(buffer);
+
+		/* XLOG stuff */
+		if (needwal)
+		{
+			XLogRecPtr	recptr;
+			xl_heap_multi_insert *xlrec;
+			XLogRecData rdata[2];
+			uint8		info = XLOG_HEAP2_MULTI_INSERT;
+			char	   *tupledata;
+			int			totaldatalen;
+			char	   *scratchptr;
+			bool		init;
+
+			/*
+			 * If the page was previously empty, we can reinit the page
+			 * instead of restoring the whole thing.
+			 */
+			init = (ItemPointerGetOffsetNumber(&(heaptuples[ndone]->t_self)) == FirstOffsetNumber &&
+					PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1);
+
+			/* allocate xl_heap_multi_insert struct from the scratch area */
+			scratchptr = scratch;
+			xlrec = (xl_heap_multi_insert *) scratchptr;
+			scratchptr += SizeOfHeapMultiInsert;
+
+			/* allocate offsets array, unless we're reinitializing the page */
+			if (!init)
+				scratchptr += nthispage * sizeof(OffsetNumber);
+
+			/* the rest of the scratch space is used for tuple data */
+			tupledata = scratchptr;
+
+			xlrec->all_visible_cleared = all_visible_cleared;
+			xlrec->node = relation->rd_node;
+			xlrec->blkno = BufferGetBlockNumber(buffer);
+			xlrec->ntuples = nthispage;
+
+			for (i = 0; i < nthispage; i++)
+			{
+				HeapTuple	heaptup = heaptuples[ndone + i];
+				xl_multi_insert_tuple *tuphdr;
+				int			datalen;
+
+				tuphdr = (xl_multi_insert_tuple *) SHORTALIGN(scratchptr);
+
+				scratchptr = ((char *) tuphdr) + SizeOfMultiInsertTuple;
+
+				if (!init)
+					xlrec->offsets[i] = ItemPointerGetOffsetNumber(&heaptup->t_self);
+				tuphdr->t_infomask2 = heaptup->t_data->t_infomask2;
+				tuphdr->t_infomask = heaptup->t_data->t_infomask;
+				tuphdr->t_hoff = heaptup->t_data->t_hoff;
+
+				/* write bitmap [+ padding] [+ oid] + data */
+				datalen = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+				memcpy(scratchptr,
+					   (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+					   datalen);
+				tuphdr->datalen = datalen;
+				scratchptr += datalen;
+			}
+			totaldatalen = scratchptr - tupledata;
+			Assert((scratchptr - scratch) < BLCKSZ);
+
+			rdata[0].data = (char *) xlrec;
+			rdata[0].len = tupledata - scratch;
+			rdata[0].buffer = InvalidBuffer;
+			rdata[0].next = &rdata[1];
+
+			rdata[1].data = tupledata;
+			rdata[1].len = totaldatalen;
+			rdata[1].buffer = buffer;
+			rdata[1].buffer_std = true;
+			rdata[1].next = NULL;
+
+			/*
+			 * If we're going to reinitialize the whole page using the WAL
+			 * record, hide buffer reference from XLogInsert.
+			 */
+			if (init)
+			{
+				rdata[1].buffer = InvalidBuffer;
+				info |= XLOG_HEAP_INIT_PAGE;
+			}
+
+			recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
+
+			PageSetLSN(page, recptr);
+			PageSetTLI(page, ThisTimeLineID);
+		}
+
+		END_CRIT_SECTION();
+
+		UnlockReleaseBuffer(buffer);
+		if (vmbuffer != InvalidBuffer)
+			ReleaseBuffer(vmbuffer);
+
+		ndone += nthispage;
+	}
+
+	/*
+	 * If tuples are cachable, mark them for invalidation from the caches in
+	 * case we abort.  Note it is OK to do this after releasing the buffer,
+	 * because the heaptuples data structure is all in local memory, not in
+	 * the shared buffer.
+	 */
+	if (IsSystemRelation(relation))
+	{
+		for (i = 0; i < ntuples; i++)
+			CacheInvalidateHeapTuple(relation, heaptuples[i], NULL);
+	}
+
+	pgstat_count_heap_insert(relation, ntuples);
+}
+
+/*
  *	simple_heap_insert - insert a tuple
  *
  * Currently, this routine differs from heap_insert only in supplying
@@ -4730,6 +5001,144 @@ heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
 }
 
 /*
+ * Handles MULTI_INSERT record type.
+ */
+static void
+heap_xlog_multi_insert(XLogRecPtr lsn, XLogRecord *record)
+{
+	char	   *recdata = XLogRecGetData(record);
+	xl_heap_multi_insert *xlrec;
+	Buffer		buffer;
+	Page		page;
+	struct
+	{
+		HeapTupleHeaderData hdr;
+		char		data[MaxHeapTupleSize];
+	}			tbuf;
+	HeapTupleHeader htup;
+	uint32		newlen;
+	Size		freespace;
+	BlockNumber blkno;
+	int			i;
+	bool		isinit = (record->xl_info & XLOG_HEAP_INIT_PAGE) != 0;
+
+	xlrec = (xl_heap_multi_insert *) recdata;
+	recdata += SizeOfHeapMultiInsert;
+
+	/*
+	 * If we're reinitializing the page, the tuples are stored in order from
+	 * FirstOffsetNumber. Otherwise there's an array of offsets in the WAL
+	 * record.
+	 */
+	if (!isinit)
+		recdata += sizeof(OffsetNumber) * xlrec->ntuples;
+
+	blkno = xlrec->blkno;
+
+	/*
+	 * The visibility map may need to be fixed even if the heap page is
+	 * already up-to-date.
+	 */
+	if (xlrec->all_visible_cleared)
+	{
+		Relation	reln = CreateFakeRelcacheEntry(xlrec->node);
+		Buffer		vmbuffer = InvalidBuffer;
+
+		visibilitymap_pin(reln, blkno, &vmbuffer);
+		visibilitymap_clear(reln, blkno, vmbuffer);
+		ReleaseBuffer(vmbuffer);
+		FreeFakeRelcacheEntry(reln);
+	}
+
+	if (record->xl_info & XLR_BKP_BLOCK_1)
+		return;
+
+	if (isinit)
+	{
+		buffer = XLogReadBuffer(xlrec->node, blkno, true);
+		Assert(BufferIsValid(buffer));
+		page = (Page) BufferGetPage(buffer);
+
+		PageInit(page, BufferGetPageSize(buffer), 0);
+	}
+	else
+	{
+		buffer = XLogReadBuffer(xlrec->node, blkno, false);
+		if (!BufferIsValid(buffer))
+			return;
+		page = (Page) BufferGetPage(buffer);
+
+		if (XLByteLE(lsn, PageGetLSN(page)))	/* changes are applied */
+		{
+			UnlockReleaseBuffer(buffer);
+			return;
+		}
+	}
+
+	for (i = 0; i < xlrec->ntuples; i++)
+	{
+		OffsetNumber offnum;
+		xl_multi_insert_tuple *xlhdr;
+
+		if (isinit)
+			offnum = FirstOffsetNumber + i;
+		else
+			offnum = xlrec->offsets[i];
+		if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+			elog(PANIC, "heap_multi_insert_redo: invalid max offset number");
+
+		xlhdr = (xl_multi_insert_tuple *) SHORTALIGN(recdata);
+		recdata += SizeOfMultiInsertTuple;
+
+		newlen = xlhdr->datalen;
+		Assert(newlen <= MaxHeapTupleSize);
+		htup = &tbuf.hdr;
+		MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
+		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+			   (char *) recdata,
+			   newlen);
+		recdata += newlen;
+
+		newlen += offsetof(HeapTupleHeaderData, t_bits);
+		htup->t_infomask2 = xlhdr->t_infomask2;
+		htup->t_infomask = xlhdr->t_infomask;
+		htup->t_hoff = xlhdr->t_hoff;
+		HeapTupleHeaderSetXmin(htup, record->xl_xid);
+		HeapTupleHeaderSetCmin(htup, FirstCommandId);
+		ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
+		ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+
+		offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
+		if (offnum == InvalidOffsetNumber)
+			elog(PANIC, "heap_multi_insert_redo: failed to add tuple");
+	}
+
+	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
+
+	PageSetLSN(page, lsn);
+	PageSetTLI(page, ThisTimeLineID);
+
+	if (xlrec->all_visible_cleared)
+		PageClearAllVisible(page);
+
+	MarkBufferDirty(buffer);
+	UnlockReleaseBuffer(buffer);
+
+	/*
+	 * If the page is running low on free space, update the FSM as well.
+	 * Arbitrarily, our definition of "low" is less than 20%. We can't do much
+	 * better than that without knowing the fill-factor for the table.
+	 *
+	 * XXX: We don't get here if the page was restored from full page image.
+	 * We don't bother to update the FSM in that case, it doesn't need to be
+	 * totally accurate anyway.
+	 */
+	if (freespace < BLCKSZ / 5)
+		XLogRecordPageWithFreeSpace(xlrec->node, blkno, freespace);
+}
+
+/*
  * Handles UPDATE and HOT_UPDATE
  */
 static void
@@ -5118,6 +5527,9 @@ heap2_redo(XLogRecPtr lsn, XLogRecord *record)
 		case XLOG_HEAP2_VISIBLE:
 			heap_xlog_visible(lsn, record);
 			break;
+		case XLOG_HEAP2_MULTI_INSERT:
+			heap_xlog_multi_insert(lsn, record);
+			break;
 		default:
 			elog(PANIC, "heap2_redo: unknown op code %u", info);
 	}
@@ -5255,6 +5667,18 @@ heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
 						 xlrec->node.spcNode, xlrec->node.dbNode,
 						 xlrec->node.relNode, xlrec->block);
 	}
+	else if (info == XLOG_HEAP2_MULTI_INSERT)
+	{
+		xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
+
+		if (xl_info & XLOG_HEAP_INIT_PAGE)
+			appendStringInfo(buf, "multi-insert (init): ");
+		else
+			appendStringInfo(buf, "multi-insert: ");
+		appendStringInfo(buf, "rel %u/%u/%u; blk %u; %d tuples",
+						 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode,
+						 xlrec->blkno, xlrec->ntuples);
+	}
 	else
 		appendStringInfo(buf, "UNKNOWN");
 }
diff --git a/src/backend/commands/copy.c b/src/backend/commands/copy.c
index a9c1980..0d75104 100644
--- a/src/backend/commands/copy.c
+++ b/src/backend/commands/copy.c
@@ -33,6 +33,7 @@
 #include "libpq/pqformat.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "optimizer/clauses.h"
 #include "optimizer/planner.h"
 #include "parser/parse_relation.h"
 #include "rewrite/rewriteHandler.h"
@@ -149,6 +150,7 @@ typedef struct CopyStateData
 	Oid		   *typioparams;	/* array of element types for in_functions */
 	int		   *defmap;			/* array of default att numbers */
 	ExprState **defexprs;		/* array of default att expressions */
+	bool		volatile_defexprs; /* is any of defexprs volatile? */
 
 	/*
 	 * These variables are used to reduce overhead in textual COPY FROM.
@@ -1842,11 +1844,17 @@ CopyFrom(CopyState cstate)
 	ExprContext *econtext;
 	TupleTableSlot *myslot;
 	MemoryContext oldcontext = CurrentMemoryContext;
+
 	ErrorContextCallback errcontext;
 	CommandId	mycid = GetCurrentCommandId(true);
 	int			hi_options = 0; /* start with default heap_insert options */
 	BulkInsertState bistate;
 	uint64		processed = 0;
+	bool		useHeapMultiInsert = true;
+	int			nBufferedTuples = 0;
+#define MAX_BUFFERED_TUPLES 1000
+	HeapTuple	bufferedTuples[MAX_BUFFERED_TUPLES];
+	Size		bufferedTuplesSize = 0;
 
 	Assert(cstate->rel);
 
@@ -1941,6 +1949,23 @@ CopyFrom(CopyState cstate)
 	/* Triggers might need a slot as well */
 	estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);
 
+	/*
+	 * It's more efficient to prepare a bunch of tuples for insertion, and
+	 * insert them in one heap_multi_insert() call, than call heap_insert()
+	 * separately for every tuple. However, we can't do that if there are
+	 * BEFORE/INSTEAD OF triggers, or we need to evaluate volatile default
+	 * expressions. Such triggers or expressions might query the table we're
+	 * inserting to, and act differently if the tuples that have already been
+	 * processed and prepared for insertion are not there.
+	 */
+	if ((resultRelInfo->ri_TrigDesc != NULL &&
+		(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
+		 resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
+		cstate->volatile_defexprs)
+	{
+		useHeapMultiInsert = false;
+	}
+
 	/* Prepare to catch AFTER triggers. */
 	AfterTriggerBeginQuery();
 
@@ -1972,8 +1997,15 @@ CopyFrom(CopyState cstate)
 
 		CHECK_FOR_INTERRUPTS();
 
-		/* Reset the per-tuple exprcontext */
-		ResetPerTupleExprContext(estate);
+		if (nBufferedTuples == 0)
+		{
+			/*
+			 * Reset the per-tuple exprcontext. We can only do this if the
+			 * tuple buffer is empty (calling the context the per-tuple memory
+			 * context is a bit of a misnomer now
+			 */
+			ResetPerTupleExprContext(estate);
+		}
 
 		/* Switch into its memory context */
 		MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
@@ -2016,18 +2048,69 @@ CopyFrom(CopyState cstate)
 			if (cstate->rel->rd_att->constr)
 				ExecConstraints(resultRelInfo, slot, estate);
 
-			/* OK, store the tuple and create index entries for it */
-			heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
+			if (useHeapMultiInsert)
+			{
+				/* Insert this tuple to the tuple buffer */
+				bufferedTuples[nBufferedTuples++] = tuple;
+				bufferedTuplesSize += tuple->t_len;
 
-			if (resultRelInfo->ri_NumIndices > 0)
-				recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
-													   estate);
+				/*
+				 * If the buffer filled up, flush it. Also flush if the total
+				 * size of all the tuples in the buffer becomes large, to
+				 * avoid using large amounts of memory for the buffers when
+				 * the tuples are exceptionally wide.
+				 */
+				if (nBufferedTuples == MAX_BUFFERED_TUPLES ||
+					bufferedTuplesSize > 65535)
+				{
+					/*
+					 * heap_multi_insert leaks memory, so switch into
+					 * short-lived memory context before calling it.
+					 */
+					MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+					heap_multi_insert(cstate->rel,
+									  bufferedTuples,
+									  nBufferedTuples,
+									  mycid,
+									  hi_options,
+									  bistate);
+					MemoryContextSwitchTo(oldcontext);
 
-			/* AFTER ROW INSERT Triggers */
-			ExecARInsertTriggers(estate, resultRelInfo, tuple,
-								 recheckIndexes);
+					/*
+					 * If there are any indexes, update the indexes for all the
+					 * inserted tuples.
+					 */
+					if (resultRelInfo->ri_NumIndices > 0)
+					{
+						int i;
+						for (i = 0; i < nBufferedTuples; i++)
+						{
+							ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
+							recheckIndexes = ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+																   estate);
+							list_free(recheckIndexes);
+						}
+					}
+
+					nBufferedTuples = 0;
+					bufferedTuplesSize = 0;
+				}
+			}
+			else
+			{
+				/* OK, store the tuple and create index entries for it */
+				heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
 
-			list_free(recheckIndexes);
+				if (resultRelInfo->ri_NumIndices > 0)
+					recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
+														   estate);
+
+				/* AFTER ROW INSERT Triggers */
+				ExecARInsertTriggers(estate, resultRelInfo, tuple,
+									 recheckIndexes);
+
+				list_free(recheckIndexes);
+			}
 
 			/*
 			 * We count only tuples not suppressed by a BEFORE INSERT trigger;
@@ -2038,6 +2121,34 @@ CopyFrom(CopyState cstate)
 		}
 	}
 
+	/* Flush any remaining buffered tuples */
+	if (nBufferedTuples > 0)
+	{
+		heap_multi_insert(cstate->rel,
+						  bufferedTuples,
+						  nBufferedTuples,
+						  mycid,
+						  hi_options,
+						  bistate);
+
+		/*
+		 * If there are any indexes, update the indexes for all the
+		 * inserted tuples.
+		 */
+		if (resultRelInfo->ri_NumIndices > 0)
+		{
+			int i;
+			for (i = 0; i < nBufferedTuples; i++)
+			{
+				List *recheckIndexes;
+				ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
+				recheckIndexes = ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+													   estate);
+				list_free(recheckIndexes);
+			}
+		}
+	}
+
 	/* Done, clean up */
 	error_context_stack = errcontext.previous;
 
@@ -2099,6 +2210,7 @@ BeginCopyFrom(Relation rel,
 	int		   *defmap;
 	ExprState **defexprs;
 	MemoryContext oldcontext;
+	bool		volatile_defexprs;
 
 	cstate = BeginCopy(true, rel, NULL, NULL, attnamelist, options);
 	oldcontext = MemoryContextSwitchTo(cstate->copycontext);
@@ -2122,6 +2234,7 @@ BeginCopyFrom(Relation rel,
 	attr = tupDesc->attrs;
 	num_phys_attrs = tupDesc->natts;
 	num_defaults = 0;
+	volatile_defexprs = false;
 
 	/*
 	 * Pick up the required catalog information for each attribute in the
@@ -2163,6 +2276,9 @@ BeginCopyFrom(Relation rel,
 								 expression_planner((Expr *) defexpr), NULL);
 				defmap[num_defaults] = attnum - 1;
 				num_defaults++;
+
+				if (!volatile_defexprs)
+					volatile_defexprs = contain_volatile_functions(defexpr);
 			}
 		}
 	}
@@ -2172,6 +2288,7 @@ BeginCopyFrom(Relation rel,
 	cstate->typioparams = typioparams;
 	cstate->defmap = defmap;
 	cstate->defexprs = defexprs;
+	cstate->volatile_defexprs = volatile_defexprs;
 	cstate->num_defaults = num_defaults;
 
 	if (pipe)
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index eb9adc8..4f1a6ab 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -1672,10 +1672,10 @@ add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
 }
 
 /*
- * pgstat_count_heap_insert - count a tuple insertion
+ * pgstat_count_heap_insert - count a tuple insertion of n tuples
  */
 void
-pgstat_count_heap_insert(Relation rel)
+pgstat_count_heap_insert(Relation rel, int n)
 {
 	PgStat_TableStatus *pgstat_info = rel->pgstat_info;
 
@@ -1688,7 +1688,7 @@ pgstat_count_heap_insert(Relation rel)
 			pgstat_info->trans->nest_level != nest_level)
 			add_tabstat_xact_level(pgstat_info, nest_level);
 
-		pgstat_info->trans->tuples_inserted++;
+		pgstat_info->trans->tuples_inserted += n;
 	}
 }
 
diff --git a/src/include/access/heapam.h b/src/include/access/heapam.h
index 776ea5c..dfbd276 100644
--- a/src/include/access/heapam.h
+++ b/src/include/access/heapam.h
@@ -97,6 +97,8 @@ extern void FreeBulkInsertState(BulkInsertState);
 
 extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			int options, BulkInsertState bistate);
+extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
+				  CommandId cid, int options, BulkInsertState bistate);
 extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
 			ItemPointer ctid, TransactionId *update_xmax,
 			CommandId cid, Snapshot crosscheck, bool wait);
diff --git a/src/include/access/htup.h b/src/include/access/htup.h
index c025835..2e87f22 100644
--- a/src/include/access/htup.h
+++ b/src/include/access/htup.h
@@ -608,6 +608,7 @@ typedef HeapTupleData *HeapTuple;
 /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
 #define XLOG_HEAP2_CLEANUP_INFO 0x30
 #define XLOG_HEAP2_VISIBLE		0x40
+#define XLOG_HEAP2_MULTI_INSERT	0x50
 
 /*
  * All what we need to find changed tuple
@@ -661,6 +662,36 @@ typedef struct xl_heap_insert
 
 #define SizeOfHeapInsert	(offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool))
 
+/*
+ * This is what we need to know about a multi-insert. The record consists of
+ * xl_heap_multi_insert header, followed by a xl_multi_insert_tuple and tuple
+ * data for each tuple. 'offsets' array is omitted if the whole page is
+ * reinitialized (XLOG_HEAP_INIT_PAGE)
+ */
+typedef struct xl_heap_multi_insert
+{
+	RelFileNode node;
+	BlockNumber	blkno;
+	bool		all_visible_cleared;
+	uint16		ntuples;
+	OffsetNumber offsets[1];
+
+	/* TUPLE DATA (xl_multi_insert_tuples) FOLLOW AT END OF STRUCT */
+} xl_heap_multi_insert;
+
+#define SizeOfHeapMultiInsert	offsetof(xl_heap_multi_insert, offsets)
+
+typedef struct xl_multi_insert_tuple
+{
+	uint16		datalen;				/* size of tuple data that follows */
+	uint16		t_infomask2;
+	uint16		t_infomask;
+	uint8		t_hoff;
+	/* TUPLE DATA FOLLOWS AT END OF STRUCT */
+} xl_multi_insert_tuple;
+
+#define SizeOfMultiInsertTuple	(offsetof(xl_multi_insert_tuple, t_hoff) + sizeof(uint8))
+
 /* This is what we need to know about update|hot_update */
 typedef struct xl_heap_update
 {
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 20c4d43..689b95f 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -764,7 +764,7 @@ extern void pgstat_initstats(Relation rel);
 			(rel)->pgstat_info->t_counts.t_blocks_hit++;			\
 	} while (0)
 
-extern void pgstat_count_heap_insert(Relation rel);
+extern void pgstat_count_heap_insert(Relation rel, int n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
 extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
#14Kohei KaiGai
kaigai@kaigai.gr.jp
In reply to: Heikki Linnakangas (#13)
Re: Inserting heap tuples in bulk in COPY

Hi Heikki,

I checked your patch, then I have a comment and two questions here.

The heap_prepare_insert() function seems to duplicate code from the earlier
half of the existing heap_insert(). I think it would be a good idea to
consolidate these portions of the code.

I'm not clear on why the argument to CheckForSerializableConflictIn() was
changed from the one used in heap_insert(). Its source code comment says:
* For a heap insert, we only need to check for table-level SSI locks.
* Our new tuple can't possibly conflict with existing tuple locks, and
* heap page locks are only consolidated versions of tuple locks; they do
* not lock "gaps" as index page locks do. So we don't need to identify
* a buffer before making the call.
*/
Is it possible for newly inserted tuples to conflict with existing tuple
locks once we expand insert to support multiple tuples at once?
It is a bit confusing to me.

This patch disallows multi-insertion not only when a before-row trigger is
configured, but also when volatile functions are used to compute a default
value. Is that a reasonable condition for avoiding multi-insertion?
All the datums delivered to heap_form_tuple() are calculated in
NextCopyFrom(), and the default values are also constructed there,
sequentially. So it seems to me there is no worry that volatile functions
would fail to be invoked for each tuple. Or am I missing something?

Thanks,

2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:

On 13.08.2011 17:33, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:

The patch is WIP, mainly because I didn't write the WAL replay routines
yet, but please let me know if you see any issues.

Why do you need new WAL replay routines?  Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL-size melts away with that.
And if the page you're copying to is not empty, logging the whole page is
even more expensive. You'd need to fall back to retail inserts in that case
which complicates the logic.

Also, I strongly object to turning regular heap_insert into a wrapper
around some other more complicated operation.  That will likely have
bad consequences for the performance of every other operation.

Ok. I doubt it makes any noticeable difference for performance, but having
done that, it's more readable, too, to duplicate the code.

Attached is a new version of the patch. It is now complete, including WAL
replay code.

--
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com


--
KaiGai Kohei <kaigai@kaigai.gr.jp>

#15Dean Rasheed
dean.a.rasheed@gmail.com
In reply to: Kohei KaiGai (#14)
Re: Inserting heap tuples in bulk in COPY

On 25 September 2011 09:43, Kohei KaiGai <kaigai@kaigai.gr.jp> wrote:

Hi Heikki,

I checked your patch, then I have a comment and two questions here.

2011/9/14 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:

Attached is a new version of the patch. It is now complete, including WAL
replay code.

Hi,

I had a quick look at the patch as well and spotted an oversight: the
multi-insert code branch fails to invoke AFTER ROW triggers.

Regards,
Dean

#16Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#13)
Re: Inserting heap tuples in bulk in COPY

On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Why do you need new WAL replay routines?  Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL-size melts away with that.
And if the page you're copying to is not empty, logging the whole page is
even more expensive. You'd need to fall back to retail inserts in that case
which complicates the logic.

Where does it go? I understand why it'd be a problem for partially
filled pages, but it seems like it ought to be efficient for pages
that are initially empty.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Kevin Grittner
Kevin.Grittner@wicourts.gov
In reply to: Robert Haas (#16)
Re: Inserting heap tuples in bulk in COPY

Kohei KaiGai wrote:

I'm not clear on why the argument to
CheckForSerializableConflictIn() was changed from the one used in
heap_insert().

The code was probably just based on heap_insert() before this recent
commit:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9d306c66e63eb7f45eab9475b3f96c3134bacac6

Is it possible for newly inserted tuples to conflict with existing
tuple locks once we expand insert to support multiple tuples at
once?

No.

-Kevin

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#16)
Re: Inserting heap tuples in bulk in COPY

On 25.09.2011 19:01, Robert Haas wrote:

On Wed, Sep 14, 2011 at 6:52 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Why do you need new WAL replay routines? Can't you just use the existing
XLOG_HEAP_NEWPAGE support?

By and large, I think we should be avoiding special-purpose WAL entries
as much as possible.

I tried that, but most of the reduction in WAL-size melts away with that.
And if the page you're copying to is not empty, logging the whole page is
even more expensive. You'd need to fall back to retail inserts in that case
which complicates the logic.

Where does it go? I understand why it'd be a problem for partially
filled pages, but it seems like it ought to be efficient for pages
that are initially empty.

A regular heap_insert record leaves out a lot of information that can be
deduced at replay time. It can leave out almost all of the tuple header,
and include just the null bitmap + data. In addition to that, there's just
the location of the tuple (RelFileNode+ItemPointer). At replay, xmin is
taken from the WAL record header.

For a multi-insert record, you don't even need to store the RelFileNode
and the block number for every tuple, just the offsets.

In comparison, a full-page image will include the full tuple header, and
also the line pointers. If I'm doing my math right, a full-page image
takes 25 bytes more data per tuple, than the special-purpose
multi-insert record.
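
Roughly, per tuple (this is just a sanity check against the struct layouts
above, not an exact measurement, and it glosses over alignment):

  in a full-page image:
      23 bytes  fixed HeapTupleHeaderData prefix (up to t_bits)
    +  4 bytes  ItemIdData line pointer
    + MAXALIGN padding of the whole tuple
  in a multi-insert record:
       7 bytes  xl_multi_insert_tuple (2+2+2+1), SHORTALIGNed
    +  2 bytes  OffsetNumber, unless the page is reinitialized

which lands in the same neighbourhood as that 25 bytes.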

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#18)
Re: Inserting heap tuples in bulk in COPY

On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

A regular heap_insert record leaves out a lot of information that can be
deduced at replay time. It can leave out almost all of the tuple header,
and include just the null bitmap + data. In addition to that, there's just
the location of the tuple (RelFileNode+ItemPointer). At replay, xmin is
taken from the WAL record header.

For a multi-insert record, you don't even need to store the RelFileNode and
the block number for every tuple, just the offsets.

In comparison, a full-page image will include the full tuple header, and
also the line pointers. If I'm doing my math right, a full-page image takes
25 bytes more data per tuple, than the special-purpose multi-insert record.

Interesting. It's always seemed to me fairly inefficient in general
to store the whole RelFileNode. For many people, the database and
tablespace OID will be constants, and even if they aren't, there
certainly aren't going to be 96 bits of entropy in the relfilenode. I
thought about whether we could create some sort of mapping layer,
where say once per checkpoint we'd allocate a 4-byte integer to denote
a relfilenode, and WAL-log that mapping. Then after that everyone
could just refer to the 4-byte integer instead of the whole
relfilenode. But it seems like a lot of work for 8 bytes per record.
Then again, if you're getting that much benefit from shaving off 25
bytes per tuple, maybe it is, although I feel like FPW is the elephant
in the room.
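
Just to make that hand-waving a bit more concrete, the mapping record I'm
imagining would be something like this (the name and the struct are invented
purely for illustration; nothing like it exists today):

    typedef struct xl_relfilenode_map
    {
        RelFileNode node;       /* full 12-byte identifier being mapped */
        uint32      shortid;    /* compact id, valid until re-mapped at the
                                 * next checkpoint */
    } xl_relfilenode_map;

Redo would rebuild a small shortid -> RelFileNode hash table from these
records, and every other WAL record could then carry the 4-byte shortid
instead of the full 12-byte RelFileNode.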

But I digress...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#20Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#19)
Re: Inserting heap tuples in bulk in COPY

On 06.10.2011 15:11, Robert Haas wrote:

On Thu, Oct 6, 2011 at 7:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

A regular heap_insert record leaves out a lot of information that can be
deduced at replay time. It can leave out almost all of the tuple header,
and include just the null bitmap + data. In addition to that, there's just
the location of the tuple (RelFileNode+ItemPointer). At replay, xmin is
taken from the WAL record header.

For a multi-insert record, you don't even need to store the RelFileNode and
the block number for every tuple, just the offsets.

In comparison, a full-page image will include the full tuple header, and
also the line pointers. If I'm doing my math right, a full-page image takes
25 bytes more data per tuple, than the special-purpose multi-insert record.

Interesting. It's always seemed to me fairly inefficient in general
to store the whole RelFileNode. For many people, the database and
tablespace OID will be constants, and even if they aren't, there
certainly aren't going to be 96 bits of entropy in the relfilenode. I
thought about whether we could create some sort of mapping layer,
where say once per checkpoint we'd allocate a 4-byte integer to denote
a relfilenode, and WAL-log that mapping. Then after that everyone
could just refer to the 4-byte integer instead of the whole
relfilenode. But it seems like a lot of work for 8 bytes per record.
Then again, if you're getting that much benefit from shaving off 25
bytes per tuple, maybe it is, although I feel like FPW is the elephant
in the room.

A very simple optimization would be to leave out tablespace OID
altogether if it's DEFAULTTABLESPACE_OID, and just set a flag somewhere.
Then again, we could also just compress the WAL wholesale.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dean Rasheed (#15)
1 attachment(s)
Re: Inserting heap tuples in bulk in COPY

On 25.09.2011 16:03, Dean Rasheed wrote:

On 25 September 2011 09:43, Kohei KaiGai<kaigai@kaigai.gr.jp> wrote:

Hi Heikki,

I checked your patch, then I have a comment and two questions here.

2011/9/14 Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>:

Attached is a new version of the patch. It is now complete, including WAL
replay code.

Hi,

I had a quick look at the patch as well and spotted an oversight: the
multi-insert code branch fails to invoke AFTER ROW triggers.

Thanks! Here's an updated version of the patch, fixing that, and all the
other issues pointed out this far.

I extracted the code that sets oid and tuple headers, and invokes the
toaster, into a new function that's shared by heap_insert() and
heap_multi_insert(). Tom objected to merging heap_insert() and
heap_multi_insert() into one complicated function, and I think he was
right on that, but sharing this code to prepare a tuple still makes
sense. IMHO it makes heap_insert() slightly more readable too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

copy-heap-multi-insert-3.patchtext/x-diff; name=copy-heap-multi-insert-3.patchDownload
*** a/src/backend/access/heap/heapam.c
--- b/src/backend/access/heap/heapam.c
***************
*** 24,29 ****
--- 24,30 ----
   *		heap_getnext	- retrieve next tuple in scan
   *		heap_fetch		- retrieve tuple with given tid
   *		heap_insert		- insert tuple into a relation
+  *		heap_multi_insert - insert multiple tuples into a relation
   *		heap_delete		- delete a tuple from a relation
   *		heap_update		- replace a tuple in a relation with another tuple
   *		heap_markpos	- mark scan position
***************
*** 79,84 **** static HeapScanDesc heap_beginscan_internal(Relation relation,
--- 80,87 ----
  						int nkeys, ScanKey key,
  						bool allow_strat, bool allow_sync,
  						bool is_bitmapscan);
+ static HeapTuple heap_prepare_insert(Relation relation, HeapTuple tup,
+ 					TransactionId xid, CommandId cid, int options);
  static XLogRecPtr log_heap_update(Relation reln, Buffer oldbuf,
  				ItemPointerData from, Buffer newbuf, HeapTuple newtup,
  				bool all_visible_cleared, bool new_all_visible_cleared);
***************
*** 1866,1920 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
- 	if (relation->rd_rel->relhasoids)
- 	{
- #ifdef NOT_USED
- 		/* this is redundant with an Assert in HeapTupleSetOid */
- 		Assert(tup->t_data->t_infomask & HEAP_HASOID);
- #endif
- 
- 		/*
- 		 * If the object id of this tuple has already been assigned, trust the
- 		 * caller.	There are a couple of ways this can happen.  At initial db
- 		 * creation, the backend program sets oids for tuples. When we define
- 		 * an index, we set the oid.  Finally, in the future, we may allow
- 		 * users to set their own object ids in order to support a persistent
- 		 * object store (objects need to contain pointers to one another).
- 		 */
- 		if (!OidIsValid(HeapTupleGetOid(tup)))
- 			HeapTupleSetOid(tup, GetNewOid(relation));
- 	}
- 	else
- 	{
- 		/* check there is not space for an OID */
- 		Assert(!(tup->t_data->t_infomask & HEAP_HASOID));
- 	}
- 
- 	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
- 	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
- 	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
- 	HeapTupleHeaderSetXmin(tup->t_data, xid);
- 	HeapTupleHeaderSetCmin(tup->t_data, cid);
- 	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* for cleanliness */
- 	tup->t_tableOid = RelationGetRelid(relation);
- 
  	/*
! 	 * If the new tuple is too big for storage or contains already toasted
! 	 * out-of-line attributes from some other relation, invoke the toaster.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	if (relation->rd_rel->relkind != RELKIND_RELATION)
! 	{
! 		/* toast table entries should never be recursively toasted */
! 		Assert(!HeapTupleHasExternal(tup));
! 		heaptup = tup;
! 	}
! 	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
! 		heaptup = toast_insert_or_update(relation, tup, NULL, options);
! 	else
! 		heaptup = tup;
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
--- 1869,1882 ----
  	Buffer		vmbuffer = InvalidBuffer;
  	bool		all_visible_cleared = false;
  
  	/*
! 	 * Fill in tuple header fields, assign an OID, and toast the tuple if
! 	 * necessary.
  	 *
  	 * Note: below this point, heaptup is the data we actually intend to store
  	 * into the relation; tup is the caller's original untoasted data.
  	 */
! 	heaptup = heap_prepare_insert(relation, tup, xid, cid, options);
  
  	/*
  	 * We're about to do the actual insert -- but check for conflict first,
***************
*** 2035,2041 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  	 */
  	CacheInvalidateHeapTuple(relation, heaptup, NULL);
  
! 	pgstat_count_heap_insert(relation);
  
  	/*
  	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
--- 1997,2003 ----
  	 */
  	CacheInvalidateHeapTuple(relation, heaptup, NULL);
  
! 	pgstat_count_heap_insert(relation, 1);
  
  	/*
  	 * If heaptup is a private copy, release it.  Don't forget to copy t_self
***************
*** 2051,2056 **** heap_insert(Relation relation, HeapTuple tup, CommandId cid,
--- 2013,2297 ----
  }
  
  /*
+  * Subroutine for heap_insert(). Prepares a tuple for insertion. This sets the
+  * tuple header fields, assigns an OID, and toasts the tuple if necessary.
+  * Returns a toasted version of the tuple if it was toasted, or the original
+  * tuple if not. Note that in any case, the header fields are also set in
+  * the original tuple.
+  */
+ static HeapTuple
+ heap_prepare_insert(Relation relation, HeapTuple tup, TransactionId xid,
+ 					CommandId cid, int options)
+ {
+ 	if (relation->rd_rel->relhasoids)
+ 	{
+ #ifdef NOT_USED
+ 		/* this is redundant with an Assert in HeapTupleSetOid */
+ 		Assert(tup->t_data->t_infomask & HEAP_HASOID);
+ #endif
+ 
+ 		/*
+ 		 * If the object id of this tuple has already been assigned, trust the
+ 		 * caller.	There are a couple of ways this can happen.  At initial db
+ 		 * creation, the backend program sets oids for tuples. When we define
+ 		 * an index, we set the oid.  Finally, in the future, we may allow
+ 		 * users to set their own object ids in order to support a persistent
+ 		 * object store (objects need to contain pointers to one another).
+ 		 */
+ 		if (!OidIsValid(HeapTupleGetOid(tup)))
+ 			HeapTupleSetOid(tup, GetNewOid(relation));
+ 	}
+ 	else
+ 	{
+ 		/* check there is not space for an OID */
+ 		Assert(!(tup->t_data->t_infomask & HEAP_HASOID));
+ 	}
+ 
+ 	tup->t_data->t_infomask &= ~(HEAP_XACT_MASK);
+ 	tup->t_data->t_infomask2 &= ~(HEAP2_XACT_MASK);
+ 	tup->t_data->t_infomask |= HEAP_XMAX_INVALID;
+ 	HeapTupleHeaderSetXmin(tup->t_data, xid);
+ 	HeapTupleHeaderSetCmin(tup->t_data, cid);
+ 	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* for cleanliness */
+ 	tup->t_tableOid = RelationGetRelid(relation);
+ 
+ 	/*
+ 	 * If the new tuple is too big for storage or contains already toasted
+ 	 * out-of-line attributes from some other relation, invoke the toaster.
+ 	 */
+ 	if (relation->rd_rel->relkind != RELKIND_RELATION)
+ 	{
+ 		/* toast table entries should never be recursively toasted */
+ 		Assert(!HeapTupleHasExternal(tup));
+ 		return tup;
+ 	}
+ 	else if (HeapTupleHasExternal(tup) || tup->t_len > TOAST_TUPLE_THRESHOLD)
+ 		return toast_insert_or_update(relation, tup, NULL, options);
+ 	else
+ 		return tup;
+ }
+ 
+ /*
+  *	heap_multi_insert	- insert multiple tuple into a heap
+  *
+  * This is like heap_insert(), but inserts multiple tuples in one operation.
+  * That's faster than calling heap_insert() in a loop, because when multiple
+  * tuples can be inserted on a single page, we can write just a single WAL
+  * record covering all of them, and only need to lock/unlock the page once.
+  *
+  * Note: this leaks memory into the current memory context. You can create a
+  * temporary context before calling this, if that's a problem.
+  */
+ void
+ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
+ 				  CommandId cid, int options, BulkInsertState bistate)
+ {
+ 	TransactionId xid = GetCurrentTransactionId();
+ 	HeapTuple  *heaptuples;
+ 	Buffer		buffer;
+ 	Buffer		vmbuffer = InvalidBuffer;
+ 	bool		all_visible_cleared = false;
+ 	int			i;
+ 	int			ndone;
+ 	char	   *scratch = NULL;
+ 	Page		page;
+ 	bool		needwal;
+ 
+ 	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
+ 
+ 	/* Toast and set header data in all the tuples */
+ 	heaptuples = palloc(ntuples * sizeof(HeapTuple));
+ 	for (i = 0; i < ntuples; i++)
+ 		heaptuples[i] = heap_prepare_insert(relation, tuples[i],
+ 											xid, cid, options);
+ 
+ 	/*
+ 	 * Allocate some memory to use for constructing the WAL record. Using
+ 	 * palloc() within a critical section is not safe, so we allocate this
+ 	 * beforehand.
+ 	 */
+ 	if (needwal)
+ 		scratch = palloc(BLCKSZ);
+ 
+ 	/*
+ 	 * We're about to do the actual inserts -- but check for conflict first,
+ 	 * to avoid possibly having to roll back work we've just done.
+ 	 *
+ 	 * For a heap insert, we only need to check for table-level SSI locks.
+ 	 * Our new tuple can't possibly conflict with existing tuple locks, and
+ 	 * heap page locks are only consolidated versions of tuple locks; they do
+ 	 * not lock "gaps" as index page locks do.  So we don't need to identify
+ 	 * a buffer before making the call.
+ 	 */
+ 	CheckForSerializableConflictIn(relation, NULL, InvalidBuffer);
+ 
+ 	ndone = 0;
+ 	while (ndone < ntuples)
+ 	{
+ 		int nthispage;
+ 
+ 		/*
+ 		 * Find buffer where at least the next tuple will fit.  If the page
+ 		 * is all-visible, this will also pin the requisite visibility map
+ 		 * page.
+ 		 */
+ 		buffer = RelationGetBufferForTuple(relation, heaptuples[ndone]->t_len,
+ 										   InvalidBuffer, options, bistate,
+ 										   &vmbuffer, NULL);
+ 		page = BufferGetPage(buffer);
+ 
+ 		if (PageIsAllVisible(page))
+ 		{
+ 			all_visible_cleared = true;
+ 			PageClearAllVisible(page);
+ 			visibilitymap_clear(relation,
+ 								BufferGetBlockNumber(buffer),
+ 								vmbuffer);
+ 		}
+ 
+ 		/* NO EREPORT(ERROR) from here till changes are logged */
+ 		START_CRIT_SECTION();
+ 
+ 		/* Put as many tuples as fit on this page */
+ 		for (nthispage = 0; ndone + nthispage < ntuples; nthispage++)
+ 		{
+ 			HeapTuple	heaptup = heaptuples[ndone + nthispage];
+ 
+ 			if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len))
+ 				break;
+ 
+ 			RelationPutHeapTuple(relation, buffer, heaptup);
+ 		}
+ 
+ 		/*
+ 		 * XXX Should we set PageSetPrunable on this page ? See heap_insert()
+ 		 */
+ 
+ 		MarkBufferDirty(buffer);
+ 
+ 		/* XLOG stuff */
+ 		if (needwal)
+ 		{
+ 			XLogRecPtr	recptr;
+ 			xl_heap_multi_insert *xlrec;
+ 			XLogRecData rdata[2];
+ 			uint8		info = XLOG_HEAP2_MULTI_INSERT;
+ 			char	   *tupledata;
+ 			int			totaldatalen;
+ 			char	   *scratchptr = scratch;
+ 			bool		init;
+ 
+ 			/*
+ 			 * If the page was previously empty, we can reinit the page
+ 			 * instead of restoring the whole thing.
+ 			 */
+ 			init = (ItemPointerGetOffsetNumber(&(heaptuples[ndone]->t_self)) == FirstOffsetNumber &&
+ 					PageGetMaxOffsetNumber(page) == FirstOffsetNumber + nthispage - 1);
+ 
+ 			/* allocate xl_heap_multi_insert struct from the scratch area */
+ 			xlrec = (xl_heap_multi_insert *) scratchptr;
+ 			scratchptr += SizeOfHeapMultiInsert;
+ 
+ 			/*
+ 			 * Allocate offsets array. Unless we're reinitializing the page,
+ 			 * in that case the tuples are stored in order starting at
+ 			 * FirstOffsetNumber and we don't need to store the offsets
+ 			 * explicitly.
+ 			 */
+ 			if (!init)
+ 				scratchptr += nthispage * sizeof(OffsetNumber);
+ 
+ 			/* the rest of the scratch space is used for tuple data */
+ 			tupledata = scratchptr;
+ 
+ 			xlrec->all_visible_cleared = all_visible_cleared;
+ 			xlrec->node = relation->rd_node;
+ 			xlrec->blkno = BufferGetBlockNumber(buffer);
+ 			xlrec->ntuples = nthispage;
+ 
+ 			/*
+ 			 * Write out an xl_multi_insert_tuple and the tuple data itself
+ 			 * for each tuple.
+ 			 */
+ 			for (i = 0; i < nthispage; i++)
+ 			{
+ 				HeapTuple	heaptup = heaptuples[ndone + i];
+ 				xl_multi_insert_tuple *tuphdr;
+ 				int			datalen;
+ 
+ 				if (!init)
+ 					xlrec->offsets[i] = ItemPointerGetOffsetNumber(&heaptup->t_self);
+ 				/* xl_multi_insert_tuple needs two-byte alignment. */
+ 				tuphdr = (xl_multi_insert_tuple *) SHORTALIGN(scratchptr);
+ 				scratchptr = ((char *) tuphdr) + SizeOfMultiInsertTuple;
+ 
+ 				tuphdr->t_infomask2 = heaptup->t_data->t_infomask2;
+ 				tuphdr->t_infomask = heaptup->t_data->t_infomask;
+ 				tuphdr->t_hoff = heaptup->t_data->t_hoff;
+ 
+ 				/* write bitmap [+ padding] [+ oid] + data */
+ 				datalen = heaptup->t_len - offsetof(HeapTupleHeaderData, t_bits);
+ 				memcpy(scratchptr,
+ 					   (char *) heaptup->t_data + offsetof(HeapTupleHeaderData, t_bits),
+ 					   datalen);
+ 				tuphdr->datalen = datalen;
+ 				scratchptr += datalen;
+ 			}
+ 			totaldatalen = scratchptr - tupledata;
+ 			Assert((scratchptr - scratch) < BLCKSZ);
+ 
+ 			rdata[0].data = (char *) xlrec;
+ 			rdata[0].len = tupledata - scratch;
+ 			rdata[0].buffer = InvalidBuffer;
+ 			rdata[0].next = &rdata[1];
+ 
+ 			rdata[1].data = tupledata;
+ 			rdata[1].len = totaldatalen;
+ 			rdata[1].buffer = buffer;
+ 			rdata[1].buffer_std = true;
+ 			rdata[1].next = NULL;
+ 
+ 			/*
+ 			 * If we're going to reinitialize the whole page using the WAL
+ 			 * record, hide buffer reference from XLogInsert.
+ 			 */
+ 			if (init)
+ 			{
+ 				rdata[1].buffer = InvalidBuffer;
+ 				info |= XLOG_HEAP_INIT_PAGE;
+ 			}
+ 
+ 			recptr = XLogInsert(RM_HEAP2_ID, info, rdata);
+ 
+ 			PageSetLSN(page, recptr);
+ 			PageSetTLI(page, ThisTimeLineID);
+ 		}
+ 
+ 		END_CRIT_SECTION();
+ 
+ 		UnlockReleaseBuffer(buffer);
+ 		if (vmbuffer != InvalidBuffer)
+ 			ReleaseBuffer(vmbuffer);
+ 
+ 		ndone += nthispage;
+ 	}
+ 
+ 	/*
+ 	 * If tuples are cachable, mark them for invalidation from the caches in
+ 	 * case we abort.  Note it is OK to do this after releasing the buffer,
+ 	 * because the heaptuples data structure is all in local memory, not in
+ 	 * the shared buffer.
+ 	 */
+ 	if (IsSystemRelation(relation))
+ 	{
+ 		for (i = 0; i < ntuples; i++)
+ 			CacheInvalidateHeapTuple(relation, heaptuples[i], NULL);
+ 	}
+ 
+ 	pgstat_count_heap_insert(relation, ntuples);
+ }
+ 
+ /*
   *	simple_heap_insert - insert a tuple
   *
   * Currently, this routine differs from heap_insert only in supplying
***************
*** 4738,4743 **** heap_xlog_insert(XLogRecPtr lsn, XLogRecord *record)
--- 4979,5122 ----
  }
  
  /*
+  * Handles MULTI_INSERT record type.
+  */
+ static void
+ heap_xlog_multi_insert(XLogRecPtr lsn, XLogRecord *record)
+ {
+ 	char	   *recdata = XLogRecGetData(record);
+ 	xl_heap_multi_insert *xlrec;
+ 	Buffer		buffer;
+ 	Page		page;
+ 	struct
+ 	{
+ 		HeapTupleHeaderData hdr;
+ 		char		data[MaxHeapTupleSize];
+ 	}			tbuf;
+ 	HeapTupleHeader htup;
+ 	uint32		newlen;
+ 	Size		freespace;
+ 	BlockNumber blkno;
+ 	int			i;
+ 	bool		isinit = (record->xl_info & XLOG_HEAP_INIT_PAGE) != 0;
+ 
+ 	xlrec = (xl_heap_multi_insert *) recdata;
+ 	recdata += SizeOfHeapMultiInsert;
+ 
+ 	/*
+ 	 * If we're reinitializing the page, the tuples are stored in order from
+ 	 * FirstOffsetNumber. Otherwise there's an array of offsets in the WAL
+ 	 * record.
+ 	 */
+ 	if (!isinit)
+ 		recdata += sizeof(OffsetNumber) * xlrec->ntuples;
+ 
+ 	blkno = xlrec->blkno;
+ 
+ 	/*
+ 	 * The visibility map may need to be fixed even if the heap page is
+ 	 * already up-to-date.
+ 	 */
+ 	if (xlrec->all_visible_cleared)
+ 	{
+ 		Relation	reln = CreateFakeRelcacheEntry(xlrec->node);
+ 		Buffer		vmbuffer = InvalidBuffer;
+ 
+ 		visibilitymap_pin(reln, blkno, &vmbuffer);
+ 		visibilitymap_clear(reln, blkno, vmbuffer);
+ 		ReleaseBuffer(vmbuffer);
+ 		FreeFakeRelcacheEntry(reln);
+ 	}
+ 
+ 	if (record->xl_info & XLR_BKP_BLOCK_1)
+ 		return;
+ 
+ 	if (isinit)
+ 	{
+ 		buffer = XLogReadBuffer(xlrec->node, blkno, true);
+ 		Assert(BufferIsValid(buffer));
+ 		page = (Page) BufferGetPage(buffer);
+ 
+ 		PageInit(page, BufferGetPageSize(buffer), 0);
+ 	}
+ 	else
+ 	{
+ 		buffer = XLogReadBuffer(xlrec->node, blkno, false);
+ 		if (!BufferIsValid(buffer))
+ 			return;
+ 		page = (Page) BufferGetPage(buffer);
+ 
+ 		if (XLByteLE(lsn, PageGetLSN(page)))	/* changes are applied */
+ 		{
+ 			UnlockReleaseBuffer(buffer);
+ 			return;
+ 		}
+ 	}
+ 
+ 	for (i = 0; i < xlrec->ntuples; i++)
+ 	{
+ 		OffsetNumber offnum;
+ 		xl_multi_insert_tuple *xlhdr;
+ 
+ 		if (isinit)
+ 			offnum = FirstOffsetNumber + i;
+ 		else
+ 			offnum = xlrec->offsets[i];
+ 		if (PageGetMaxOffsetNumber(page) + 1 < offnum)
+ 			elog(PANIC, "heap_multi_insert_redo: invalid max offset number");
+ 
+ 		xlhdr = (xl_multi_insert_tuple *) SHORTALIGN(recdata);
+ 		recdata += SizeOfMultiInsertTuple;
+ 
+ 		newlen = xlhdr->datalen;
+ 		Assert(newlen <= MaxHeapTupleSize);
+ 		htup = &tbuf.hdr;
+ 		MemSet((char *) htup, 0, sizeof(HeapTupleHeaderData));
+ 		/* PG73FORMAT: get bitmap [+ padding] [+ oid] + data */
+ 		memcpy((char *) htup + offsetof(HeapTupleHeaderData, t_bits),
+ 			   (char *) recdata,
+ 			   newlen);
+ 		recdata += newlen;
+ 
+ 		newlen += offsetof(HeapTupleHeaderData, t_bits);
+ 		htup->t_infomask2 = xlhdr->t_infomask2;
+ 		htup->t_infomask = xlhdr->t_infomask;
+ 		htup->t_hoff = xlhdr->t_hoff;
+ 		HeapTupleHeaderSetXmin(htup, record->xl_xid);
+ 		HeapTupleHeaderSetCmin(htup, FirstCommandId);
+ 		ItemPointerSetBlockNumber(&htup->t_ctid, blkno);
+ 		ItemPointerSetOffsetNumber(&htup->t_ctid, offnum);
+ 
+ 		offnum = PageAddItem(page, (Item) htup, newlen, offnum, true, true);
+ 		if (offnum == InvalidOffsetNumber)
+ 			elog(PANIC, "heap_multi_insert_redo: failed to add tuple");
+ 	}
+ 
+ 	freespace = PageGetHeapFreeSpace(page);		/* needed to update FSM below */
+ 
+ 	PageSetLSN(page, lsn);
+ 	PageSetTLI(page, ThisTimeLineID);
+ 
+ 	if (xlrec->all_visible_cleared)
+ 		PageClearAllVisible(page);
+ 
+ 	MarkBufferDirty(buffer);
+ 	UnlockReleaseBuffer(buffer);
+ 
+ 	/*
+ 	 * If the page is running low on free space, update the FSM as well.
+ 	 * Arbitrarily, our definition of "low" is less than 20%. We can't do much
+ 	 * better than that without knowing the fill-factor for the table.
+ 	 *
+ 	 * XXX: We don't get here if the page was restored from full page image.
+ 	 * We don't bother to update the FSM in that case, it doesn't need to be
+ 	 * totally accurate anyway.
+ 	 */
+ 	if (freespace < BLCKSZ / 5)
+ 		XLogRecordPageWithFreeSpace(xlrec->node, blkno, freespace);
+ }
+ 
+ /*
   * Handles UPDATE and HOT_UPDATE
   */
  static void
***************
*** 5126,5131 **** heap2_redo(XLogRecPtr lsn, XLogRecord *record)
--- 5505,5513 ----
  		case XLOG_HEAP2_VISIBLE:
  			heap_xlog_visible(lsn, record);
  			break;
+ 		case XLOG_HEAP2_MULTI_INSERT:
+ 			heap_xlog_multi_insert(lsn, record);
+ 			break;
  		default:
  			elog(PANIC, "heap2_redo: unknown op code %u", info);
  	}
***************
*** 5263,5268 **** heap2_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 5645,5662 ----
  						 xlrec->node.spcNode, xlrec->node.dbNode,
  						 xlrec->node.relNode, xlrec->block);
  	}
+ 	else if (info == XLOG_HEAP2_MULTI_INSERT)
+ 	{
+ 		xl_heap_multi_insert *xlrec = (xl_heap_multi_insert *) rec;
+ 
+ 		if (xl_info & XLOG_HEAP_INIT_PAGE)
+ 			appendStringInfo(buf, "multi-insert (init): ");
+ 		else
+ 			appendStringInfo(buf, "multi-insert: ");
+ 		appendStringInfo(buf, "rel %u/%u/%u; blk %u; %d tuples",
+ 						 xlrec->node.spcNode, xlrec->node.dbNode, xlrec->node.relNode,
+ 						 xlrec->blkno, xlrec->ntuples);
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
*** a/src/backend/commands/copy.c
--- b/src/backend/commands/copy.c
***************
*** 33,38 ****
--- 33,39 ----
  #include "libpq/pqformat.h"
  #include "mb/pg_wchar.h"
  #include "miscadmin.h"
+ #include "optimizer/clauses.h"
  #include "optimizer/planner.h"
  #include "parser/parse_relation.h"
  #include "rewrite/rewriteHandler.h"
***************
*** 149,154 **** typedef struct CopyStateData
--- 150,156 ----
  	Oid		   *typioparams;	/* array of element types for in_functions */
  	int		   *defmap;			/* array of default att numbers */
  	ExprState **defexprs;		/* array of default att expressions */
+ 	bool		volatile_defexprs; /* is any of defexprs volatile? */
  
  	/*
  	 * These variables are used to reduce overhead in textual COPY FROM.
***************
*** 277,282 **** static uint64 CopyTo(CopyState cstate);
--- 279,289 ----
  static void CopyOneRowTo(CopyState cstate, Oid tupleOid,
  			 Datum *values, bool *nulls);
  static uint64 CopyFrom(CopyState cstate);
+ static void WriteCopyFromBatch(CopyState cstate, EState *estate,
+ 				   CommandId mycid, int hi_options,
+ 				   ResultRelInfo *resultRelInfo, TupleTableSlot *myslot,
+ 				   BulkInsertState bistate,
+ 				   int nBufferedTuples, HeapTuple *bufferedTuples);
  static bool CopyReadLine(CopyState cstate);
  static bool CopyReadLineText(CopyState cstate);
  static int	CopyReadAttributesText(CopyState cstate);
***************
*** 1842,1852 **** CopyFrom(CopyState cstate)
--- 1849,1865 ----
  	ExprContext *econtext;
  	TupleTableSlot *myslot;
  	MemoryContext oldcontext = CurrentMemoryContext;
+ 
  	ErrorContextCallback errcontext;
  	CommandId	mycid = GetCurrentCommandId(true);
  	int			hi_options = 0; /* start with default heap_insert options */
  	BulkInsertState bistate;
  	uint64		processed = 0;
+ 	bool		useHeapMultiInsert;
+ 	int			nBufferedTuples = 0;
+ #define MAX_BUFFERED_TUPLES 1000
+ 	HeapTuple  *bufferedTuples;
+ 	Size		bufferedTuplesSize = 0;
  
  	Assert(cstate->rel);
  
***************
*** 1941,1946 **** CopyFrom(CopyState cstate)
--- 1954,1981 ----
  	/* Triggers might need a slot as well */
  	estate->es_trig_tuple_slot = ExecInitExtraTupleSlot(estate);
  
+ 	/*
+ 	 * It's more efficient to prepare a bunch of tuples for insertion, and
+ 	 * insert them in one heap_multi_insert() call, than call heap_insert()
+ 	 * separately for every tuple. However, we can't do that if there are
+ 	 * BEFORE/INSTEAD OF triggers, or we need to evaluate volatile default
+ 	 * expressions. Such triggers or expressions might query the table we're
+ 	 * inserting to, and act differently if the tuples that have already been
+ 	 * processed and prepared for insertion are not there.
+ 	 */
+ 	if ((resultRelInfo->ri_TrigDesc != NULL &&
+ 		(resultRelInfo->ri_TrigDesc->trig_insert_before_row ||
+ 		 resultRelInfo->ri_TrigDesc->trig_insert_instead_row)) ||
+ 		cstate->volatile_defexprs)
+ 	{
+ 		useHeapMultiInsert = false;
+ 	}
+ 	else
+ 	{
+ 		useHeapMultiInsert = true;
+ 		bufferedTuples = palloc(MAX_BUFFERED_TUPLES * sizeof(HeapTuple));
+ 	}
+ 
  	/* Prepare to catch AFTER triggers. */
  	AfterTriggerBeginQuery();
  
***************
*** 1972,1979 **** CopyFrom(CopyState cstate)
  
  		CHECK_FOR_INTERRUPTS();
  
! 		/* Reset the per-tuple exprcontext */
! 		ResetPerTupleExprContext(estate);
  
  		/* Switch into its memory context */
  		MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
--- 2007,2021 ----
  
  		CHECK_FOR_INTERRUPTS();
  
! 		if (nBufferedTuples == 0)
! 		{
! 			/*
! 			 * Reset the per-tuple exprcontext. We can only do this if the
! 			 * tuple buffer is empty (calling the context the per-tuple memory
! 			 * context is a bit of a misnomer now
! 			 */
! 			ResetPerTupleExprContext(estate);
! 		}
  
  		/* Switch into its memory context */
  		MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
***************
*** 2010,2033 **** CopyFrom(CopyState cstate)
  
  		if (!skip_tuple)
  		{
- 			List	   *recheckIndexes = NIL;
- 
  			/* Check the constraints of the tuple */
  			if (cstate->rel->rd_att->constr)
  				ExecConstraints(resultRelInfo, slot, estate);
  
! 			/* OK, store the tuple and create index entries for it */
! 			heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
  
! 			if (resultRelInfo->ri_NumIndices > 0)
! 				recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! 													   estate);
  
! 			/* AFTER ROW INSERT Triggers */
! 			ExecARInsertTriggers(estate, resultRelInfo, tuple,
! 								 recheckIndexes);
  
! 			list_free(recheckIndexes);
  
  			/*
  			 * We count only tuples not suppressed by a BEFORE INSERT trigger;
--- 2052,2100 ----
  
  		if (!skip_tuple)
  		{
  			/* Check the constraints of the tuple */
  			if (cstate->rel->rd_att->constr)
  				ExecConstraints(resultRelInfo, slot, estate);
  
! 			if (useHeapMultiInsert)
! 			{
! 				/* Add this tuple to the tuple buffer */
! 				bufferedTuples[nBufferedTuples++] = tuple;
! 				bufferedTuplesSize += tuple->t_len;
  
! 				/*
! 				 * If the buffer filled up, flush it. Also flush if the total
! 				 * size of all the tuples in the buffer becomes large, to
! 				 * avoid using large amounts of memory for the buffers when
! 				 * the tuples are exceptionally wide.
! 				 */
! 				if (nBufferedTuples == MAX_BUFFERED_TUPLES ||
! 					bufferedTuplesSize > 65535)
! 				{
! 					WriteCopyFromBatch(cstate, estate, mycid, hi_options,
! 									   resultRelInfo, myslot, bistate,
! 									   nBufferedTuples, bufferedTuples);
! 					nBufferedTuples = 0;
! 					bufferedTuplesSize = 0;
! 				}
! 			}
! 			else
! 			{
! 				List	   *recheckIndexes = NIL;
  
! 				/* OK, store the tuple and create index entries for it */
! 				heap_insert(cstate->rel, tuple, mycid, hi_options, bistate);
  
! 				if (resultRelInfo->ri_NumIndices > 0)
! 					recheckIndexes = ExecInsertIndexTuples(slot, &(tuple->t_self),
! 														   estate);
! 
! 				/* AFTER ROW INSERT Triggers */
! 				ExecARInsertTriggers(estate, resultRelInfo, tuple,
! 									 recheckIndexes);
! 
! 				list_free(recheckIndexes);
! 			}
  
  			/*
  			 * We count only tuples not suppressed by a BEFORE INSERT trigger;
***************
*** 2038,2043 **** CopyFrom(CopyState cstate)
--- 2105,2116 ----
  		}
  	}
  
+ 	/* Flush any remaining buffered tuples */
+ 	if (nBufferedTuples > 0)
+ 		WriteCopyFromBatch(cstate, estate, mycid, hi_options,
+ 						   resultRelInfo, myslot, bistate,
+ 						   nBufferedTuples, bufferedTuples);
+ 
  	/* Done, clean up */
  	error_context_stack = errcontext.previous;
  
***************
*** 2071,2076 **** CopyFrom(CopyState cstate)
--- 2144,2210 ----
  }
  
  /*
+  * A subroutine of CopyFrom, to write the current batch of buffered heap
+  * tuples to the heap. Also updates indexes and runs AFTER ROW INSERT
+  * triggers.
+  */
+ static void
+ WriteCopyFromBatch(CopyState cstate, EState *estate, CommandId mycid,
+ 				   int hi_options, ResultRelInfo *resultRelInfo,
+ 				   TupleTableSlot *myslot, BulkInsertState bistate,
+ 				   int nBufferedTuples, HeapTuple *bufferedTuples)
+ {
+ 	MemoryContext oldcontext;
+ 	int			i;
+ 
+ 	/*
+ 	 * heap_multi_insert leaks memory, so switch to short-lived memory
+ 	 * context before calling it.
+ 	 */
+ 	oldcontext = MemoryContextSwitchTo(GetPerTupleMemoryContext(estate));
+ 	heap_multi_insert(cstate->rel,
+ 					  bufferedTuples,
+ 					  nBufferedTuples,
+ 					  mycid,
+ 					  hi_options,
+ 					  bistate);
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	/*
+ 	 * If there are any indexes, update them for all the inserted tuples,
+ 	 * and run AFTER ROW INSERT triggers.
+ 	 */
+ 	if (resultRelInfo->ri_NumIndices > 0)
+ 	{
+ 		for (i = 0; i < nBufferedTuples; i++)
+ 		{
+ 			List *recheckIndexes;
+ 
+ 			ExecStoreTuple(bufferedTuples[i], myslot, InvalidBuffer, false);
+ 			recheckIndexes =
+ 				ExecInsertIndexTuples(myslot, &(bufferedTuples[i]->t_self),
+ 									  estate);
+ 			ExecARInsertTriggers(estate, resultRelInfo,
+ 								 bufferedTuples[i],
+ 								 recheckIndexes);
+ 			list_free(recheckIndexes);
+ 		}
+ 	}
+ 	/*
+ 	 * There's no indexes, but see if we need to run AFTER ROW INSERT triggers
+ 	 * anyway.
+ 	 */
+ 	else if (resultRelInfo->ri_TrigDesc != NULL &&
+ 			 resultRelInfo->ri_TrigDesc->trig_insert_after_row)
+ 	{
+ 		for (i = 0; i < nBufferedTuples; i++)
+ 			ExecARInsertTriggers(estate, resultRelInfo,
+ 								 bufferedTuples[i],
+ 								 NIL);
+ 	}
+ }
+ 
+ /*
   * Setup to read tuples from a file for COPY FROM.
   *
   * 'rel': Used as a template for the tuples
***************
*** 2099,2104 **** BeginCopyFrom(Relation rel,
--- 2233,2239 ----
  	int		   *defmap;
  	ExprState **defexprs;
  	MemoryContext oldcontext;
+ 	bool		volatile_defexprs;
  
  	cstate = BeginCopy(true, rel, NULL, NULL, attnamelist, options);
  	oldcontext = MemoryContextSwitchTo(cstate->copycontext);
***************
*** 2122,2127 **** BeginCopyFrom(Relation rel,
--- 2257,2263 ----
  	attr = tupDesc->attrs;
  	num_phys_attrs = tupDesc->natts;
  	num_defaults = 0;
+ 	volatile_defexprs = false;
  
  	/*
  	 * Pick up the required catalog information for each attribute in the
***************
*** 2163,2168 **** BeginCopyFrom(Relation rel,
--- 2299,2307 ----
  								 expression_planner((Expr *) defexpr), NULL);
  				defmap[num_defaults] = attnum - 1;
  				num_defaults++;
+ 
+ 				if (!volatile_defexprs)
+ 					volatile_defexprs = contain_volatile_functions(defexpr);
  			}
  		}
  	}
***************
*** 2172,2177 **** BeginCopyFrom(Relation rel,
--- 2311,2317 ----
  	cstate->typioparams = typioparams;
  	cstate->defmap = defmap;
  	cstate->defexprs = defexprs;
+ 	cstate->volatile_defexprs = volatile_defexprs;
  	cstate->num_defaults = num_defaults;
  
  	if (pipe)
*** a/src/backend/postmaster/pgstat.c
--- b/src/backend/postmaster/pgstat.c
***************
*** 1677,1686 **** add_tabstat_xact_level(PgStat_TableStatus *pgstat_info, int nest_level)
  }
  
  /*
!  * pgstat_count_heap_insert - count a tuple insertion
   */
  void
! pgstat_count_heap_insert(Relation rel)
  {
  	PgStat_TableStatus *pgstat_info = rel->pgstat_info;
  
--- 1677,1686 ----
  }
  
  /*
!  * pgstat_count_heap_insert - count a tuple insertion of n tuples
   */
  void
! pgstat_count_heap_insert(Relation rel, int n)
  {
  	PgStat_TableStatus *pgstat_info = rel->pgstat_info;
  
***************
*** 1693,1699 **** pgstat_count_heap_insert(Relation rel)
  			pgstat_info->trans->nest_level != nest_level)
  			add_tabstat_xact_level(pgstat_info, nest_level);
  
! 		pgstat_info->trans->tuples_inserted++;
  	}
  }
  
--- 1693,1699 ----
  			pgstat_info->trans->nest_level != nest_level)
  			add_tabstat_xact_level(pgstat_info, nest_level);
  
! 		pgstat_info->trans->tuples_inserted += n;
  	}
  }
  
*** a/src/include/access/heapam.h
--- b/src/include/access/heapam.h
***************
*** 97,102 **** extern void FreeBulkInsertState(BulkInsertState);
--- 97,104 ----
  
  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
  			int options, BulkInsertState bistate);
+ extern void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
+ 				  CommandId cid, int options, BulkInsertState bistate);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
  			CommandId cid, Snapshot crosscheck, bool wait);
*** a/src/include/access/htup.h
--- b/src/include/access/htup.h
***************
*** 608,613 **** typedef HeapTupleData *HeapTuple;
--- 608,614 ----
  /* 0x20 is free, was XLOG_HEAP2_CLEAN_MOVE */
  #define XLOG_HEAP2_CLEANUP_INFO 0x30
  #define XLOG_HEAP2_VISIBLE		0x40
+ #define XLOG_HEAP2_MULTI_INSERT	0x50
  
  /*
   * All what we need to find changed tuple
***************
*** 661,666 **** typedef struct xl_heap_insert
--- 662,697 ----
  
  #define SizeOfHeapInsert	(offsetof(xl_heap_insert, all_visible_cleared) + sizeof(bool))
  
+ /*
+  * This is what we need to know about a multi-insert. The record consists of
+  * xl_heap_multi_insert header, followed by a xl_multi_insert_tuple and tuple
+  * data for each tuple. 'offsets' array is omitted if the whole page is
+  * reinitialized (XLOG_HEAP_INIT_PAGE)
+  */
+ typedef struct xl_heap_multi_insert
+ {
+ 	RelFileNode node;
+ 	BlockNumber	blkno;
+ 	bool		all_visible_cleared;
+ 	uint16		ntuples;
+ 	OffsetNumber offsets[1];
+ 
+ 	/* TUPLE DATA (xl_multi_insert_tuples) FOLLOW AT END OF STRUCT */
+ } xl_heap_multi_insert;
+ 
+ #define SizeOfHeapMultiInsert	offsetof(xl_heap_multi_insert, offsets)
+ 
+ typedef struct xl_multi_insert_tuple
+ {
+ 	uint16		datalen;				/* size of tuple data that follows */
+ 	uint16		t_infomask2;
+ 	uint16		t_infomask;
+ 	uint8		t_hoff;
+ 	/* TUPLE DATA FOLLOWS AT END OF STRUCT */
+ } xl_multi_insert_tuple;
+ 
+ #define SizeOfMultiInsertTuple	(offsetof(xl_multi_insert_tuple, t_hoff) + sizeof(uint8))
+ 
  /* This is what we need to know about update|hot_update */
  typedef struct xl_heap_update
  {
*** a/src/include/pgstat.h
--- b/src/include/pgstat.h
***************
*** 766,772 **** extern void pgstat_initstats(Relation rel);
  			(rel)->pgstat_info->t_counts.t_blocks_hit++;			\
  	} while (0)
  
! extern void pgstat_count_heap_insert(Relation rel);
  extern void pgstat_count_heap_update(Relation rel, bool hot);
  extern void pgstat_count_heap_delete(Relation rel);
  extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
--- 766,772 ----
  			(rel)->pgstat_info->t_counts.t_blocks_hit++;			\
  	} while (0)
  
! extern void pgstat_count_heap_insert(Relation rel, int n);
  extern void pgstat_count_heap_update(Relation rel, bool hot);
  extern void pgstat_count_heap_delete(Relation rel);
  extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
#22Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#21)
Re: Inserting heap tuples in bulk in COPY

On Mon, Oct 24, 2011 at 7:46 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Thanks! Here's an updated version of the patch, fixing that, and all the
other issues pointed out this far.

I extracted the code that sets oid and tuple headers, and invokes the
toaster, into a new function that's shared by heap_insert() and
heap_multi_insert(). Tom objected to merging heap_insert() and
heap_multi_insert() into one complicated function, and I think he was right
on that, but sharing this code to prepare a tuple still makes sense. IMHO it
makes heap_insert() slightly more readable too.

Hi Heikki,

Thanks for this patch. Doing bulk copies in parallel for me is now
limited by the IO subsystem rather than the CPU.

This patch, commit number d326d9e8ea1d69, causes fillfactor to be
ignored for the copy command. Is this acceptable collateral damage?

This can be seen by using "pgbench -i -s50 -F50" to create the table,
combined with select pg_size_pretty(pg_table_size('pgbench_accounts')),
or by using the relation_free_space extension to pageinspect proposed
elsewhere.

Thanks,

Jeff

#23Jeff Janes
jeff.janes@gmail.com
In reply to: Jeff Janes (#22)
1 attachment(s)
Re: Inserting heap tuples in bulk in COPY

On Fri, Nov 25, 2011 at 12:53 PM, Jeff Janes <jeff.janes@gmail.com> wrote:

Hi Heikki,

Thanks for this patch.  Doing bulk copies in parallel for me is now
limited by the IO subsystem rather than the CPU.

This patch, commit number d326d9e8ea1d69, causes fillfactor to be
ignored for the copy command.  Is this acceptable collateral damage?

Having looked into it a little bit, I think this might be an acceptable fix.

Cheers,

Jeff

Attachments:

bulkwal_copy_1.patchapplication/octet-stream; name=bulkwal_copy_1.patchDownload
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 72ed632..88bb5e3 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -2096,6 +2096,8 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 	char	   *scratch = NULL;
 	Page		page;
 	bool		needwal;
+	Size	saveFreeSpace = RelationGetTargetPageFreeSpace(relation,
+										HEAP_DEFAULT_FILLFACTOR);
 
 	needwal = !(options & HEAP_INSERT_SKIP_WAL) && RelationNeedsWAL(relation);
 
@@ -2157,7 +2159,7 @@ heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
 		{
 			HeapTuple	heaptup = heaptuples[ndone + nthispage];
 
-			if (PageGetHeapFreeSpace(page) < MAXALIGN(heaptup->t_len))
+			if (PageGetHeapFreeSpace(page) - saveFreeSpace < MAXALIGN(heaptup->t_len))
 				break;
 
 			RelationPutHeapTuple(relation, buffer, heaptup);
#24Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Janes (#23)
Re: Inserting heap tuples in bulk in COPY

On 25.11.2011 23:32, Jeff Janes wrote:

On Fri, Nov 25, 2011 at 12:53 PM, Jeff Janes<jeff.janes@gmail.com> wrote:

Thanks for this patch. Doing bulk copies in parallel for me is now
limited by the IO subsystem rather than the CPU.

This patch, commit number d326d9e8ea1d69, causes fillfactor to be
ignored for the copy command. Is this acceptable collateral damage?

Having looked into it a little bit, I think this might be an acceptable fix.

Thanks, applied. It was an oversight.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#25Jeff Janes
jeff.janes@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Inserting heap tuples in bulk in COPY

On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.08.2011 00:17, Simon Riggs wrote:

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.

Sure, if you have indexes on the table already, then the overhead of
updating them is more significant. I am actually working on that too, I will
make a separate post about that.

Hi Heikki,

Is the bulk index insert still an active area for you?

If not, is there some kind of summary of design or analysis work
already done, which would be a useful for someone else interested in
picking it up?

Thanks,

Jeff

#26Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#25)
Re: Inserting heap tuples in bulk in COPY

On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:

On Fri, Aug 12, 2011 at 2:59 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.08.2011 00:17, Simon Riggs wrote:

Also, we discussed that you would work on buffering the index inserts,
which is where the main problem lies. The main heap is only a small
part of the overhead if we have multiple indexes already built on a
table - which is the use case that causes the most problem.

Sure, if you have indexes on the table already, then the overhead of
updating them is more significant. I am actually working on that too, I will
make a separate post about that.

Hi Heikki,

Is the bulk index insert still an active area for you?

If not, is there some kind of summary of design or analysis work
already done, which would be a useful for someone else interested in
picking it up?

The main cost comes from repeated re-seeking down the index tree to
find the insertion point, but repeated lock and pin operations on
index buffers are also high. That is optimisable if the index inserts
are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
with partial monotonic (i.e. with Parent/Child relationship, on Child
table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
Order/OrderLine).

All we need do is buffer the index inserts before inserting them into the
main index. As long as we flush the buffer before end of transaction, we're
good.

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

Suggested implementation is to buffer index tuples at the generic
index API, then implement a bulk insert index API call that can then
be implemented for each AM separately. Suggested buffer size is
work_mem. Note we must extract index tuple from heap tuples -
refetching heap blocks to get rows is too costly.
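
As a very rough sketch of the shape I have in mind (the struct and the
callback signature are invented for illustration; no such API exists yet):

    /* hypothetical buffer kept by the generic index code */
    typedef struct IndexInsertBuffer
    {
        IndexTuple      *itups;  /* index tuples built from the heap tuples */
        ItemPointerData *htids;  /* corresponding heap TIDs */
        int              nitups; /* number of buffered entries */
        Size             used;   /* flush once this approaches work_mem */
    } IndexInsertBuffer;

    /* hypothetical per-AM entry point, called once per buffered batch */
    typedef void (*ambulkinsert_function) (Relation indexRelation,
                                           IndexInsertBuffer *buf,
                                           Relation heapRelation,
                                           IndexUniqueCheck checkUnique);

The generic code would fill the buffer as heap tuples go in and call the
AM's bulk-insert function at flush time, and in any case before end of
transaction at the latest.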

I think we need to implement buffering both to end of statement or end
of transaction, not just one or the other.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#27Jeff Janes
jeff.janes@gmail.com
In reply to: Simon Riggs (#26)
Re: Inserting heap tuples in bulk in COPY

On Tue, Aug 7, 2012 at 1:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:

Hi Heikki,

Is the bulk index insert still an active area for you?

If not, is there some kind of summary of design or analysis work
already done, which would be a useful for someone else interested in
picking it up?

The main cost comes from repeated re-seeking down the index tree to
find the insertion point, but repeated lock and pin operations on
index buffers are also high. That is optimisable if the index inserts
are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
with partial monotonic (i.e. with Parent/Child relationship, on Child
table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
Order/OrderLine).

All we need do is buffer the index inserts before inserting them into the
main index. As long as we flush the buffer before end of transaction, we're
good.

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

Suggested implementation is to buffer index tuples at the generic
index API, then implement a bulk insert index API call that can then
be implemented for each AM separately. Suggested buffer size is
work_mem. Note we must extract index tuple from heap tuples -
refetching heap blocks to get rows is too costly.

OK, thanks for the summary. It looks like your plans are to improve
the case where the index maintenance is already CPU limited. I was
more interested in cases where the regions of the index(es) undergoing
active insertion do not fit into usable RAM, so the limit is the
random IO needed to fetch the leaf pages in order to update them or to
write them out once dirtied. Here too buffering is probably the
answer, but the size of the buffer needed to turn that IO from
effectively random to effectively sequential is probably much larger
than the size of the buffer you are contemplating.
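
To put some very rough numbers on that: if the actively-inserted region of
an index is, say, 10 GB of leaf pages and a buffered entry is ~20 bytes,
then to get even ~10 entries per leaf page per flush (about the point where
the writes start to look sequential rather than random) you would need to
buffer roughly (10 GB / 8 kB) * 10 * 20 bytes, or around 250 MB, which is
far beyond a typical work_mem. Back-of-the-envelope only, of course.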

I think we need to implement buffering both to end of statement or end
of transaction, not just one or the other.

With the planner deciding which would be better, or explicit user action?

Thanks,

Jeff

#28Simon Riggs
simon@2ndQuadrant.com
In reply to: Jeff Janes (#27)
Re: Inserting heap tuples in bulk in COPY

On 8 August 2012 03:44, Jeff Janes <jeff.janes@gmail.com> wrote:

On Tue, Aug 7, 2012 at 1:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 7 August 2012 20:58, Jeff Janes <jeff.janes@gmail.com> wrote:

Hi Heikki,

Is the bulk index insert still an active area for you?

If not, is there some kind of summary of design or analysis work
already done, which would be a useful for someone else interested in
picking it up?

The main cost comes from repeated re-seeking down the index tree to
find the insertion point, but repeated lock and pin operations on
index buffers are also high. That is optimisable if the index inserts
are grouped, as they will be with monotonic indexes, (e.g. SERIAL), or
with partial monotonic (i.e. with Parent/Child relationship, on Child
table many inserts will be of the form (x,1), (x,2), (x, 3) etc, e.g.
Order/OrderLine).

All we need do is buffer the index inserts before inserting them into the
main index. As long as we flush the buffer before end of transaction, we're
good.

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

The suggested implementation is to buffer index tuples at the generic
index API, then add a bulk-insert index API call that each AM can
implement separately. The suggested buffer size is work_mem. Note we
must extract the index tuples from the heap tuples as we buffer them -
refetching heap blocks later to get the rows would be too costly.

OK, thanks for the summary. It looks like your plans are to improve
the case where the index maintenance is already CPU limited. I was
more interested in cases where the regions of the index(es) undergoing
active insertion do not fit into usable RAM, so the limit is the
random IO needed to fetch the leaf pages in order to update them or to
write them out once dirtied. Here too buffering is probably the
answer, but the size of the buffer needed to turn that IO from
effectively random to effectively sequential is probably much larger
than the size of the buffer you are contemplating.

The buffer size can be variable, yes. I was imagining a mechanism that
worked for normal INSERTs as well as COPY. Perhaps we could say the
buffer is work_mem with INSERT and maintenance_work_mem with COPY.
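
(A sketch of that sizing rule, assuming a hypothetical for_copy flag
supplied by the caller; work_mem and maintenance_work_mem are the
existing GUCs, measured in kilobytes:)

#include "postgres.h"
#include "miscadmin.h"      /* work_mem, maintenance_work_mem */

/*
 * Hypothetical helper: choose the index-insert buffer limit.  The
 * for_copy parameter is invented for this sketch; the GUCs are real
 * and are measured in kilobytes.
 */
static Size
index_buffer_limit(bool for_copy)
{
    int     limit_kb = for_copy ? maintenance_work_mem : work_mem;

    return (Size) limit_kb * 1024L;
}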

Very large index appends are useful, but currently not very practical
because of the transactional nature of COPY. If we could reject
individual rows without raising an ERROR, it would be more usable.

I'm not planning to work on this myself, so these comments are
offered for your assistance.

I think we need to implement buffering both to end of statement and
to end of transaction, not just one or the other.

With the planner deciding which would be better, or explicit user action?

Probably both: on/off/choose.

Deferring the unique check would change the point at which errors are
reported in a transaction, which might not be desirable for some. I
think the SQL standard has something to say about this too, so that
needs care. But in general, if your tables use generated PK values
they should be able to benefit from this, so I would suggest a
default setting of 'choose'.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#29Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#26)
Re: Inserting heap tuples in bulk in COPY

On Tue, Aug 7, 2012 at 4:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

We've had deferrable unique constraints since 9.0, courtesy of Dean Rasheed.

I think we need to implement buffering both to end of statement and
to end of transaction, not just one or the other.

Another (not necessarily better) idea is to use a buffer that's part
of the index, like the GIN fastupdate stuff, so that there's no
particular constraint on when the buffer has to be flushed, but
competing index scans may be slower until it is.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#29)
Re: Inserting heap tuples in bulk in COPY

On 8 August 2012 20:34, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Aug 7, 2012 at 4:52 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Incidentally, we can also optimise repeated inserts within a normal
transaction using this method, by implementing deferred unique
constraints. At present we say that unique constraints aren't
deferrable, but there's no reason they can't be, if we allow buffering
in the index. (Implementing a deferred constraint that won't fail if
we do UPDATE foo SET pk = pk+1 requires a buffer the size of foo,
which is probably a bad plan, plus you'd need to sort the inputs, so
that particular nut is another issue altogether, AFAICS).

We've had deferrable unique constraints since 9.0, courtesy of Dean Rasheed.

Yeah, but IIRC there was some issue, which I can't seem to find the
details of, about it not working quite right in production. Not sure now.

I think we need to implement buffering both to end of statement and
to end of transaction, not just one or the other.

Another (not necessarily better) idea is to use a buffer that's part
of the index, like the GIN fastupdate stuff, so that there's no
particular constraint on when the buffer has to be flushed, but
competing index scans may be slower until it is.

I think that works very well for non-unique indexes, though it does
increase WAL traffic since you need to do a double hop into the index.
It's not possible for unique constraint/PK indexes, since they need to
fail the transaction in case of duplicates.

The buffer I was imagining would be a private buffer within a
transaction, so it wouldn't increase WAL.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#31Jesper Krogh
jesper@krogh.cc
In reply to: Robert Haas (#29)
Re: Inserting heap tuples in bulk in COPY

On 08/08/12 21:34, Robert Haas wrote:

I think we need to implement buffering both to end of statement and
to end of transaction, not just one or the other.

Another (not necessarily better) idea is to use a buffer that's part
of the index, like the GIN fastupdate stuff, so that there's no
particular constraint on when the buffer has to be flushed, but
competing index scans may be slower until it is.

Whether it is an implementation artifact or an inherent result of
this approach I don't know, but currently, when the GIN fastupdate
code finally decides to "flush" the buffer, it stalls all other
processes doing updates while it does so. If you only have one
updating process this doesn't matter, but if you're trying to get
user-interactive updates to flow in alongside batch updates from
background processes, then you'd better kill off this feature, since
the user-interactive process is guaranteed either to flush the buffer
itself or to wait on someone else doing it.

I haven't done the benchmarking, but I'm actually fairly sure that
fastupdate isn't faster overall if you bump the concurrency slightly
and run on memory- or SSD-based backends, due to this cross-backend
contention on the buffer.

A buffer that is backend-local, so you can use transactions to batch
up changes, would get around this, but that may have another huge set
of consequences I don't know of.

... based on my own real-world experience with this feature.

#32Robert Haas
robertmhaas@gmail.com
In reply to: Jesper Krogh (#31)
Re: Inserting heap tuples in bulk in COPY

On Thu, Aug 9, 2012 at 2:59 AM, Jesper Krogh <jesper@krogh.cc> wrote:

Whether it is an implementation artifact or an inherent result of
this approach I don't know, but currently, when the GIN fastupdate
code finally decides to "flush" the buffer, it stalls all other
processes doing updates while it does so. If you only have one
updating process this doesn't matter, but if you're trying to get
user-interactive updates to flow in alongside batch updates from
background processes, then you'd better kill off this feature, since
the user-interactive process is guaranteed either to flush the buffer
itself or to wait on someone else doing it.

I haven't done the benchmarking, but I'm actually fairly sure that
fastupdate isn't faster overall if you bump the concurrency slightly
and run on memory- or SSD-based backends, due to this cross-backend
contention on the buffer.

Yeah, I've noticed that there are some things that are a little wonky
about GIN fastupdate. On the other hand, I believe that MySQL has
something along these lines called secondary index buffering which
apparently does very good things for random I/O. I am not sure of the
details or the implementation, though.

A buffer that is backend-local, so you can use transactions to batch
up changes, would get around this, but that may have another huge set
of consequences I don't know of.

... based on my own real-world experience with this feature.

Well, the main thing to worry about is transactional consistency. If
a backend which has postponed doing the index inserts does an index
scan after the command counter has been bumped, it'll see
inconsistent results. We could avoid that by only using the
optimization when some set of sanity checks passes and doing the
deferred inserts at the end of the statement, or something like that.
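
(A very rough sketch of what such a sanity check might look like; the
function name is invented, the actual set of conditions would need
real analysis, and trigdesc and relhasrules are existing
relcache/catalog fields:)

#include "postgres.h"
#include "utils/rel.h"

/*
 * Hypothetical guard: only defer index inserts when nothing running
 * inside the same statement is likely to scan the target table's
 * indexes before the deferred tuples are flushed.  The real list of
 * conditions would need more thought; triggers and rewrite rules are
 * just the obvious suspects.
 */
static bool
deferred_index_inserts_safe(Relation heapRel)
{
    if (heapRel->trigdesc != NULL)      /* triggers may query the table */
        return false;
    if (heapRel->rd_rel->relhasrules)   /* rewrite rules likewise */
        return false;

    return true;
}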

The other tricky part is figuring out how to actually get a
performance improvement out of it. I think Simon's probably right
that a lot of the cost is in repeatedly walking the btree, looking up
and pinning/unpinning/locking/unlocking buffers along the way. Maybe
we could sort the data in index order, walk down to the first
insertion point, and then insert as many tuples in a row as precede the
next key in the index. Then lather, rinse, repeat. If you're
actually just adding everything at the tail of the index, this ought
to work pretty well. But if the inserts are all over the place it
seems like it might not be any better, or actually a little worse.
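
(For what it's worth, a sketch of that sort-then-insert-runs idea;
bt_descend_to_leaf() and bt_leaf_insert_fitting() are hypothetical
stand-ins for the existing nbtree internals, with the assumed
semantics spelled out in the comments:)

#include "postgres.h"

#include "access/itup.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"

/*
 * Hypothetical helpers standing in for nbtree internals:
 *
 * bt_descend_to_leaf() walks down from the root and returns the
 * pinned, locked leaf buffer on which itup's key currently belongs.
 *
 * bt_leaf_insert_fitting() inserts itup into that leaf (splitting if
 * necessary) and returns true, or returns false without inserting if
 * the key now belongs on a page further to the right, in which case
 * the caller should re-descend.
 */
extern Buffer bt_descend_to_leaf(Relation index, IndexTuple itup);
extern bool bt_leaf_insert_fitting(Relation index, Buffer leaf, IndexTuple itup);

/*
 * Insert an array of index tuples that has already been sorted in
 * index key order.  Each descent from the root handles a whole run of
 * adjacent keys, which is where the win comes from when the input is
 * (mostly) monotonic.
 */
static void
bt_bulk_insert_sorted(Relation index, IndexTuple *itups, int nitups)
{
    int         i = 0;

    while (i < nitups)
    {
        Buffer      leaf = bt_descend_to_leaf(index, itups[i]);

        /* Insert as many consecutive tuples as belong on this leaf. */
        while (i < nitups && bt_leaf_insert_fitting(index, leaf, itups[i]))
            i++;

        UnlockReleaseBuffer(leaf);
        /* The next tuple (if any) needs a fresh descent. */
    }
}

Under those assumptions each descent covers a whole run of adjacent
keys, which is the add-at-the-tail case; scattered inserts degenerate
to one descent per tuple, matching the concern above.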

Of course it's probably premature to speculate too much until someone
actually codes something up and tests it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company