COPY with no WAL, in certain circumstances

Started by Simon Riggs about 19 years ago (39 messages)
#1Simon Riggs
simon@2ndquadrant.com
1 attachment(s)

http://archives.postgresql.org/pgsql-hackers/2006-10/msg01172.php

As discussed on -hackers, it's possible to avoid writing any WAL at all
for COPY in these circumstances:

BEGIN;
CREATE TABLE foo..
COPY foo...
COMMIT;

BEGIN;
TRUNCATE foo..
COPY foo...
COMMIT;

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

This plays nicely with the --single-transaction option in psql to allow
fast restores/upgrades.

YMMV, but disk-bound COPY will benefit greatly from this patch, with some
tests showing a 100% gain. COPY is still *very* CPU intensive, so some
tests have shown negligible benefit, FYI, but that isn't the typical
case.
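
As a rough sketch of where the gain shows up (the table and data file
names here are purely illustrative), compare a plain COPY with one whose
TRUNCATE runs in the same transaction:

COPY foo FROM '/tmp/foo.dat';	--WAL-logged as usual

BEGIN;
TRUNCATE foo;
COPY foo FROM '/tmp/foo.dat';	--no WAL (archiving off); table is fsync'd at end of statement instead
COMMIT;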

Applies cleanly to CVS HEAD, passes make check.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachments:

copy_nowal.v1.patch (text/x-patch; charset=UTF-8)
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.222
diff -c -r1.222 heapam.c
*** src/backend/access/heap/heapam.c	17 Nov 2006 18:00:14 -0000	1.222
--- src/backend/access/heap/heapam.c	4 Jan 2007 15:24:03 -0000
***************
*** 28,33 ****
--- 28,34 ----
   *		heap_update		- replace a tuple in a relation with another tuple
   *		heap_markpos	- mark scan position
   *		heap_restrpos	- restore position to marked location
+  *	  heap_sync	   - sync heap, for when no WAL has been written
   *
   * NOTES
   *	  This file contains the heap_ routines which implement
***************
*** 50,55 ****
--- 51,57 ----
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "storage/procarray.h"
+ #include "storage/smgr.h"
  #include "utils/inval.h"
  #include "utils/lsyscache.h"
  #include "utils/relcache.h"
***************
*** 1357,1364 ****
   * non-temp relation.  Safe usage of this behavior requires that we arrange
   * that all new tuples go into new pages not containing any tuples from other
   * transactions, that the relation gets fsync'd before commit, and that the
!  * transaction emits at least one WAL record to ensure RecordTransactionCommit
!  * will decide to WAL-log the commit.
   *
   * use_fsm is passed directly to RelationGetBufferForTuple, which see for
   * more info.
--- 1359,1367 ----
   * non-temp relation.  Safe usage of this behavior requires that we arrange
   * that all new tuples go into new pages not containing any tuples from other
   * transactions, that the relation gets fsync'd before commit, and that the
!  * transaction emits at least one WAL record or must mark 
!  * MyXactMadeXLogEntry=true to ensure RecordTransactionCommit will
!  * decide to WAL-log the commit. (see heap_sync() comments also)
   *
   * use_fsm is passed directly to RelationGetBufferForTuple, which see for
   * more info.
***************
*** 1371,1377 ****
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool use_fsm)
  {
  	TransactionId xid = GetCurrentTransactionId();
  	HeapTuple	heaptup;
--- 1374,1380 ----
   */
  Oid
  heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool use_fsm, bool force_committed)
  {
  	TransactionId xid = GetCurrentTransactionId();
  	HeapTuple	heaptup;
***************
*** 1408,1413 ****
--- 1411,1419 ----
  	HeapTupleHeaderSetXmax(tup->t_data, 0);		/* zero out Datum fields */
  	HeapTupleHeaderSetCmax(tup->t_data, 0);		/* for cleanliness */
  	tup->t_tableOid = RelationGetRelid(relation);
+	 if (force_committed)
+	 	tup->t_data->t_infomask |= HEAP_XMIN_COMMITTED;
+		 
  
  	/*
  	 * If the new tuple is too big for storage or contains already toasted
***************
*** 1418,1424 ****
  	 */
  	if (HeapTupleHasExternal(tup) ||
  		(MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD))
! 		heaptup = toast_insert_or_update(relation, tup, NULL);
  	else
  		heaptup = tup;
  
--- 1424,1430 ----
  	 */
  	if (HeapTupleHasExternal(tup) ||
  		(MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD))
! 		heaptup = toast_insert_or_update(relation, tup, NULL, use_wal);
  	else
  		heaptup = tup;
  
***************
*** 1532,1538 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
! 	return heap_insert(relation, tup, GetCurrentCommandId(), true, true);
  }
  
  /*
--- 1538,1556 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
! 	return heap_insert(relation, tup, GetCurrentCommandId(), true, true, false);
! }
! 
! /*
!  *	fast_heap_insert - insert a tuple with options to improve speed
!  *
!  * Currently, this routine allows specifying additional options for speed
!  * in certain cases, such as WAL-avoiding COPY command
!  */
! Oid
! fast_heap_insert(Relation relation, HeapTuple tup, bool use_wal, bool force_committed)
! {
! 	return heap_insert(relation, tup, GetCurrentCommandId(), use_wal, use_wal, force_committed);
  }
  
  /*
***************
*** 2086,2096 ****
  		 *
  		 * Note: below this point, heaptup is the data we actually intend to
  		 * store into the relation; newtup is the caller's original untoasted
! 		 * data.
  		 */
  		if (need_toast)
  		{
! 			heaptup = toast_insert_or_update(relation, newtup, &oldtup);
  			newtupsize = MAXALIGN(heaptup->t_len);
  		}
  		else
--- 2104,2114 ----
  		 *
  		 * Note: below this point, heaptup is the data we actually intend to
  		 * store into the relation; newtup is the caller's original untoasted
! 		 * data. (We always use WAL for toast table updates.)
  		 */
  		if (need_toast)
  		{
! 			heaptup = toast_insert_or_update(relation, newtup, &oldtup, true);
  			newtupsize = MAXALIGN(heaptup->t_len);
  		}
  		else
***************
*** 3966,3968 ****
--- 3984,4007 ----
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
+ 
+ /* ----------------
+  *		heap_sync - sync a heap, for use when no WAL has been written
+  *
+  * ----------------
+  */
+ void
+ heap_sync(Relation rel)
+ {
+	 if (!rel->rd_istemp)
+	 {
+		 /*
+		  * If we skipped using WAL, and it's not a temp relation,
+		  * we must force the relation down to disk before it's
+		  * safe to commit the transaction.  This requires forcing
+		  * out any dirty buffers and then doing a forced fsync.
+		  */
+		 FlushRelationBuffers(rel);
+		 smgrimmedsync(rel->rd_smgr);
+	 }
+ }
Index: src/backend/access/heap/tuptoaster.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/heap/tuptoaster.c,v
retrieving revision 1.66
diff -c -r1.66 tuptoaster.c
*** src/backend/access/heap/tuptoaster.c	5 Oct 2006 23:33:33 -0000	1.66
--- src/backend/access/heap/tuptoaster.c	4 Jan 2007 15:24:04 -0000
***************
*** 42,48 ****
  #undef TOAST_DEBUG
  
  static void toast_delete_datum(Relation rel, Datum value);
! static Datum toast_save_datum(Relation rel, Datum value);
  static varattrib *toast_fetch_datum(varattrib *attr);
  static varattrib *toast_fetch_datum_slice(varattrib *attr,
  						int32 sliceoffset, int32 length);
--- 42,48 ----
  #undef TOAST_DEBUG
  
  static void toast_delete_datum(Relation rel, Datum value);
! static Datum toast_save_datum(Relation rel, Datum value, bool use_wal);
  static varattrib *toast_fetch_datum(varattrib *attr);
  static varattrib *toast_fetch_datum_slice(varattrib *attr,
  						int32 sliceoffset, int32 length);
***************
*** 342,348 ****
   * ----------
   */
  HeapTuple
! toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup)
  {
  	HeapTuple	result_tuple;
  	TupleDesc	tupleDesc;
--- 342,348 ----
   * ----------
   */
  HeapTuple
! toast_insert_or_update(Relation rel, HeapTuple newtup, HeapTuple oldtup, bool use_wal)
  {
  	HeapTuple	result_tuple;
  	TupleDesc	tupleDesc;
***************
*** 612,618 ****
  		i = biggest_attno;
  		old_value = toast_values[i];
  		toast_action[i] = 'p';
! 		toast_values[i] = toast_save_datum(rel, toast_values[i]);
  		if (toast_free[i])
  			pfree(DatumGetPointer(old_value));
  
--- 612,618 ----
  		i = biggest_attno;
  		old_value = toast_values[i];
  		toast_action[i] = 'p';
! 		toast_values[i] = toast_save_datum(rel, toast_values[i], use_wal);
  		if (toast_free[i])
  			pfree(DatumGetPointer(old_value));
  
***************
*** 724,730 ****
  		i = biggest_attno;
  		old_value = toast_values[i];
  		toast_action[i] = 'p';
! 		toast_values[i] = toast_save_datum(rel, toast_values[i]);
  		if (toast_free[i])
  			pfree(DatumGetPointer(old_value));
  
--- 724,730 ----
  		i = biggest_attno;
  		old_value = toast_values[i];
  		toast_action[i] = 'p';
! 		toast_values[i] = toast_save_datum(rel, toast_values[i], use_wal);
  		if (toast_free[i])
  			pfree(DatumGetPointer(old_value));
  
***************
*** 972,978 ****
   * ----------
   */
  static Datum
! toast_save_datum(Relation rel, Datum value)
  {
  	Relation	toastrel;
  	Relation	toastidx;
--- 972,978 ----
   * ----------
   */
  static Datum
! toast_save_datum(Relation rel, Datum value, bool use_wal)
  {
  	Relation	toastrel;
  	Relation	toastidx;
***************
*** 1057,1063 ****
  		if (!HeapTupleIsValid(toasttup))
  			elog(ERROR, "failed to build TOAST tuple");
  
! 		simple_heap_insert(toastrel, toasttup);
  
  		/*
  		 * Create the index entry.	We cheat a little here by not using
--- 1057,1064 ----
  		if (!HeapTupleIsValid(toasttup))
  			elog(ERROR, "failed to build TOAST tuple");
  
!		 /* not worried about forcing commit bits to be set for TOAST */
!		 fast_heap_insert(toastrel, toasttup, use_wal, !use_wal);
  
  		/*
  		 * Create the index entry.	We cheat a little here by not using
Index: src/backend/catalog/index.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/catalog/index.c,v
retrieving revision 1.274
diff -c -r1.274 index.c
*** src/backend/catalog/index.c	4 Oct 2006 00:29:50 -0000	1.274
--- src/backend/catalog/index.c	4 Jan 2007 15:24:06 -0000
***************
*** 1238,1243 ****
--- 1238,1246 ----
  
  	heap_close(pg_class, RowExclusiveLock);
  
+	 /* Remember we did this in current transaction, to allow later optimisations */
+	 relation->rd_newRelfilenodeSubid = GetCurrentSubTransactionId();
+ 
  	/* Make sure the relfilenode change is visible */
  	CommandCounterIncrement();
  }
Index: src/backend/commands/copy.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/commands/copy.c,v
retrieving revision 1.273
diff -c -r1.273 copy.c
*** src/backend/commands/copy.c	6 Oct 2006 17:13:58 -0000	1.273
--- src/backend/commands/copy.c	4 Jan 2007 15:24:08 -0000
***************
*** 1652,1657 ****
--- 1652,1659 ----
  	ExprContext *econtext;		/* used for ExecEvalExpr for default atts */
  	MemoryContext oldcontext = CurrentMemoryContext;
  	ErrorContextCallback errcontext;
+	 bool		use_wal = true; /* By default, we use WAL to log db changes */
+	 bool		force_committed = false;
  
  	Assert(cstate->rel);
  
***************
*** 1843,1848 ****
--- 1845,1878 ----
  	nfields = file_has_oids ? (attr_count + 1) : attr_count;
  	field_strings = (char **) palloc(nfields * sizeof(char *));
  
+ 	/*
+	  * Check for performance optimization by avoiding WAL writes
+	  *
+	  * If archive logging is not enabled *and* either
+	  * - table is created in same transaction as this COPY
+	  * - table data is now being written to new relfilenode
+	  * then we can safely avoid writing WAL. Why? 
+	  * The data files for the table plus toast table/index, plus any indexes
+	  * will all be dropped at the end of the transaction if it fails, so we
+	  * do not need to worry about inconsistent states.
+	  * As mentioned in comments in utils/rel.h, the in-same-transaction test is
+	  * not completely reliable, since rd_createSubId can be reset to zero in
+	  * certain cases before the end of the creating transaction. 
+	  * We are doing this for performance only, so we only need to know: 
+	  * if rd_createSubid != InvalidSubTransactionId then it is *always* just 
+	  * created.
+	  * We also want to set HEAP_XMIN_COMMITTED for these tuples, so we
+	  * can only do this when we don't have a previously open portal.
+	  * COPY itself opens a portal, so we check for just a single portal.
+	  * If we have PITR enabled, then we must use_wal
+	  */
+ 	if (NumAccessiblePortals() == 1 &&
+		  (cstate->rel->rd_createSubid		 != InvalidSubTransactionId ||
+		   cstate->rel->rd_newRelfilenodeSubid != InvalidSubTransactionId))
+		 force_committed = true;
+ 	if (force_committed && !XLogArchivingActive())
+		 use_wal = false;
+ 
  	/* Initialize state variables */
  	cstate->fe_eof = false;
  	cstate->eol_type = EOL_UNKNOWN;
***************
*** 2076,2082 ****
  				ExecConstraints(resultRelInfo, slot, estate);
  
  			/* OK, store the tuple and create index entries for it */
! 			simple_heap_insert(cstate->rel, tuple);
  
  			if (resultRelInfo->ri_NumIndices > 0)
  				ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
--- 2106,2112 ----
  				ExecConstraints(resultRelInfo, slot, estate);
  
  			/* OK, store the tuple and create index entries for it */
!			 fast_heap_insert(cstate->rel, tuple, use_wal, force_committed);
  
  			if (resultRelInfo->ri_NumIndices > 0)
  				ExecInsertIndexTuples(slot, &(tuple->t_self), estate, false);
***************
*** 2093,2098 ****
--- 2123,2154 ----
  		}
  	}
  
+	 /* 
+	  * If we skipped writing WAL for heaps, then we need to sync
+	  */
+	 if (!use_wal)
+	 {
+		 /* main heap */
+		 heap_sync(cstate->rel);
+ 
+		 /* main heap indexes, if any */
+		 /* we always use WAL for index inserts, so no need to sync */
+ 
+		 /* toast heap, if any */
+		 if (OidIsValid(cstate->rel->rd_rel->reltoastrelid))
+		 {
+				 Relation		toastrel;
+ 
+				 toastrel = heap_open(cstate->rel->rd_rel->reltoastrelid,
+									  AccessShareLock);
+				 heap_sync(toastrel);
+				 heap_close(toastrel, AccessShareLock);
+		 }
+ 
+		 /* toast index, if toast heap */
+		 /* we always use WAL for index inserts, so no need to sync */
+	 }
+ 
  	/* Done, clean up */
  	error_context_stack = errcontext.previous;
  
Index: src/backend/executor/execMain.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/execMain.c,v
retrieving revision 1.282
diff -c -r1.282 execMain.c
*** src/backend/executor/execMain.c	26 Dec 2006 21:37:19 -0000	1.282
--- src/backend/executor/execMain.c	4 Jan 2007 15:24:08 -0000
***************
*** 1410,1416 ****
  	 */
  	newId = heap_insert(resultRelationDesc, tuple,
  						estate->es_snapshot->curcid,
! 						true, true);
  
  	IncrAppended();
  	(estate->es_processed)++;
--- 1410,1416 ----
  	 */
  	newId = heap_insert(resultRelationDesc, tuple,
  						estate->es_snapshot->curcid,
! 						true, true, false);
  
  	IncrAppended();
  	(estate->es_processed)++;
***************
*** 2516,2526 ****
  		 */
  		if (!estate->es_into_relation_use_wal &&
  			!estate->es_into_relation_descriptor->rd_istemp)
! 		{
! 			FlushRelationBuffers(estate->es_into_relation_descriptor);
! 			/* FlushRelationBuffers will have opened rd_smgr */
! 			smgrimmedsync(estate->es_into_relation_descriptor->rd_smgr);
! 		}
  
  		/* close rel, but keep lock until commit */
  		heap_close(estate->es_into_relation_descriptor, NoLock);
--- 2516,2522 ----
  		 */
  		if (!estate->es_into_relation_use_wal &&
  			!estate->es_into_relation_descriptor->rd_istemp)
!			 heap_sync(estate->es_into_relation_descriptor);
  
  		/* close rel, but keep lock until commit */
  		heap_close(estate->es_into_relation_descriptor, NoLock);
***************
*** 2577,2583 ****
  				tuple,
  				estate->es_snapshot->curcid,
  				estate->es_into_relation_use_wal,
! 				false);			/* never any point in using FSM */
  
  	/* We know this is a newly created relation, so there are no indexes */
  
--- 2573,2580 ----
  				tuple,
  				estate->es_snapshot->curcid,
  				estate->es_into_relation_use_wal,
! 				false,			/* never any point in using FSM */
!				 true);
  
  	/* We know this is a newly created relation, so there are no indexes */
  
Index: src/backend/utils/cache/relcache.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/cache/relcache.c,v
retrieving revision 1.252
diff -c -r1.252 relcache.c
*** src/backend/utils/cache/relcache.c	31 Dec 2006 20:32:04 -0000	1.252
--- src/backend/utils/cache/relcache.c	4 Jan 2007 15:24:10 -0000
***************
*** 836,841 ****
--- 836,842 ----
  	relation->rd_refcnt = 0;
  	relation->rd_isnailed = false;
  	relation->rd_createSubid = InvalidSubTransactionId;
+	 relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
  	relation->rd_istemp = isTempNamespace(relation->rd_rel->relnamespace);
  
  	/*
***************
*** 1342,1347 ****
--- 1343,1349 ----
  	 */
  	relation->rd_isnailed = true;
  	relation->rd_createSubid = InvalidSubTransactionId;
+	 relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
  	relation->rd_istemp = false;
  
  	/*
***************
*** 1753,1758 ****
--- 1755,1761 ----
  		Oid			save_relid = RelationGetRelid(relation);
  		int			old_refcnt = relation->rd_refcnt;
  		SubTransactionId old_createSubid = relation->rd_createSubid;
+ 		SubTransactionId old_newRelfilenodeSubid = relation->rd_newRelfilenodeSubid;
  		TupleDesc	old_att = relation->rd_att;
  		RuleLock   *old_rules = relation->rd_rules;
  		MemoryContext old_rulescxt = relation->rd_rulescxt;
***************
*** 1771,1776 ****
--- 1774,1781 ----
  		}
  		relation->rd_refcnt = old_refcnt;
  		relation->rd_createSubid = old_createSubid;
+		 relation->rd_newRelfilenodeSubid = old_newRelfilenodeSubid;
+ 
  		if (equalTupleDescs(old_att, relation->rd_att))
  		{
  			/* needn't flush typcache here */
***************
*** 1811,1817 ****
  {
  	bool		rebuild;
  
! 	if (relation->rd_createSubid != InvalidSubTransactionId)
  	{
  		/*
  		 * New relcache entries are always rebuilt, not flushed; else we'd
--- 1816,1823 ----
  {
  	bool		rebuild;
  
! 	if (relation->rd_createSubid != InvalidSubTransactionId ||
!		 relation->rd_newRelfilenodeSubid != InvalidSubTransactionId)
  	{
  		/*
  		 * New relcache entries are always rebuilt, not flushed; else we'd
***************
*** 1893,1898 ****
--- 1899,1907 ----
   *	 so we do not touch new-in-transaction relations; they cannot be targets
   *	 of cross-backend SI updates (and our own updates now go through a
   *	 separate linked list that isn't limited by the SI message buffer size).
+  *   We don't do anything special for newRelfilenode-in-transaction relations, 
+  *   though since we have a lock on the relation nobody else should be 
+  *   generating cache invalidation messages for it anyhow.
   *
   *	 We do this in two phases: the first pass deletes deletable items, and
   *	 the second one rebuilds the rebuildable items.  This is essential for
***************
*** 2069,2074 ****
--- 2078,2084 ----
  				continue;
  			}
  		}
+		 relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
  
  		/*
  		 * Flush any temporary index list.
***************
*** 2130,2135 ****
--- 2140,2152 ----
  				continue;
  			}
  		}
+ 		if (relation->rd_newRelfilenodeSubid == mySubid)
+ 		{
+ 			if (isCommit)
+ 				relation->rd_newRelfilenodeSubid = parentSubid;
+ 			else
+				 relation->rd_newRelfilenodeSubid = InvalidSubTransactionId;
+		 }
  
  		/*
  		 * Flush any temporary index list.
***************
*** 2219,2224 ****
--- 2236,2242 ----
  
  	/* it's being created in this transaction */
  	rel->rd_createSubid = GetCurrentSubTransactionId();
+	 rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
  
  	/* must flag that we have rels created in this transaction */
  	need_eoxact_work = true;
***************
*** 3364,3369 ****
--- 3382,3388 ----
  		rel->rd_indexlist = NIL;
  		rel->rd_oidindex = InvalidOid;
  		rel->rd_createSubid = InvalidSubTransactionId;
+		 rel->rd_newRelfilenodeSubid = InvalidSubTransactionId;
  		rel->rd_amcache = NULL;
  		MemSet(&rel->pgstat_info, 0, sizeof(rel->pgstat_info));
  
Index: src/backend/utils/mmgr/portalmem.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/utils/mmgr/portalmem.c,v
retrieving revision 1.97
diff -c -r1.97 portalmem.c
*** src/backend/utils/mmgr/portalmem.c	23 Nov 2006 01:14:59 -0000	1.97
--- src/backend/utils/mmgr/portalmem.c	4 Jan 2007 15:24:12 -0000
***************
*** 88,94 ****
  } while(0)
  
  static MemoryContext PortalMemory = NULL;
! 
  
  /* ----------------------------------------------------------------
   *				   public portal interface functions
--- 88,94 ----
  } while(0)
  
  static MemoryContext PortalMemory = NULL;
! static int NPortals = 0;
  
  /* ----------------------------------------------------------------
   *				   public portal interface functions
***************
*** 227,232 ****
--- 227,234 ----
  	/* put portal in table (sets portal->name) */
  	PortalHashTableInsert(portal, name);
  
+	 ++NPortals;
+ 
  	return portal;
  }
  
***************
*** 410,415 ****
--- 412,420 ----
  
  	/* release portal struct (it's in PortalMemory) */
  	pfree(portal);
+ 
+	 if (--NPortals < 0)
+ 		elog(ERROR, "error in number of portals");
  }
  
  /*
***************
*** 548,554 ****
  	HASH_SEQ_STATUS status;
  	PortalHashEnt *hentry;
  
! 	hash_seq_init(&status, PortalHashTable);
  
  	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
  	{
--- 553,559 ----
  	HASH_SEQ_STATUS status;
  	PortalHashEnt *hentry;
  
!  	hash_seq_init(&status, PortalHashTable);
  
  	while ((hentry = (PortalHashEnt *) hash_seq_search(&status)) != NULL)
  	{
***************
*** 802,807 ****
--- 807,823 ----
  	}
  }
  
+ /*
+  * Get the number of portals from which the backend *could* retrieve
+  * tuples. This is an important number for the safe use of some 
+  * performance optimizations.
+  */
+ int
+ NumAccessiblePortals(void)
+ {
+	 return NPortals;
+ }
+ 
  /* Find all available cursors */
  Datum
  pg_cursor(PG_FUNCTION_ARGS)
Index: src/include/miscadmin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/miscadmin.h,v
retrieving revision 1.190
diff -c -r1.190 miscadmin.h
*** src/include/miscadmin.h	19 Oct 2006 18:32:47 -0000	1.190
--- src/include/miscadmin.h	4 Jan 2007 15:24:13 -0000
***************
*** 216,221 ****
--- 216,224 ----
  /* in tcop/postgres.c */
  extern void check_stack_depth(void);
  
+ /* in utils/mmgr/portalmem.c */
+ extern int NumAccessiblePortals(void);
+ 
  
  /*****************************************************************************
   *	  pdir.h --																 *
Index: src/include/access/heapam.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/heapam.h,v
retrieving revision 1.117
diff -c -r1.117 heapam.h
*** src/include/access/heapam.h	5 Nov 2006 22:42:10 -0000	1.117
--- src/include/access/heapam.h	4 Jan 2007 15:24:13 -0000
***************
*** 157,163 ****
  extern void setLastTid(const ItemPointer tid);
  
  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool use_fsm);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
  			CommandId cid, Snapshot crosscheck, bool wait);
--- 157,163 ----
  extern void setLastTid(const ItemPointer tid);
  
  extern Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
! 			bool use_wal, bool use_fsm, bool force_committed);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
  			ItemPointer ctid, TransactionId *update_xmax,
  			CommandId cid, Snapshot crosscheck, bool wait);
***************
*** 178,183 ****
--- 178,186 ----
  extern void simple_heap_update(Relation relation, ItemPointer otid,
  				   HeapTuple tup);
  
+ extern Oid fast_heap_insert(Relation relation, HeapTuple tup, bool use_wal, 
+				 bool force_committed);
+ 
  extern void heap_markpos(HeapScanDesc scan);
  extern void heap_restrpos(HeapScanDesc scan);
  
***************
*** 236,239 ****
--- 239,244 ----
  extern HeapTuple heap_addheader(int natts, bool withoid,
  			   Size structlen, void *structure);
  
+ extern void heap_sync(Relation relation);
+ 
  #endif   /* HEAPAM_H */
Index: src/include/access/tuptoaster.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/tuptoaster.h,v
retrieving revision 1.28
diff -c -r1.28 tuptoaster.h
*** src/include/access/tuptoaster.h	13 Jul 2006 17:47:01 -0000	1.28
--- src/include/access/tuptoaster.h	4 Jan 2007 15:24:13 -0000
***************
*** 69,75 ****
   * ----------
   */
  extern HeapTuple toast_insert_or_update(Relation rel,
! 					   HeapTuple newtup, HeapTuple oldtup);
  
  /* ----------
   * toast_delete -
--- 69,75 ----
   * ----------
   */
  extern HeapTuple toast_insert_or_update(Relation rel,
! 					   HeapTuple newtup, HeapTuple oldtup, bool use_wal);
  
  /* ----------
   * toast_delete -
Index: src/include/utils/rel.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/utils/rel.h,v
retrieving revision 1.93
diff -c -r1.93 rel.h
*** src/include/utils/rel.h	23 Dec 2006 00:43:13 -0000	1.93
--- src/include/utils/rel.h	4 Jan 2007 15:24:13 -0000
***************
*** 137,142 ****
--- 137,144 ----
  	char		rd_indexvalid;	/* state of rd_indexlist: 0 = not valid, 1 =
  								 * valid, 2 = temporarily forced */
  	SubTransactionId rd_createSubid;	/* rel was created in current xact */
+ 	SubTransactionId rd_newRelfilenodeSubid;	/* rel had new relfilenode in current xact */
+	 
  
  	/*
  	 * rd_createSubid is the ID of the highest subtransaction the rel has
#2Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)
Re: COPY with no WAL, in certain circumstances

FYI, I am going to need to add documentation to the COPY manual page, or no
one will know about this performance enhancement.

---------------------------------------------------------------------------

Simon Riggs wrote:

http://archives.postgresql.org/pgsql-hackers/2006-10/msg01172.php

As discussed on -hackers, it's possible to avoid writing any WAL at all
for COPY in these circumstances:

BEGIN;
CREATE TABLE foo..
COPY foo...
COMMIT;

BEGIN;
TRUNCATE foo..
COPY foo...
COMMIT;

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

This plays nicely with the --single-transaction option in psql to allow
fast restores/upgrades.

YMMV but disk bound COPY will benefit greatly from this patch, some
tests showing 100% gain. COPY is still *very* CPU intensive, so some
tests have shown negligible benefit, fyi, but that isn't the typical
case.

Applies cleanly to CVS HEAD, passes make check.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

[ Attachment, skipping... ]

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#3Joshua D. Drake
jd@commandprompt.com
In reply to: Bruce Momjian (#2)
Re: COPY with no WAL, in certain circumstances

On Sat, 2007-01-06 at 11:05 -0500, Bruce Momjian wrote:

FYI, I am going to need to add documentation to the COPY manual page, or no
one will know about this performance enhancement.

I have some questions:

As discussed on -hackers, it's possible to avoid writing any WAL at all
for COPY in these circumstances:

BEGIN;
CREATE TABLE foo..
COPY foo...
COMMIT;

What if I do this?

BEGIN;
CREATE TABLE foo...
INSERT INTO foo VALUES ('1');
COPY foo...

COMMIT;

?

E.g., what are the boundaries of ignoring the WAL?

Joshua D. Drake

BEGIN;
TRUNCATE foo..
COPY foo...
COMMIT;

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

This plays nicely with the --single-transaction option in psql to allow
fast restores/upgrades.

YMMV but disk bound COPY will benefit greatly from this patch, some
tests showing 100% gain. COPY is still *very* CPU intensive, so some
tests have shown negligible benefit, fyi, but that isn't the typical
case.

Applies cleanly to CVS HEAD, passes make check.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

[ Attachment, skipping... ]

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#4Bruce Momjian
bruce@momjian.us
In reply to: Joshua D. Drake (#3)
Re: COPY with no WAL, in certain circumstances

Joshua D. Drake wrote:

On Sat, 2007-01-06 at 11:05 -0500, Bruce Momjian wrote:

FYI, I am going to need to add documentation to the COPY manual page, or no
one will know about this performance enhancement.

I have some questions:

As discussed on -hackers, it's possible to avoid writing any WAL at all
for COPY in these circumstances:

BEGIN;
CREATE TABLE foo..
COPY foo...
COMMIT;

What if I do this?

BEGIN;
CREATE TABLE foo...
INSERT INTO foo VALUES ('1');
COPY foo...

COMMIT;

On ABORT, the entire table disappears, as well as the INSERT, so I don't
see any problem. I assume the INSERT is WAL logged.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#5Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#2)
Re: COPY with no WAL, in certain circumstances

Bruce Momjian <bruce@momjian.us> writes:

FYI, I am going to need to add documentation to the COPY manual page, or no
one will know about this performance enhancement.

I don't think it belongs in COPY. What would make more sense is another
item under the "populating a database" performance tips, suggesting that
wrapping the restore into a single transaction is a good idea. We don't
really want to be documenting this separately under COPY, CREATE INDEX,
and everywhere else that might eventually optimize the case.

Come to think of it, that page also fails to suggest that PITR logging
shouldn't be on during bulk load.

regards, tom lane

#6Joshua D. Drake
jd@commandprompt.com
In reply to: Bruce Momjian (#4)
Re: COPY with no WAL, in certain circumstances

BEGIN;
CREATE TABLE foo...
INSERT INTO foo VALUES ('1');
COPY foo...

COMMIT;

On ABORT, the entire table disappears, as well as the INSERT, so I don't
see any problem. I assume the INSERT is WAL logged.

No, I don't see any problems; I am just trying to understand the
boundaries. E.g., is there some weird limitation where, if I have any
values in the table before the copy (like the example above), that copy
will go through WAL?

Or in other words, does this patch mean that all COPY execution that is
within a transaction will ignore WAL?

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#7Euler Taveira de Oliveira
euler@timbira.com
In reply to: Simon Riggs (#1)
Re: COPY with no WAL, in certain circumstances

Simon Riggs wrote:

As discussed on -hackers, it's possible to avoid writing any WAL at all
for COPY in these circumstances:

Cool.

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

IMHO, this deserves a GUC parameter (use_wal_in_copy?), because a lot
of people use COPY because it's faster than INSERT but expect that it
will be WAL-logged. The default would be use_wal_in_copy = true.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#8Joshua D. Drake
jd@commandprompt.com
In reply to: Euler Taveira de Oliveira (#7)
Re: COPY with no WAL, in certain circumstances

On Sat, 2007-01-06 at 16:41 -0200, Euler Taveira de Oliveira wrote:

Simon Riggs wrote:

As discussed on -hackers, it's possible to avoid writing any WAL at all
for COPY in these circumstances:

Cool.

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

IMHO, this deserves a GUC parameter (use_wal_in_copy?), because a lot
of people use COPY because it's faster than INSERT but expect that it
will be WAL-logged. The default would be use_wal_in_copy = true.

I don't think that makes sense. A copy is an all-or-nothing operation; if a
copy fails in the middle, the whole thing is rolled back.

Sincerely,

Joshua D. Drake

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#9Tom Lane
tgl@sss.pgh.pa.us
In reply to: Euler Taveira de Oliveira (#7)
Re: COPY with no WAL, in certain circumstances

Euler Taveira de Oliveira <euler@timbira.com> writes:

Simon Riggs wrote:

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

IMHO, this deserves a GUC parameter (use_wal_in_copy?).

Why? The whole point is that it's automatic and transparent.

regards, tom lane

#10Bruce Momjian
bruce@momjian.us
In reply to: Joshua D. Drake (#6)
Re: COPY with no WAL, in certain circumstances

Joshua D. Drake wrote:

BEGIN;
CREATE TABLE foo...
INSERT INTO foo VALUES ('1');
COPY foo...

COMMIT;

On ABORT, the entire table disappears, as well as the INSERT, so I don't
see any problem. I assume the INSERT is WAL logged.

No, I don't see any problems; I am just trying to understand the
boundaries. E.g., is there some weird limitation where, if I have any
values in the table before the copy (like the example above), that copy
will go through WAL?

Or in other words, does this patch mean that all COPY execution that is
within a transaction will ignore WAL?

Yes, because it is possible to do in all cases.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11Simon Riggs
simon@2ndquadrant.com
In reply to: Bruce Momjian (#10)
Re: COPY with no WAL, in certain circumstances

On Sat, 2007-01-06 at 15:24 -0500, Bruce Momjian wrote:

Joshua D. Drake wrote:

BEGIN;
CREATE TABLE foo...
INSERT INTO foo VALUES ('1');
COPY foo...

COMMIT;

On ABORT, the entire table disappears, as well as the INSERT, so I don't
see any problem. I assume the INSERT is WAL logged.

No, I don't see any problems; I am just trying to understand the
boundaries. E.g., is there some weird limitation where, if I have any
values in the table before the copy (like the example above), that copy
will go through WAL?

Or in other words, does this patch mean that all COPY execution that is
within a transaction will ignore WAL?

Yes, because it is possible to do in all cases.

Very happy to add documentation where Tom suggested.

Reason for no documentation was that CREATE INDEX and CREATE TABLE AS
SELECT already use this optimisation, but to my knowledge neither was/is
documented on those command pages.

The rule is: if the relfilenode for a table is new in this transaction
(and therefore the whole thing will be dropped at end-of-transaction)
then *all* COPY commands are able to avoid writing WAL safely, if:
- PITR is not enabled
- there is no active portal (which could have been opened on an earlier
commandid and could therefore see data prior to the switch to the new
relfilenode). In those cases, *not* using WAL causes no problems at all,
so sleep well without it.

So all of these work as shown

BEGIN;
COPY foo... --uses WAL
TRUNCATE foo...
COPY foo.. --no WAL
COPY foo.. --no WAL
COMMIT;

BEGIN;
CREATE TABLE foo...
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo... --no WAL
COMMIT;

BEGIN;
CREATE TABLE foo... AS SELECT
--no WAL
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
COMMIT;

BEGIN;
DECLARE CURSOR cursor
CREATE TABLE foo...
COPY foo.. --uses WAL because active portal
COPY foo.. --uses WAL because active portal
CLOSE cursor
COPY foo.. --no WAL
COPY foo.. --no WAL
COMMIT;

psql --single-transaction -f mydb.pgdump

Come to think of it, I should be able to use
pg_current_xlog_insert_location() to come up with a test case.
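
A rough sketch of such a test (the data file path and table definition
are just for illustration):

BEGIN;
SELECT pg_current_xlog_insert_location();	--note the WAL insert position
CREATE TABLE foo_nowal (id integer, payload text);
COPY foo_nowal FROM '/tmp/foo_nowal.dat';
SELECT pg_current_xlog_insert_location();	--should have moved very little if WAL was skipped
COMMIT;

Repeating the COPY against a table that was not created or truncated in
the same transaction should show the insert position advancing by
roughly the volume of data loaded.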

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#12Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#11)
Re: COPY with no WAL, in certain circumstances

Simon Riggs wrote:

Or in other words, does this patch mean that all COPY execution that is
within a transaction will ignore WAL?

Yes, because it is possible to do in all cases.

Very happy to add documentation where Tom suggested.

Reason for no documentation was that CREATE INDEX and CREATE TABLE AS
SELECT already use this optimisation, but to my knowledge neither was/is
documented on those command pages.

I wasn't aware those used the optimization. Seems they all should be
documented somewhere.

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#13Euler Taveira de Oliveira
euler@timbira.com
In reply to: Joshua D. Drake (#8)
Re: COPY with no WAL, in certain circumstances

Joshua D. Drake wrote:

IMHO, this deserves a GUC parameter (use_wal_in_copy?), because a lot
of people use COPY because it's faster than INSERT but expect that it
will be WAL-logged. The default would be use_wal_in_copy = true.

I don't think that makes sense. A copy is an all-or-nothing operation; if a
copy fails in the middle, the whole thing is rolled back.

I was worried about PITR, but Simon answered my question: with PITR
enabled, WAL is still used.

--
Euler Taveira de Oliveira
http://www.timbira.com/

#14Robert Treat
xzilla@users.sourceforge.net
In reply to: Bruce Momjian (#12)
Re: COPY with no WAL, in certain circumstances

On Saturday 06 January 2007 16:40, Bruce Momjian wrote:

Simon Riggs wrote:

Or in other words, does this patch mean that all COPY execution that
is within a transaction will ignore WAL?

Yes, because it is possible to do in all cases.

Very happy to add documentation where Tom suggested.

Reason for no documentation was that CREATE INDEX and CREATE TABLE AS
SELECT already use this optimisation, but to my knowledge neither was/is
documented on those command pages.

I wasn't aware those used the optimization. Seems they all should be
documented somewhere.

Might I suggest somewhere under chapter 27, with something akin to what we
have for documenting lock levels and the different operations that use them?
We would document the reasons you might want to avoid WAL and the various
operations in the database that do this automagically.

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

#15Robert Treat
xzilla@users.sourceforge.net
In reply to: Simon Riggs (#11)
Re: COPY with no WAL, in certain circumstances

On Saturday 06 January 2007 16:36, Simon Riggs wrote:

The rule is: if the relfilenode for a table is new in this transaction
(and therefore the whole thing will be dropped at end-of-transaction)
then *all* COPY commands are able to avoid writing WAL safely, if:
- PITR is not enabled
- there is no active portal (which could have been opened on an earlier
commandid and could therefore see data prior to the switch to the new
relfilenode). In those cases, *not* using WAL causes no problems at all,
so sleep well without it.

<snip>

BEGIN;
CREATE TABLE foo...
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo... --no WAL
COMMIT;

Is there some technical reason that the INSERT statements need to use WAL in
these scenarios? ISTM that in the above scenario there are no cases where
the INSERT statements are any more recoverable than the COPY statements.
While there might not be much gain from bypassing WAL on a single insert,
for inserts in bunches, or more importantly when doing INSERT INTO foo SELECT *,
it could be a nice improvement as well. Am I overlooking something?

--
Robert Treat
Build A Brighter LAMP :: Linux Apache {middleware} PostgreSQL

#16Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#11)
Re: COPY with no WAL, in certain circumstances

"Simon Riggs" <simon@2ndquadrant.com> writes:

The rule is: if the relfilenode for a table is new in this transaction
(and therefore the whole thing will be dropped at end-of-transaction)
then *all* COPY commands are able to avoid writing WAL safely, if:
- PITR is not enabled
- there is no active portal (which could have been opened on an earlier
commandid and could therefore see data prior to the switch to the new
relfilenode). In those cases, *not* using WAL causes no problems at all,
so sleep well without it.

Uh ... what in the world has an active portal got to do with it?
I think you've confused snapshot considerations with crash recovery.

regards, tom lane

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#12)
Re: COPY with no WAL, in certain circumstances

Bruce Momjian <bruce@momjian.us> writes:

Simon Riggs wrote:

Reason for no documentation was that CREATE INDEX and CREATE TABLE AS
SELECT already use this optimisation, but to my knowledge neither was/is
documented on those command pages.

I wasn't aware those used the optimization. Seems they all should be
documented somewhere.

We don't document every single optimization in the system ... if we did,
the docs would be as big as the source code and equally unreadable by
non-programmers. I think it's a much better idea just to mention it one
place and not try to enumerate exactly which commands have the optimization.

regards, tom lane

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Treat (#15)
Re: COPY with no WAL, in certain circumstances

Robert Treat <xzilla@users.sourceforge.net> writes:

On Saturday 06 January 2007 16:36, Simon Riggs wrote:
<snip>

BEGIN;
CREATE TABLE foo...
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo... --no WAL
COMMIT;

Is there some technical reason that the INSERT statements need to use WAL in
these scenarios?

First, there's enough other overhead to an INSERT that you'd not save
much percentagewise. Second, not using WAL doesn't come for free: the
cost is having to fsync the whole table afterwards. So it really only
makes sense for commands that one can expect are writing pretty much
all of the table. I could easily see it being a net loss for individual
INSERTs.

regards, tom lane

#19Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#18)
Re: COPY with no WAL, in certain circumstances

Is there some technical reason that the INSERT statements need to use WAL in
these scenarios?

First, there's enough other overhead to an INSERT that you'd not save
much percentagewise. Second, not using WAL doesn't come for free: the
cost is having to fsync the whole table afterwards. So it really only
makes sense for commands that one can expect are writing pretty much
all of the table. I could easily see it being a net loss for individual
INSERTs.

What about multi-value inserts? Just curious.

Joshua D. Drake

regards, tom lane

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#20Tom Lane
tgl@sss.pgh.pa.us
In reply to: Joshua D. Drake (#19)
Re: COPY with no WAL, in certain circumstances

"Joshua D. Drake" <jd@commandprompt.com> writes:

cost is having to fsync the whole table afterwards. So it really only
makes sense for commands that one can expect are writing pretty much
all of the table. I could easily see it being a net loss for individual
INSERTs.

What about multi value inserts? Just curious.

I wouldn't want the system to assume that a multi-VALUES insert is
writing most of the table. Would you? The thing is reasonable for
inserting maybe a few hundred or few thousand rows at most, and that's
still small in comparison to typical tables.

regards, tom lane

#21Joshua D. Drake
jd@commandprompt.com
In reply to: Tom Lane (#20)
Re: COPY with no WAL, in certain circumstances

On Sat, 2007-01-06 at 22:09 -0500, Tom Lane wrote:

"Joshua D. Drake" <jd@commandprompt.com> writes:

cost is having to fsync the whole table afterwards. So it really only
makes sense for commands that one can expect are writing pretty much
all of the table. I could easily see it being a net loss for individual
INSERTs.

What about multi value inserts? Just curious.

I wouldn't want the system to assume that a multi-VALUES insert is
writing most of the table. Would you? The thing is reasonable for
inserting maybe a few hundred or few thousand rows at most, and that's
still small in comparison to typical tables.

Good point. :)

Joshua D. Drake

regards, tom lane

--

=== The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
Providing the most comprehensive PostgreSQL solutions since 1997
http://www.commandprompt.com/

Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate

#22Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#18)
Re: COPY with no WAL, in certain circumstances

On Sat, 2007-01-06 at 21:32 -0500, Tom Lane wrote:

Robert Treat <xzilla@users.sourceforge.net> writes:

On Saturday 06 January 2007 16:36, Simon Riggs wrote:
<snip>

BEGIN;
CREATE TABLE foo...
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo.. --no WAL
INSERT INTO foo --uses WAL
COPY foo... --no WAL
COMMIT;

Is there some technical reason that the INSERT statements need to use WAL in
these scenarios?

First, there's enough other overhead to an INSERT that you'd not save
much percentagewise. Second, not using WAL doesn't come for free: the
cost is having to fsync the whole table afterwards. So it really only
makes sense for commands that one can expect are writing pretty much
all of the table. I could easily see it being a net loss for individual
INSERTs.

Agreed. We agreed that before, on the original design thread.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#23Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#16)
Re: COPY with no WAL, in certain circumstances

On Sat, 2007-01-06 at 21:18 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

The rule is: if the relfilenode for a table is new in this transaction
(and therefore the whole thing will be dropped at end-of-transaction)
then *all* COPY commands are able to avoid writing WAL safely, if:
- PITR is not enabled
- there is no active portal (which could have been opened on an earlier
commandid and could therefore see data prior to the switch to the new
relfilenode). In those cases, *not* using WAL causes no problems at all,
so sleep well without it.

Uh ... what in the world has an active portal got to do with it?
I think you've confused snapshot considerations with crash recovery.

The patch sets HEAP_XMIN_COMMITTED on all of the rows loaded by COPY as
well, so the active portal consideration does apply in this case. (We
discussed, about a year ago, the idea of setting FrozenTransactionId,
which I now agree wouldn't work, but setting the hint bits does work.)
That is important, because otherwise the first person to read the newly
loaded table has to re-write the whole table again; right now we ignore
that cost as being associated with the original COPY, but from most
users' perspective it is. It's common practice to issue a select count(*)
from the table after it's been loaded, so that later readers of the table
don't suffer.
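
That is, something like this straight after the load (foo being the
running example table here):

select count(*) from foo;	--one full scan sets the hint bits up front, so later readers don't pay that cost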

Which makes me think we can still use the no-WAL optimisation, but just
without setting HEAP_XMIN_COMMITTED when there is an active portal.

(I should also mention that the creation of the relfilenode can happen
in earlier committed subtransactions also. There is also a great big
list of commands that throw implicit transactions, none of which can
therefore be used with this optimisation either.)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#23)
Re: COPY with no WAL, in certain circumstances

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Sat, 2007-01-06 at 21:18 -0500, Tom Lane wrote:

Uh ... what in the world has an active portal got to do with it?
I think you've confused snapshot considerations with crash recovery.

The patch sets HEAP_XMIN_COMMITTED on all of the rows loaded by COPY as
well.

I think you just talked yourself out of getting this patch applied.

regards, tom lane

#25Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#24)
Re: COPY with no WAL, in certain circumstances

On Sun, 2007-01-07 at 03:53 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Sat, 2007-01-06 at 21:18 -0500, Tom Lane wrote:

Uh ... what in the world has an active portal got to do with it?
I think you've confused snapshot considerations with crash recovery.

The patch sets HEAP_XMIN_COMMITTED on all of the rows loaded by COPY as
well.

I think you just talked yourself out of getting this patch applied.

Maybe; what would be your explanation? Do you have a failure case you
know of? Perhaps if one exists, there is another route.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#26Martijn van Oosterhout
kleptog@svana.org
In reply to: Simon Riggs (#25)
Re: [HACKERS] COPY with no WAL, in certain circumstances

On Sun, Jan 07, 2007 at 11:46:29AM +0000, Simon Riggs wrote:

On Sun, 2007-01-07 at 03:53 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

The patch sets HEAP_XMIN_COMMITTED on all of the rows loaded by COPY as
well.

I think you just talked yourself out of getting this patch applied.

Maybe; what would be your explanation? Do you have a failure case you
know of? Perhaps if one exists, there is another route.

One thing I pondered while looking at this: how do you know the user is
going to commit the transaction after the COPY is complete? Could they
run analyze or vacuum or some other DDL command on the table that would
get confused by the disparity between the hint bits and the xlog?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/

From each according to his ability. To each according to his ability to litigate.

#27Simon Riggs
simon@2ndquadrant.com
In reply to: Martijn van Oosterhout (#26)
Re: [HACKERS] COPY with no WAL, in certain circumstances

On Sun, 2007-01-07 at 12:59 +0100, Martijn van Oosterhout wrote:

On Sun, Jan 07, 2007 at 11:46:29AM +0000, Simon Riggs wrote:

On Sun, 2007-01-07 at 03:53 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

The patch sets HEAP_XMIN_COMMITTED on all of the rows loaded by COPY as
well.

I think you just talked yourself out of getting this patch applied.

Maybe; what would be your explanation? Do you have a failure case you
know of? Perhaps if one exists, there is another route.

One thing I pondered while looking at this: how do you know the user is
going to commit the transaction after the COPY is complete? Could they
run analyze or vacuum or some other DDL command on the table that would
get confused by the disparity between the hint bits and the xlog?

If it crashes, we'll clean up the file. At end of statement it is synced
to disk. There is no failure condition where the rows continue to exist
on disk && the table relfilenode shows a committed transaction pointing
to the file containing the marked-valid-but-actually-not rows. There is
a failure condition where the new relfilenode is on disk, but the
version of the table that points to that will not be visible.

(You can't run a VACUUM inside a transaction block.)

Everybody else is locked out because the CREATE or TRUNCATE has taken an
AccessExclusiveLock.

I've just re-checked the conditions from tqual.c and they all check,
AFAICS. There would be a problem *if* it was possible to issue a
self-referential COPY, like this:
COPY foo FROM (select * from foo)
which would exhibit the Halloween problem. But this is not yet possible,
and if it were we would be able to check for that and avoid it.

I'm not saying I haven't made a mistake, but I've done lots of thinking
and checking to confirm that this is a valid thing to do. That in itself
is never enough, which is why I/we talk together. If somebody does find
a problem, it's a small thing to remove that from the patch, since it is
an additional enhancement on top of the basic WAL removal.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#28Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#25)
Re: [HACKERS] COPY with no WAL, in certain circumstances

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Sun, 2007-01-07 at 03:53 -0500, Tom Lane wrote:

I think you just talked yourself out of getting this patch applied.

Maybe; what would be your explanation?

The main reason is that you were guilty of false advertising. This
patch was described as being an application of a known-and-agreed-safe
optimization to a new case, viz letting COPY into a new table use a
whole-file fsync instead of WAL-logging individual records. I suspect
most people didn't look at it closely because it sounded like nothing
very new; I certainly didn't. Now we find out that you've also decided
you can subvert the MVCC system in the name of speed. This is NOT
something the hackers community has discussed and agreed to, and I for
one doubt that it's safe. The active-portal kluge that you've just
mentioned is nothing but a kluge, proving that you thought of some cases
where it would fail. But I doubt you thought of everything.

In any case the correct method for dealing with a new optimization of
questionable safety or value is to submit it as a separate patch, not
to hope that the committer will fail to notice that the patch doesn't
do what you said it did.

regards, tom lane

#29Tom Lane
tgl@sss.pgh.pa.us
In reply to: Tom Lane (#28)
Re: [HACKERS] COPY with no WAL, in certain circumstances

I wrote:

... The active-portal kluge that you've just
mentioned is nothing but a kluge, proving that you thought of some cases
where it would fail. But I doubt you thought of everything.

BTW, a sufficient counterexample for that kluge is that neither SPI nor
SQL-function execution uses a separate portal for invoked commands. Thus
testing whether there's only one active portal isn't sufficient to prove
that you're not inside a function executing in serializable mode, and
thus it could have a transaction snapshot predating the COPY.

It's conceivable that it's safe anyway, or could be made so with some
rejiggering of the tests in tqual.c, but counting active portals doesn't
do anything to help.

regards, tom lane

#30Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#27)
Re: [HACKERS] COPY with no WAL, in certain circumstances

"Simon Riggs" <simon@2ndquadrant.com> writes:

There is no failure condition where the rows continue to exist
on disk && the table relfilenode shows a committed transaction pointing
to the file containing the marked-valid-but-actually-not rows.

What of

BEGIN;
CREATE TABLE foo ...;
SAVEPOINT x;
COPY foo FROM ...;
ROLLBACK TO x;
COMMIT;

regards, tom lane

#31Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#28)
Re: [HACKERS] COPY with no WAL, in certain circumstances

On Sun, 2007-01-07 at 11:14 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

On Sun, 2007-01-07 at 03:53 -0500, Tom Lane wrote:

I think you just talked yourself out of getting this patch applied.

Maybe; what would be your explanation?

The main reason is that you were guilty of false advertising.

It was not my intention to do that, but I see that is how it has come
out.

I am at fault and will withdraw that part of the patch.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#32Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#29)
Re: [HACKERS] COPY with no WAL, in certain circumstances

On Sun, 2007-01-07 at 11:29 -0500, Tom Lane wrote:

I wrote:

... The active-portal kluge that you've just
mentioned is nothing but a kluge, proving that you thought of some cases
where it would fail. But I doubt you thought of everything.

BTW, a sufficient counterexample for that kluge is that neither SPI nor
SQL-function execution uses a separate portal for invoked commands. Thus
testing whether there's only one active portal isn't sufficient to prove
that you're not inside a function executing in serializable mode, and
thus it could have a transaction snapshot predating the COPY.

Chewing the last pieces of my Bowler hat while reading. I don't have
many left ;-(

It's conceivable that it's safe anyway, or could be made so with some
rejiggering of the tests in tqual.c, but counting active portals doesn't
do anything to help.

I'll rethink, but as you say, with a separate proposal and patch.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#33Bruce Momjian
bruce@momjian.us
In reply to: Simon Riggs (#1)
Re: COPY with no WAL, in certain circumstances

Patch withdrawn by author.

---------------------------------------------------------------------------

Simon Riggs wrote:

http://archives.postgresql.org/pgsql-hackers/2006-10/msg01172.php

As discussed on -hackers, its possible to avoid writing any WAL at all
for COPY in these circumstances:

BEGIN;
CREATE TABLE foo..
COPY foo...
COMMIT;

BEGIN;
TRUNCATE foo..
COPY foo...
COMMIT;

The enclosed patch implements this, as discussed. There is no user
interface to enable/disable, just as with CTAS and CREATE INDEX; no
docs, just code comments.

This plays nicely with the --single-transaction option in psql to allow
fast restores/upgrades.

YMMV but disk bound COPY will benefit greatly from this patch, some
tests showing 100% gain. COPY is still *very* CPU intensive, so some
tests have shown negligible benefit, fyi, but that isn't the typical
case.

Applies cleanly to CVS HEAD, passes make check.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

[ Attachment, skipping... ]


--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#34Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#29)
Re: [PATCHES] COPY with no WAL, in certain circumstances

On Sun, 2007-01-07 at 11:29 -0500, Tom Lane wrote:

I wrote:

... The active-portal kluge that you've just
mentioned is nothing but a kluge, proving that you thought of some cases
where it would fail. But I doubt you thought of everything.

New patch submitted to -patches on a different thread.

...continuing this discussion about setting HEAP_XMIN_COMMITTED...

BTW, a sufficient counterexample for that kluge is that neither SPI nor
SQL-function execution uses a separate portal for invoked commands. Thus
testing whether there's only one active portal isn't sufficient to prove
that you're not inside a function executing in serializable mode, and
thus it could have a transaction snapshot predating the COPY.

What would the best/acceptable way be to test for this condition?

Using
if (IsXactIsoLevelSerializable)
would not be a very tight condition, but at least it would avoid putting
additional status flags into every transaction, just to test for this
case in COPY statements.

ISTM unlikely that people would try to use COPY in Serializable mode;
what do people think?
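
For what it's worth, a sketch of how that guard might sit in copy.c; only
the IsXactIsoLevelSerializable macro is existing code, the flag name is
illustrative, and the other preconditions (table created or truncated in
this transaction, WAL being skipped) are assumed to have been checked
already:

/* may COPY pre-set HEAP_XMIN_COMMITTED on the rows it loads? (illustrative) */
bool        preset_commit_bits = true;

if (IsXactIsoLevelSerializable)
    preset_commit_bits = false; /* an older snapshot might see the rows too early */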

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#35Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#34)
Re: [PATCHES] COPY with no WAL, in certain circumstances

"Simon Riggs" <simon@2ndquadrant.com> writes:

...continuing this discussion about setting HEAP_XMIN_COMMITTED...

BTW, a sufficient counterexample for that kluge is that neither SPI nor
SQL-function execution uses a separate portal for invoked commands.

What would the best/acceptable way be to test for this condition?

My opinion is that the only reliable answer would be to find a way not
to have to test. Tuples entered by your own transaction are normally
considered good by tqual.c anyway, and thus I think we might be pretty
close to having it Just Work, but you'd have to go through all the cases
in tqual.c and see if any don't work.

The other point is that to make such an optimization you have to
consider the subtransaction history. For WAL you only have to know that
the table will disappear if the top-level transaction fails, but to
pre-set commit bits requires being sure that the table will disappear
if the current subxact fails --- not the same thing at all.
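
To make that distinction concrete, a small sketch in relcache terms;
rd_createSubid is the existing field recording which subtransaction created
the relation, while the helper names are made up for illustration:

#include "access/xact.h"
#include "utils/rel.h"

/* WAL skip: enough that the file vanishes if the top-level xact aborts */
static bool
copy_can_skip_wal(Relation rel)
{
    return rel->rd_createSubid != InvalidSubTransactionId;
}

/* pre-set commit bits: the file must vanish if the *current* subxact aborts */
static bool
copy_can_preset_commit_bits(Relation rel)
{
    return rel->rd_createSubid == GetCurrentSubTransactionId();
}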

regards, tom lane

#36Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#35)
Re: [PATCHES] COPY with no WAL, in certain circumstances

On Tue, 2007-01-09 at 16:31 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

...continuing this discussion about setting HEAP_XMIN_COMMITTED...

BTW, a sufficient counterexample for that kluge is that neither SPI nor
SQL-function execution uses a separate portal for invoked commands.

What would the best/acceptable way be to test for this condition?

My opinion is that the only reliable answer would be to find a way not
to have to test. Tuples entered by your own transaction are normally
considered good by tqual.c anyway, and thus I think we might be pretty
close to having it Just Work, but you'd have to go through all the cases
in tqual.c and see if any don't work.

I agree we could get this to Just Work by altering
HeapTupleSatisfies...() functions so that their first test is

if (TransactionIdIsCurrentTransactionId(xvac))

rather than

if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))

I had ruled that out, unconsciously preferring the localised check in
COPY, but I agree that the test was too complex.

Taking this direct approach does have a lot of promise: it looks like
HeapTupleSatisfiesSnapshot() currently does 4 if-tests to check that an
undeleted row is visible, and all other paths do much more work.
Increasing the number of checks to 5 might not hurt that much, and
branch prediction should handle it well: for tuples inserted by the
current transaction the new test passes every time, and for everything
else it fails every time, so the CPU should predict the branch
accurately either way.

I'll run some tests and see if there's a noticeable difference.

The other point is that to make such an optimization you have to
consider the subtransaction history. For WAL you only have to know that
the table will disappear if the top-level transaction fails, but to
pre-set commit bits requires being sure that the table will disappear
if the current subxact fails --- not the same thing at all.

Right, you reminded me of that on the other part of the thread.

It seems straightforward to put a test into COPY so that the optimization
only applies when the current Xid is the one that made the relfilenode
change.
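
A sketch of such a test, assuming the relcache were additionally taught to
remember which subtransaction last assigned a new relfilenode (the
rd_newRelfilenodeSubid field below is that assumed addition; rd_createSubid
already exists and covers the CREATE TABLE case):

/* hypothetical guard in copy.c, per Tom's subtransaction point above */
if (rel->rd_createSubid == GetCurrentSubTransactionId() ||
    rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId())
    preset_commit_bits = true; /* rows vanish with the file if this subxact aborts */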

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#37Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#36)
Re: [PATCHES] COPY with no WAL, in certain circumstances

"Simon Riggs" <simon@2ndquadrant.com> writes:

I agree we could get this to Just Work by altering
HeapTupleSatisfies...() functions so that their first test is
if (TransactionIdIsCurrentTransactionId(xvac))
rather than
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))

Huh? That doesn't make any sense at all. xvac wouldn't normally be
valid.

I don't want to put TransactionIdIsCurrentTransactionId into the main
path of the tqual routines if at all possible; it's not an especially
cheap test, particularly if deeply nested in subtransactions. My
thought was that for SatisfiesSnapshot, you'd fall through the first
big chunk if XMIN_COMMITTED is set and then fail the XidInSnapshot test.
Then a TransactionIdIsCurrentTransactionId test could be added in that
fairly-unusual failure path, where it wouldn't slow the main path of
control. Something like

if (XidInSnapshot(HeapTupleHeaderGetXmin(tuple), snapshot))
{
    if (TransactionIdIsCurrentTransactionId(HeapTupleHeaderGetXmin(tuple)))
    {
        if (HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid)
            return false;       /* inserted after scan started */
    }
    else
        return false;           /* treat as still in progress */
}

This would require rejiggering snapshots to include our own xid; see
the comment for XidInSnapshot. That would add some distributed cost to
executions of XidInSnapshot for recently-committed tuples, but it would
avoid adding any cycles to the path taken for tuples committed before
the xmin horizon, which is the normal case that has to be kept fast.
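
Purely as an illustration of that rejiggering, the sort of change it
implies inside GetSnapshotData(); the variable names are guesses at its
internals rather than a real patch:

/*
 * Besides the other backends' xids gathered above, also record our own
 * top-level xid, so that XidInSnapshot() later reports tuples this
 * backend inserted as "in snapshot" and control reaches the
 * TransactionIdIsCurrentTransactionId() branch sketched above.
 */
if (TransactionIdIsValid(MyProc->xid))
    snapshot->xip[count++] = MyProc->xid;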

Haven't looked at whether an equivalent hack is possible for the other
tqual routines.

regards, tom lane

#38Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#37)
Re: [PATCHES] COPY with no WAL, in certain circumstances

On Wed, 2007-01-10 at 10:37 -0500, Tom Lane wrote:

"Simon Riggs" <simon@2ndquadrant.com> writes:

I agree we could get this to Just Work by altering
HeapTupleSatisfies...() functions so that their first test is
if (TransactionIdIsCurrentTransactionId(xvac))

Oh? Sorry, I meant xmin not xvac at that point. Cut N Paste thinko.

rather than
if (!(tuple->t_infomask & HEAP_XMIN_COMMITTED))

Huh? That doesn't make any sense at all. xvac wouldn't normally be
valid.

I don't want to put TransactionIdIsCurrentTransactionId into the main
path of the tqual routines if at all possible; it's not an especially
cheap test, particularly if deeply nested in subtransactions.

Phew, well I'm relieved. Such a mainline change did make me nervous.

This would require rejiggering snapshots to include our own xid, see
comment for XidInSnapshot. That would add some distributed cost to
executions of XidInSnapshot for recently-committed tuples, but it would
avoid adding any cycles to the path taken for tuples committed before
the xmin horizon, which is the normal case that has to be kept fast.

Haven't looked at whether an equivalent hack is possible for the other
tqual routines.

Will check, thanks.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

#39Jim C. Nasby
jim@nasby.net
In reply to: Tom Lane (#17)
Re: [HACKERS] [PATCHES] COPY with no WAL, in certain circumstances

On Sat, Jan 06, 2007 at 09:20:53PM -0500, Tom Lane wrote:

Bruce Momjian <bruce@momjian.us> writes:

Simon Riggs wrote:

Reason for no documentation was that CREATE INDEX and CREATE TABLE AS
SELECT already use this optimisation, but to my knowledge neither was/is
documented on those command pages.

I wasn't aware those used the optimization. Seems they all should be
documented somewhere.

We don't document every single optimization in the system ... if we did,
the docs would be as big as the source code and equally unreadable by
non-programmers. I think it's a much better idea just to mention it in one
place and not try to enumerate exactly which commands have the optimization.

I think it would be reasonable to refer to the 'tuning page' from the
appropriate pages in the documentation... I'm thinking of something
similar to the "SEE ALSO" section of man pages.

The big complaint that I have (and have heard) about the docs is that
it's very hard to find something unless you know exactly what it is
you're looking for. If you don't know that there are performance
shortcuts associated with CREATE INDEX, you're unlikely to find out
about them.
--
Jim Nasby jim@nasby.net
EnterpriseDB http://enterprisedb.com 512.569.9461 (cell)