WAL bypass for CTAS

Started by Simon Riggsover 20 years ago14 messages
#1Simon Riggs
simon@2ndquadrant.com
1 attachment(s)

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

This contains all the lower level machinery required to do the same
thing for COPY, as discussed on hackers. The machinery includes some
additional freespace thinkery, aimed mainly at the forthcoming COPY
patch, which solely needs to be integrated with Alon's work.

Patch is diff -c format, compiles and make checks on cvstip as of now.

No performance tests *on this patch*, though the general principle has
already been proven via a similar prototype patch not published on list.

Best Regards, Simon Riggs

Attachments:

ctas.patchtext/x-patch; charset=UTF-8; name=ctas.patchDownload
Index: src/backend/access/heap/heapam.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/heap/heapam.c,v
retrieving revision 1.191
diff -c -c -r1.191 heapam.c
*** src/backend/access/heap/heapam.c	19 May 2005 21:35:44 -0000	1.191
--- src/backend/access/heap/heapam.c	3 Jun 2005 21:58:58 -0000
***************
*** 1036,1042 ****
   * command ID.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid)
  {
  	TransactionId xid = GetCurrentTransactionId();
  	Buffer		buffer;
--- 1036,1042 ----
   * command ID.
   */
  Oid
! heap_insert(Relation relation, HeapTuple tup, CommandId cid, bool use_wal)
  {
  	TransactionId xid = GetCurrentTransactionId();
  	Buffer		buffer;
***************
*** 1086,1092 ****
  		heap_tuple_toast_attrs(relation, tup, NULL);
  
  	/* Find buffer to insert this tuple into */
! 	buffer = RelationGetBufferForTuple(relation, tup->t_len, InvalidBuffer);
  
  	/* NO EREPORT(ERROR) from here till changes are logged */
  	START_CRIT_SECTION();
--- 1086,1093 ----
  		heap_tuple_toast_attrs(relation, tup, NULL);
  
  	/* Find buffer to insert this tuple into */
! 	buffer = RelationGetBufferForTuple(relation, tup->t_len, 
!                                         InvalidBuffer, use_wal);
  
  	/* NO EREPORT(ERROR) from here till changes are logged */
  	START_CRIT_SECTION();
***************
*** 1096,1102 ****
  	pgstat_count_heap_insert(&relation->pgstat_info);
  
  	/* XLOG stuff */
! 	if (!relation->rd_istemp)
  	{
  		xl_heap_insert xlrec;
  		xl_heap_header xlhdr;
--- 1097,1103 ----
  	pgstat_count_heap_insert(&relation->pgstat_info);
  
  	/* XLOG stuff */
! 	if (!relation->rd_istemp && use_wal)
  	{
  		xl_heap_insert xlrec;
  		xl_heap_header xlhdr;
***************
*** 1181,1187 ****
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
! 	return heap_insert(relation, tup, GetCurrentCommandId());
  }
  
  /*
--- 1182,1188 ----
  Oid
  simple_heap_insert(Relation relation, HeapTuple tup)
  {
! 	return heap_insert(relation, tup, GetCurrentCommandId(), true);
  }
  
  /*
***************
*** 1740,1746 ****
  		{
  			/* Assume there's no chance to put newtup on same page. */
  			newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
! 											   buffer);
  		}
  		else
  		{
--- 1741,1747 ----
  		{
  			/* Assume there's no chance to put newtup on same page. */
  			newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
! 											   buffer, true);
  		}
  		else
  		{
***************
*** 1757,1763 ****
  				 */
  				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
  				newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
! 												   buffer);
  			}
  			else
  			{
--- 1758,1764 ----
  				 */
  				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
  				newbuf = RelationGetBufferForTuple(relation, newtup->t_len,
! 												   buffer, true);
  			}
  			else
  			{
Index: src/backend/access/heap/hio.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/heap/hio.c,v
retrieving revision 1.56
diff -c -c -r1.56 hio.c
*** src/backend/access/heap/hio.c	7 May 2005 21:32:23 -0000	1.56
--- src/backend/access/heap/hio.c	3 Jun 2005 21:58:58 -0000
***************
*** 84,90 ****
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer)
  {
  	Buffer		buffer = InvalidBuffer;
  	Page		pageHeader;
--- 84,90 ----
   */
  Buffer
  RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer, bool use_wal)
  {
  	Buffer		buffer = InvalidBuffer;
  	Page		pageHeader;
***************
*** 121,150 ****
  	 * on each page that proves not to be suitable.)  If the FSM has no
  	 * record of a page with enough free space, we give up and extend the
  	 * relation.
! 	 */
  
  	targetBlock = relation->rd_targblock;
  
  	if (targetBlock == InvalidBlockNumber)
  	{
! 		/*
! 		 * We have no cached target page, so ask the FSM for an initial
! 		 * target.
! 		 */
! 		targetBlock = GetPageWithFreeSpace(&relation->rd_node, len);
! 
! 		/*
! 		 * If the FSM knows nothing of the rel, try the last page before
! 		 * we give up and extend.  This avoids one-tuple-per-page syndrome
! 		 * during bootstrapping or in a recently-started system.
! 		 */
! 		if (targetBlock == InvalidBlockNumber)
! 		{
! 			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
! 
! 			if (nblocks > 0)
! 				targetBlock = nblocks - 1;
! 		}
  	}
  
  	while (targetBlock != InvalidBlockNumber)
--- 121,170 ----
  	 * on each page that proves not to be suitable.)  If the FSM has no
  	 * record of a page with enough free space, we give up and extend the
  	 * relation.
!      *
!      * If we have chosen to to not use_wal we must be careful to write only
!      * to pages that we ourselves have created during the execution of the
!      * current command. As a result we must avoid FSM completely, as well as
!      * avoiding using the last page (which by definition somebody else wrote).
!      * We must also assume that the relation->rd_targblock cache was
!      * cleared (or at least, never set) before we issue the first call of this
!      * routine, since we still need to use this cache for subsequent calls
!      * even when we chose to not use_wal.
!      * Since we avoid the FSM, any blocks extended are private to
!      * ourselves, so we still need to accumulate certain FSM info for when
!      * we complete the command. see freespace.c for description of that info
!      *
!      * XXX This is arguably not appropriate for new relations when fillfactor
!      * is set at 100% (the current case).
!      */
  
  	targetBlock = relation->rd_targblock;
  
  	if (targetBlock == InvalidBlockNumber)
  	{
!         if (use_wal)
!         {
!     		/*
!     		 * We have no cached target page, so ask the FSM for an initial
!     		 * target.
!     		 */
!     		targetBlock = GetPageWithFreeSpace(&relation->rd_node, len);  
! 
!     		/*
!     		 * If the FSM knows nothing of the rel, try the last page before
!     		 * we give up and extend.  This avoids one-tuple-per-page syndrome
!     		 * during bootstrapping or in a recently-started system.
!     		 */
!     		if (targetBlock == InvalidBlockNumber)
!     		{
!     			BlockNumber nblocks = RelationGetNumberOfBlocks(relation);   
! 
!     			if (nblocks > 0)
!     				targetBlock = nblocks - 1;
!     		}
!         }
!         else
!             AccumulateLocalFreeSpaceInfo(len);    
  	}
  
  	while (targetBlock != InvalidBlockNumber)
***************
*** 209,222 ****
  			ReleaseBuffer(buffer);
  		}
  
! 		/*
! 		 * Update FSM as to condition of this page, and ask for another
! 		 * page to try.
! 		 */
! 		targetBlock = RecordAndGetPageWithFreeSpace(&relation->rd_node,
  													targetBlock,
  													pageFreeSpace,
! 													len);
  	}
  
  	/*
--- 229,246 ----
  			ReleaseBuffer(buffer);
  		}
  
! 
!         if (use_wal)
!         {
!     		/*
!     		 * Update FSM as to condition of this page, and ask for another
!     		 * page to try.
!     		 */
!     		targetBlock = RecordAndGetPageWithFreeSpace(&relation->rd_node,
  													targetBlock,
  													pageFreeSpace,
!     												len);
!         }
  	}
  
  	/*
***************
*** 226,233 ****
  	 * the same time, else we will both try to initialize the same new
  	 * page.  We can skip locking for new or temp relations, however,
  	 * since no one else could be accessing them.
  	 */
! 	needLock = !RELATION_IS_LOCAL(relation);
  
  	if (needLock)
  		LockRelationForExtension(relation, ExclusiveLock);
--- 250,259 ----
  	 * the same time, else we will both try to initialize the same new
  	 * page.  We can skip locking for new or temp relations, however,
  	 * since no one else could be accessing them.
+      * We also skip locking for !use_wal, since we have an ExclusiveLock
+      * on the whole relation
  	 */
! 	needLock = !RELATION_IS_LOCAL(relation) && use_wal;
  
  	if (needLock)
  		LockRelationForExtension(relation, ExclusiveLock);
Index: src/backend/executor/execMain.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/executor/execMain.c,v
retrieving revision 1.249
diff -c -c -r1.249 execMain.c
*** src/backend/executor/execMain.c	22 May 2005 22:30:19 -0000	1.249
--- src/backend/executor/execMain.c	3 Jun 2005 21:59:01 -0000
***************
*** 33,38 ****
--- 33,39 ----
  #include "postgres.h"
  
  #include "access/heapam.h"
+ #include "access/xlog.h"
  #include "catalog/heap.h"
  #include "catalog/namespace.h"
  #include "commands/tablecmds.h"
***************
*** 44,49 ****
--- 45,52 ----
  #include "optimizer/clauses.h"
  #include "optimizer/var.h"
  #include "parser/parsetree.h"
+ #include "storage/freespace.h"
+ #include "storage/smgr.h"
  #include "utils/acl.h"
  #include "utils/guc.h"
  #include "utils/lsyscache.h"
***************
*** 784,792 ****
--- 787,798 ----
  		 * And open the constructed table for writing.
  		 */
  		intoRelationDesc = heap_open(intoRelationId, AccessExclusiveLock);
+         if (!XLogArchivingActive())
+             InitLocalFreeSpaceInfo();
  	}
  
  	estate->es_into_relation_descriptor = intoRelationDesc;
+     estate->es_into_relation_use_wal = !XLogArchivingActive();
  
  	queryDesc->tupDesc = tupType;
  	queryDesc->planstate = planstate;
***************
*** 979,985 ****
--- 985,1009 ----
  	 * close the "into" relation if necessary, again keeping lock
  	 */
  	if (estate->es_into_relation_descriptor != NULL)
+     {
+         if (estate->es_into_relation_use_wal)
+         {
+ 
+             Relation rel_use_wal = estate->es_into_relation_descriptor;
+     	    /*
+     	     * We must fsync the relation down to disk before it's
+     	     * safe to commit the transaction.
+              */
+             smgrimmedsync(rel_use_wal->rd_smgr);
+ 
+             /* 
+              * Update the FSM info for the relation. We only do this 
+              * here because we bypassed the FSM earlier.
+              */
+             RecordFreeSpaceInfo(&rel_use_wal->rd_node);
+         }
  		heap_close(estate->es_into_relation_descriptor, NoLock);
+     }
  
  	/*
  	 * close any relations selected FOR UPDATE/FOR SHARE, again keeping locks
***************
*** 1307,1313 ****
  
  		tuple = ExecCopySlotTuple(slot);
  		heap_insert(estate->es_into_relation_descriptor, tuple,
! 					estate->es_snapshot->curcid);
  		/* we know there are no indexes to update */
  		heap_freetuple(tuple);
  		IncrAppended();
--- 1331,1338 ----
  
  		tuple = ExecCopySlotTuple(slot);
  		heap_insert(estate->es_into_relation_descriptor, tuple,
! 					estate->es_snapshot->curcid,
!                     estate->es_into_relation_use_wal);
  		/* we know there are no indexes to update */
  		heap_freetuple(tuple);
  		IncrAppended();
***************
*** 1386,1392 ****
  	 * insert the tuple
  	 */
  	newId = heap_insert(resultRelationDesc, tuple,
! 						estate->es_snapshot->curcid);
  
  	IncrAppended();
  	(estate->es_processed)++;
--- 1411,1417 ----
  	 * insert the tuple
  	 */
  	newId = heap_insert(resultRelationDesc, tuple,
! 						estate->es_snapshot->curcid, true);
  
  	IncrAppended();
  	(estate->es_processed)++;
Index: src/backend/storage/freespace/freespace.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/storage/freespace/freespace.c,v
retrieving revision 1.45
diff -c -c -r1.45 freespace.c
*** src/backend/storage/freespace/freespace.c	29 May 2005 04:23:04 -0000	1.45
--- src/backend/storage/freespace/freespace.c	3 Jun 2005 21:59:03 -0000
***************
*** 248,253 ****
--- 248,254 ----
  static int	fsm_current_chunks(FSMRelation *fsmrel);
  static int	fsm_current_allocation(FSMRelation *fsmrel);
  
+ static int local_avgRequest = 0;
  
  /*
   * Exported routines
***************
*** 391,396 ****
--- 392,436 ----
  	return freepage;
  }
  
+ void
+ InitLocalFreeSpaceInfo(void)
+ {
+     local_avgRequest = 0;
+ }
+ 
+ void
+ AccumulateLocalFreeSpaceInfo(Size spaceNeeded)
+ {
+ 	/*
+ 	 * Update the moving average of space requests.  This code implements
+ 	 * an exponential moving average with an equivalent period of about 63
+ 	 * requests.  Ignore silly requests, however, to ensure that the
+ 	 * average stays sane.
+ 	 */
+ 	if (spaceNeeded > 0 && spaceNeeded < BLCKSZ)
+ 		local_avgRequest += ((int) spaceNeeded - local_avgRequest) / 32;
+ }
+ 
+ void
+ RecordFreeSpaceInfo(RelFileNode *rel)
+ {
+ 	FSMRelation *fsmrel;
+ 
+ 	LWLockAcquire(FreeSpaceLock, LW_EXCLUSIVE);
+ 
+ 	/*
+ 	 * Add rel to the hashtable to ensure it is known
+ 	 */
+ 	fsmrel = create_fsm_rel(rel);
+ 
+ 	fsmrel->avgRequest = (Size) local_avgRequest;
+ 
+ 	LWLockRelease(FreeSpaceLock);
+ 
+     /* Reset local info */
+     InitLocalFreeSpaceInfo();
+ }
+ 
  /*
   * RecordAndGetPageWithFreeSpace - update info about a page and try again.
   *
Index: src/include/access/heapam.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/heapam.h,v
retrieving revision 1.100
diff -c -c -r1.100 heapam.h
*** src/include/access/heapam.h	28 Apr 2005 21:47:16 -0000	1.100
--- src/include/access/heapam.h	3 Jun 2005 21:59:05 -0000
***************
*** 156,162 ****
  					ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);
  
! extern Oid	heap_insert(Relation relation, HeapTuple tup, CommandId cid);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid,
  			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
--- 156,163 ----
  					ItemPointer tid);
  extern void setLastTid(const ItemPointer tid);
  
! extern Oid	heap_insert(Relation relation, HeapTuple tup, 
!                             CommandId cid, bool use_wal);
  extern HTSU_Result heap_delete(Relation relation, ItemPointer tid, ItemPointer ctid,
  			CommandId cid, Snapshot crosscheck, bool wait);
  extern HTSU_Result heap_update(Relation relation, ItemPointer otid, HeapTuple tup,
Index: src/include/access/hio.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/hio.h,v
retrieving revision 1.27
diff -c -c -r1.27 hio.h
*** src/include/access/hio.h	31 Dec 2004 22:03:21 -0000	1.27
--- src/include/access/hio.h	3 Jun 2005 21:59:05 -0000
***************
*** 19,24 ****
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
  					 HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer);
  
  #endif   /* HIO_H */
--- 19,24 ----
  extern void RelationPutHeapTuple(Relation relation, Buffer buffer,
  					 HeapTuple tuple);
  extern Buffer RelationGetBufferForTuple(Relation relation, Size len,
! 						  Buffer otherBuffer, bool use_wal);
  
  #endif   /* HIO_H */
Index: src/include/nodes/execnodes.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/nodes/execnodes.h,v
retrieving revision 1.133
diff -c -c -r1.133 execnodes.h
*** src/include/nodes/execnodes.h	14 May 2005 21:29:23 -0000	1.133
--- src/include/nodes/execnodes.h	3 Jun 2005 21:59:08 -0000
***************
*** 305,310 ****
--- 305,311 ----
  												 * elt */
  	JunkFilter *es_junkFilter;	/* currently active junk filter */
  	Relation	es_into_relation_descriptor;	/* for SELECT INTO */
+     bool        es_into_relation_use_wal;
  
  	/* Parameter info: */
  	ParamListInfo es_param_list_info;	/* values of external params */
Index: src/include/storage/freespace.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/storage/freespace.h,v
retrieving revision 1.17
diff -c -c -r1.17 freespace.h
*** src/include/storage/freespace.h	31 Dec 2004 22:03:42 -0000	1.17
--- src/include/storage/freespace.h	3 Jun 2005 21:59:08 -0000
***************
*** 63,68 ****
--- 63,72 ----
  extern void DumpFreeSpaceMap(int code, Datum arg);
  extern void LoadFreeSpaceMap(void);
  
+ extern void InitLocalFreeSpaceInfo(void);
+ extern void AccumulateLocalFreeSpaceInfo(Size spaceNeeded);
+ extern void RecordFreeSpaceInfo(RelFileNode *rel);
+ 
  #ifdef FREESPACE_DEBUG
  extern void DumpFreeSpace(void);
  #endif
#2Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#1)
Re: WAL bypass for CTAS

Simon Riggs wrote:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database? I would _love_ to see
pg_dump loads use this automatically, without having to add clauses to
pg_dump output.

I think we decided we can't do it automatically for all zero-row COPYs
because of locking concerns.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#3Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#2)
Re: WAL bypass for CTAS

Bruce Momjian wrote:

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

I would _love_ to see pg_dump loads use this automatically, without
having to add clauses to pg_dump output.

What's wrong with adding clauses to the pg_dump output?

-Neil

#4Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#3)
Re: WAL bypass for CTAS

Neil Conway wrote:

Bruce Momjian wrote:

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

I would _love_ to see pg_dump loads use this automatically, without
having to add clauses to pg_dump output.

What's wrong with adding clauses to the pg_dump output?

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#5Neil Conway
neilc@samurai.com
In reply to: Bruce Momjian (#4)
Re: WAL bypass for CTAS

Bruce Momjian wrote:

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

-Neil

#6Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Neil Conway (#5)
Re: [PATCHES] WAL bypass for CTAS

Neil Conway wrote:

Bruce Momjian wrote:

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

Right, usually the schema changes are not backward compatibible, but the
COPY commands are. But now that I look at this example:

COPY test (x) FROM stdin;
1
\.

The column name "(x)" actually broke backward compatibility when we
added it, so yea, we could add a new option now too. No one complained
when we added the column names, so another option would be fine.

I suppose no one would like adding an option to turn off locking during
COPY, so the non-WAL logging would become the default? (Just asking.
You know me, I like automatic.)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#7Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#6)
Re: [PATCHES] WAL bypass for CTAS

Bruce Momjian <pgman@candle.pha.pa.us> writes:

I suppose no one would like adding an option to turn off locking during
COPY, so the non-WAL logging would become the default?

When and if we add LOCK or some equivalent option to COPY, I'm sure
we'll change pg_dump to specify that option in its output. But trying
to get that behavior by default for existing dumps seems to me to be
far more dangerous than it's worth. Not every performance improvement
has to automatically apply to existing dumps...

regards, tom lane

#8Russell Smith
mr-russ@pws.com.au
In reply to: Neil Conway (#5)
Re: WAL bypass for CTAS

On Sun, 5 Jun 2005 10:29 am, Neil Conway wrote:

Bruce Momjian wrote:

Well, it isn't going to help us for 8.1 because 8.0 will not have it,
and if we add the clause we make loading the data into previous releases
harder.

pg_dump output in general is not compatible with prior releases. It
would be a nice feature to have, but until we have it, I don't see that
changing or not changing the COPY syntax will make a major difference to
dump backward compatibility.

Don't we usually suggest using the new pg_dump to dump the old database anyway?

If that's the case, then we just add the locking options in there. Otherwise, yes you are
stuck with the original locking mechanism. But if people are smart and want faster loading
they will play with sed and friends to make it work.

Even if people for 8.1 just get the supposed 500% speed increase because of a better parser,
lots of people will be happy.

Regards

Russell Smith

#9Simon Riggs
simon@2ndquadrant.com
In reply to: Neil Conway (#3)
Re: WAL bypass for CTAS

On Sun, 2005-06-05 at 10:20 +1000, Neil Conway wrote:

Bruce Momjian wrote:

Could we do your NOLOGGING automatically in COPY if we test to see if
anyone else is connected to our current database?

Remember that this patch doe NOT yet handle COPY, but that is planned...

That seems pretty fragile -- what happens if someone connects after the
COPY has started? Considering that many COPY operations can take many
minutes or hours, I don't think it is wise to make assumptions based on
the initial state of the system.

Agreed.

Best Regards, Simon Riggs

#10Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Simon Riggs (#1)
Re: WAL bypass for CTAS

Tom has applied this patch. Thanks.

---------------------------------------------------------------------------

Simon Riggs wrote:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

There is no user interface for this. The speed gain is automatic, when
archiving is not enabled.

This contains all the lower level machinery required to do the same
thing for COPY, as discussed on hackers. The machinery includes some
additional freespace thinkery, aimed mainly at the forthcoming COPY
patch, which solely needs to be integrated with Alon's work.

Patch is diff -c format, compiles and make checks on cvstip as of now.

No performance tests *on this patch*, though the general principle has
already been proven via a similar prototype patch not published on list.

Best Regards, Simon Riggs

[ Attachment, skipping... ]

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
(send "unregister YourEmailAddressHere" to majordomo@postgresql.org)

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Simon Riggs (#1)
Re: WAL bypass for CTAS

Simon Riggs <simon@2ndquadrant.com> writes:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

Applied after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(, and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries. It will
be a very real hazard for COPY however. The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones. To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit. We can just omit updating the FSM's
running average, if it even has one. (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

Patch as applied is attached.

regards, tom lane

#12Simon Riggs
simon@2ndquadrant.com
In reply to: Tom Lane (#11)
Re: WAL bypass for CTAS

On Mon, 2005-06-20 at 14:50 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

I enclose a complete patch for avoiding WAL usage for CREATE TABLE AS
SELECT, when not in archive mode (PITR). The main use case for this is
large BI environments that create summary tables or prejoined tables,
though there are many general applications.

Applied

Thanks

after heavy corrections --- there were a number of things wrong
with this "simple" patch, starting with having gotten the tests
backwards :-(

Sorry, I thought I had corrected that error before submission. I was
aware that I had made that error earlier.

and extending to not having actually flushed the data
before commit (smgrimmedsync isn't enough, you have to
FlushRelationBuffers).

I followed the logic as seen in nbtsort.c as you suggested. That code
doesn't perform a FlushRelationBuffers and it looks like I fooled myself
into thinking the CTAS/SELECT INTO case was also in local.

Perhaps we should be building CTAS/SELECT INTO in local buffers anyway?
It looks like we could save time by avoiding shared_buffers completely
and build up a whole page before writing it anywhere. (But thats a story
for another day).

Perhaps this is also related to metapage errors, since the metapage is
always the last page to be written?

A consideration we had all missed in the original discussions is that
if the transaction doesn't emit any WAL records at all,
RecordTransactionCommit will think that it need not WAL-log the
transaction commit, leading to the possibility that the commit is lost
even though all the data is preserved :-(

This is not a hazard for CREATE TABLE AS, since it will certainly have
emitted WAL records while creating the table's catalog entries. It will
be a very real hazard for COPY however.

OK, but I haven't written that patch yet!

The cleanest solution I can
think of is that the COPY code should emit a WAL record for the first
tuple copied in, but not for later ones. To this end, I separated the
"use_wal" and "use_fsm" aspects of what the patch was doing.

Not very clean, but will do as you suggest.

I didn't apply the freespace.c changes either; that struck me as a
serious kluge with no real benefit. We can just omit updating the FSM's
running average, if it even has one. (ISTM there's a reasonable
argument to be made that the tuple sizes during CREATE/COPY might not be
representative of later requests anyway.)

I was striving for completeness only. I was doubtful about that part of
the patch, but thought I'd add that rather than have you say I hadn't
thought about the FSM avg_request_size.

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Best Regards, Simon Riggs

#13Alvaro Herrera
alvherre@surnet.cl
In reply to: Simon Riggs (#12)
Re: WAL bypass for CTAS

On Mon, Jun 20, 2005 at 09:55:12PM +0100, Simon Riggs wrote:

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Is that a problem? If the pages don't fit in FSM, then maybe the system
is misconfigured anyway. The person running the DW should just increase
the FSM settings, which is hardly a costly thing because it uses so
little memory.

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"No renuncies a nada. No te aferres a nada."

#14Simon Riggs
simon@2ndquadrant.com
In reply to: Alvaro Herrera (#13)
Re: WAL bypass for CTAS

On Mon, 2005-06-20 at 17:09 -0400, Alvaro Herrera wrote:

On Mon, Jun 20, 2005 at 09:55:12PM +0100, Simon Riggs wrote:

I put those changes in mainly for COPY. If you don't make any request at
all to FSM then a relation never gets to the MRU relation FSM list. I
agree that it is not strictly necessary, but leaving it off would be a
change in behaviour, since COPY did previously cause the relation to get
to the MRU. That could be a problem, since a relation might not then be
allocated any FSM pages following a vacuum.

Is that a problem?

Not for me, but I wanted to explain the change in behaviour that
implies.

If the pages don't fit in FSM, then maybe the system
is misconfigured anyway. The person running the DW should just increase
the FSM settings, which is hardly a costly thing because it uses so
little memory.

If you aren't on the relation list you don't get any more pages than the
minimum. No matter how many fsm_pages you allocate. If fsm_pages covers
everything, then you are right, there is no problem.

Best Regards, Simon Riggs