Background writer process

Started by Jan Wieck about 22 years ago, 21 messages
#1 Jan Wieck
JanWieck@Yahoo.com
1 attachment(s)

The attached diff is another attempt for distributing the write IO.

It is a separate background process much like the checkpointer. Its
purpose is to keep the number of dirty blocks in the buffer cache at a
reasonable level and to try to ensure that the buffers returned by the
strategy for replacement are always clean. This current shot does it this way:

- get a list of all dirty blocks in strategy replacement order
- flush n percent of that list or a maximum of m buffers
  (whichever is smaller)
- issue a sync()
- sleep for x milliseconds

If there is nothing to do, it will sleep for 10 seconds before checking
again. It acquires the checkpoint lock during the flush, so it will
yield for a real checkpoint.
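
A minimal sketch of how the size of one round could be computed, following the clamping logic of BufferSync() in the attached patch. The function name pages_to_flush is hypothetical and does not appear in the patch; percent and maxpages stand in for bgwriter_percent and bgwriter_maxpages.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the patch): how many of num_dirty buffers
 * one background-writer round would flush.  Mirrors the patch's rule:
 * percent <= 0 means "flush everything" (the checkpoint case), and a
 * list of 10 or fewer dirty buffers is always flushed completely.
 */
int
pages_to_flush(int num_dirty, int percent, int maxpages)
{
	if (percent > 0 && num_dirty > 10)
	{
		assert(percent <= 100);
		num_dirty = (num_dirty * percent) / 100;
		if (maxpages > 0 && num_dirty > maxpages)
			num_dirty = maxpages;
	}
	return num_dirty;
}
```

With 1000 dirty buffers, percent = 50 and maxpages = 100, the round is capped at 100 buffers. With percent = -1 all 1000 are written, which is what the checkpoint path passes.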

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm for how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework of having a
separate process is what we probably want.
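
A rough sketch of what replacing sync() with per-file fsync() calls might look like. How the writer would track recently written files is not specified in this message, so the fds array and the name fsync_recent_files are assumptions made only for illustration.

```c
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper (not in the patch): fsync() each file descriptor
 * the writer touched in this round instead of calling the system-wide
 * sync().  Returns the number of descriptors for which fsync() failed.
 */
int
fsync_recent_files(const int *fds, size_t nfds)
{
	int		failures = 0;
	size_t	i;

	for (i = 0; i < nfds; i++)
	{
		if (fsync(fds[i]) != 0)
			failures++;
	}
	return failures;
}
```

Unlike sync(), this only forces out data belonging to the listed files, so unrelated dirty data elsewhere on the system is left to the kernel's own schedule.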

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me. #
#================================================== JanWieck@Yahoo.com #

Attachments:

bgwriter.v1.diff (text/plain)
Index: src/backend/bootstrap/bootstrap.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/bootstrap/bootstrap.c,v
retrieving revision 1.166
diff -c -r1.166 bootstrap.c
*** src/backend/bootstrap/bootstrap.c	2003/09/02 19:04:12	1.166
--- src/backend/bootstrap/bootstrap.c	2003/11/13 18:39:51
***************
*** 428,435 ****
  
  	BaseInit();
  
  	if (IsUnderPostmaster)
! 		InitDummyProcess();		/* needed to get LWLocks */
  
  	/*
  	 * XLOG operations
--- 428,447 ----
  
  	BaseInit();
  
+ 	/* needed to get LWLocks */
  	if (IsUnderPostmaster)
! 	{
! 		switch (xlogop)
! 		{
! 			case BS_XLOG_BGWRITER:
! 				InitDummyProcess(DUMMY_PROC_BGWRITER);	
! 				break;
! 		
! 			default:
! 				InitDummyProcess(DUMMY_PROC_DEFAULT);	
! 				break;
! 		}
! 	}
  
  	/*
  	 * XLOG operations
***************
*** 451,456 ****
--- 463,473 ----
  			CreateCheckPoint(false, false);
  			SetSavedRedoRecPtr();		/* pass redo ptr back to
  										 * postmaster */
+ 			proc_exit(0);		/* done */
+ 
+ 		case BS_XLOG_BGWRITER:
+ 			CreateDummyCaches();
+ 			BufferBackgroundWriter();
  			proc_exit(0);		/* done */
  
  		case BS_XLOG_STARTUP:
Index: src/backend/catalog/index.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/catalog/index.c,v
retrieving revision 1.221
diff -c -r1.221 index.c
*** src/backend/catalog/index.c	2003/11/12 21:15:48	1.221
--- src/backend/catalog/index.c	2003/11/13 16:19:07
***************
*** 1043,1049 ****
  		/* Send out shared cache inval if necessary */
  		if (!IsBootstrapProcessingMode())
  			CacheInvalidateHeapTuple(pg_class, tuple);
! 		BufferSync();
  	}
  	else if (dirty)
  	{
--- 1043,1049 ----
  		/* Send out shared cache inval if necessary */
  		if (!IsBootstrapProcessingMode())
  			CacheInvalidateHeapTuple(pg_class, tuple);
! 		BufferSync(-1, -1);
  	}
  	else if (dirty)
  	{
Index: src/backend/commands/dbcommands.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/commands/dbcommands.c,v
retrieving revision 1.126
diff -c -r1.126 dbcommands.c
*** src/backend/commands/dbcommands.c	2003/11/12 21:15:50	1.126
--- src/backend/commands/dbcommands.c	2003/11/13 16:19:07
***************
*** 317,323 ****
  	 * up-to-date for the copy.  (We really only need to flush buffers for
  	 * the source database...)
  	 */
! 	BufferSync();
  
  	/*
  	 * Close virtual file descriptors so the kernel has more available for
--- 317,323 ----
  	 * up-to-date for the copy.  (We really only need to flush buffers for
  	 * the source database...)
  	 */
! 	BufferSync(-1, -1);
  
  	/*
  	 * Close virtual file descriptors so the kernel has more available for
***************
*** 454,460 ****
  	 * will see the new database in pg_database right away.  (They'll see
  	 * an uncommitted tuple, but they don't care; see GetRawDatabaseInfo.)
  	 */
! 	BufferSync();
  }
  
  
--- 454,460 ----
  	 * will see the new database in pg_database right away.  (They'll see
  	 * an uncommitted tuple, but they don't care; see GetRawDatabaseInfo.)
  	 */
! 	BufferSync(-1, -1);
  }
  
  
***************
*** 591,597 ****
  	 * (They'll see an uncommitted deletion, but they don't care; see
  	 * GetRawDatabaseInfo.)
  	 */
! 	BufferSync();
  }
  
  
--- 591,597 ----
  	 * (They'll see an uncommitted deletion, but they don't care; see
  	 * GetRawDatabaseInfo.)
  	 */
! 	BufferSync(-1, -1);
  }
  
  
***************
*** 688,694 ****
  	 * see an uncommitted tuple, but they don't care; see
  	 * GetRawDatabaseInfo.)
  	 */
! 	BufferSync();
  }
  
  
--- 688,694 ----
  	 * see an uncommitted tuple, but they don't care; see
  	 * GetRawDatabaseInfo.)
  	 */
! 	BufferSync(-1, -1);
  }
  
  
Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/postmaster/postmaster.c,v
retrieving revision 1.348
diff -c -r1.348 postmaster.c
*** src/backend/postmaster/postmaster.c	2003/11/11 01:09:42	1.348
--- src/backend/postmaster/postmaster.c	2003/11/13 19:39:58
***************
*** 205,210 ****
--- 205,213 ----
  int			CheckPointTimeout = 300;
  int			CheckPointWarning = 30;
  time_t		LastSignalledCheckpoint = 0;
+ int			BgWriterDelay = 500;
+ int			BgWriterPercent = 0;
+ int			BgWriterMaxpages = 100;
  
  bool		log_hostname;		/* for ps display */
  bool		LogSourcePort;
***************
*** 224,230 ****
  /* Startup/shutdown state */
  static pid_t StartupPID = 0,
  			ShutdownPID = 0,
! 			CheckPointPID = 0;
  static time_t checkpointed = 0;
  
  #define			NoShutdown		0
--- 227,234 ----
  /* Startup/shutdown state */
  static pid_t StartupPID = 0,
  			ShutdownPID = 0,
! 			CheckPointPID = 0,
! 			BgWriterPID = 0;
  static time_t checkpointed = 0;
  
  #define			NoShutdown		0
***************
*** 298,303 ****
--- 302,308 ----
  
  #define StartupDataBase()		SSDataBase(BS_XLOG_STARTUP)
  #define CheckPointDataBase()	SSDataBase(BS_XLOG_CHECKPOINT)
+ #define StartBackgroundWriter()	SSDataBase(BS_XLOG_BGWRITER)
  #define ShutdownDataBase()		SSDataBase(BS_XLOG_SHUTDOWN)
  
  
***************
*** 1056,1061 ****
--- 1061,1077 ----
  		}
  
  		/*
+ 		 * If no background writer process is running and we should
+ 		 * do background writing, start one. It doesn't matter if
+ 		 * this fails, we'll just try again later.
+ 		 */
+ 		if (BgWriterPID == 0 && BgWriterPercent > 0 &&
+ 				Shutdown == NoShutdown && !FatalError && random_seed != 0)
+ 		{
+ 			BgWriterPID = StartBackgroundWriter();
+ 		}
+ 
+ 		/*
  		 * Wait for something to happen.
  		 */
  		memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set));
***************
*** 1478,1483 ****
--- 1494,1506 ----
  								 backendPID)));
  		return;
  	}
+ 	else if (backendPID == BgWriterPID)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("ignoring cancel request for bgwriter process %d",
+ 								 backendPID)));
+ 		return;
+ 	}
  	else if (ExecBackend)
  		AttachSharedMemoryAndSemaphores();
  
***************
*** 1660,1665 ****
--- 1683,1695 ----
  		SignalChildren(SIGHUP);
  		load_hba();
  		load_ident();
+ 
+ 		/*
+ 		 * Tell the background writer to terminate so that we
+ 		 * will start a new one with a possibly changed config
+ 		 */
+ 		if (BgWriterPID != 0)
+ 			kill(BgWriterPID, SIGTERM);
  	}
  
  	PG_SETMASK(&UnBlockSig);
***************
*** 1692,1697 ****
--- 1722,1729 ----
  			 *
  			 * Wait for children to end their work and ShutdownDataBase.
  			 */
+ 			if (BgWriterPID != 0)
+ 				kill(BgWriterPID, SIGTERM);
  			if (Shutdown >= SmartShutdown)
  				break;
  			Shutdown = SmartShutdown;
***************
*** 1724,1729 ****
--- 1756,1763 ----
  			 * abort all children with SIGTERM (rollback active transactions
  			 * and exit) and ShutdownDataBase when they are gone.
  			 */
+ 			if (BgWriterPID != 0)
+ 				kill(BgWriterPID, SIGTERM);
  			if (Shutdown >= FastShutdown)
  				break;
  			ereport(LOG,
***************
*** 1770,1775 ****
--- 1804,1811 ----
  			 * abort all children with SIGQUIT and exit without attempt to
  			 * properly shutdown data base system.
  			 */
+ 			if (BgWriterPID != 0)
+ 				kill(BgWriterPID, SIGQUIT);
  			ereport(LOG,
  					(errmsg("received immediate shutdown request")));
  			if (ShutdownPID > 0)
***************
*** 1877,1882 ****
--- 1913,1924 ----
  			CheckPointPID = 0;
  			checkpointed = time(NULL);
  
+ 			if (BgWriterPID == 0 && BgWriterPercent > 0 &&
+ 				Shutdown == NoShutdown && !FatalError && random_seed != 0)
+ 			{
+ 				BgWriterPID = StartBackgroundWriter();
+ 			}
+ 
  			/*
  			 * Go to shutdown mode if a shutdown request was pending.
  			 */
***************
*** 1983,1988 ****
--- 2025,2032 ----
  				GetSavedRedoRecPtr();
  			}
  		}
+ 		else if (pid == BgWriterPID)
+ 			BgWriterPID = 0;
  		else
  			pgstat_beterm(pid);
  
***************
*** 1996,2001 ****
--- 2040,2046 ----
  	{
  		LogChildExit(LOG,
  				 (pid == CheckPointPID) ? gettext("checkpoint process") :
+ 				 (pid == BgWriterPID) ? gettext("bgwriter process") :
  					 gettext("server process"),
  					 pid, exitstatus);
  		ereport(LOG,
***************
*** 2044,2049 ****
--- 2089,2098 ----
  		CheckPointPID = 0;
  		checkpointed = 0;
  	}
+ 	else if (pid == BgWriterPID)
+ 	{
+ 		BgWriterPID = 0;
+ 	}
  	else
  	{
  		/*
***************
*** 2754,2759 ****
--- 2803,2810 ----
  	}
  	if (CheckPointPID != 0)
  		cnt--;
+ 	if (BgWriterPID != 0)
+ 		cnt--;
  	return cnt;
  }
  
***************
*** 2827,2832 ****
--- 2878,2886 ----
  			case BS_XLOG_CHECKPOINT:
  				statmsg = "checkpoint subprocess";
  				break;
+ 			case BS_XLOG_BGWRITER:
+ 				statmsg = "bgwriter subprocess";
+ 				break;
  			case BS_XLOG_SHUTDOWN:
  				statmsg = "shutdown subprocess";
  				break;
***************
*** 2883,2888 ****
--- 2937,2946 ----
  				ereport(LOG,
  					  (errmsg("could not fork checkpoint process: %m")));
  				break;
+ 			case BS_XLOG_BGWRITER:
+ 				ereport(LOG,
+ 					  (errmsg("could not fork bgwriter process: %m")));
+ 				break;
  			case BS_XLOG_SHUTDOWN:
  				ereport(LOG,
  						(errmsg("could not fork shutdown process: %m")));
***************
*** 2895,2913 ****
  
  		/*
  		 * fork failure is fatal during startup/shutdown, but there's no
! 		 * need to choke if a routine checkpoint fails.
  		 */
  		if (xlop == BS_XLOG_CHECKPOINT)
  			return 0;
  		ExitPostmaster(1);
  	}
  
  	/*
  	 * The startup and shutdown processes are not considered normal
! 	 * backends, but the checkpoint process is.  Checkpoint must be added
! 	 * to the list of backends.
  	 */
! 	if (xlop == BS_XLOG_CHECKPOINT)
  	{
  		if (!(bn = (Backend *) malloc(sizeof(Backend))))
  		{
--- 2953,2974 ----
  
  		/*
  		 * fork failure is fatal during startup/shutdown, but there's no
! 		 * need to choke if a routine checkpoint or starting a background
! 		 * writer fails.
  		 */
  		if (xlop == BS_XLOG_CHECKPOINT)
  			return 0;
+ 		if (xlop == BS_XLOG_BGWRITER)
+ 			return 0;
  		ExitPostmaster(1);
  	}
  
  	/*
  	 * The startup and shutdown processes are not considered normal
! 	 * backends, but the checkpoint and bgwriter processes are.
! 	 * They must be added to the list of backends.
  	 */
! 	if (xlop == BS_XLOG_CHECKPOINT || xlop == BS_XLOG_BGWRITER)
  	{
  		if (!(bn = (Backend *) malloc(sizeof(Backend))))
  		{
Index: src/backend/storage/buffer/bufmgr.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/storage/buffer/bufmgr.c,v
retrieving revision 1.144
diff -c -r1.144 bufmgr.c
*** src/backend/storage/buffer/bufmgr.c	2003/11/13 14:57:15	1.144
--- src/backend/storage/buffer/bufmgr.c	2003/11/13 18:41:48
***************
*** 44,49 ****
--- 44,50 ----
  #include <sys/file.h>
  #include <math.h>
  #include <signal.h>
+ #include <unistd.h>
  
  #include "lib/stringinfo.h"
  #include "miscadmin.h"
***************
*** 679,688 ****
  /*
   * BufferSync -- Write all dirty buffers in the pool.
   *
!  * This is called at checkpoint time and writes out all dirty shared buffers.
   */
! void
! BufferSync(void)
  {
  	int			i;
  	BufferDesc *bufHdr;
--- 680,690 ----
  /*
   * BufferSync -- Write all dirty buffers in the pool.
   *
!  * This is called at checkpoint time and writes out all dirty shared buffers,
!  * and by the background writer process to write out some of the dirty blocks.
   */
! int
! BufferSync(int percent, int maxpages)
  {
  	int			i;
  	BufferDesc *bufHdr;
***************
*** 703,714 ****
  	 * have to wait until the next checkpoint.
  	 */
  	buffer_dirty = (int *)palloc(NBuffers * sizeof(int));
! 	num_buffer_dirty = 0;
! 
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  	num_buffer_dirty = StrategyDirtyBufferList(buffer_dirty, NBuffers);
  	LWLockRelease(BufMgrLock);
  
  	for (i = 0; i < num_buffer_dirty; i++)
  	{
  		Buffer		buffer;
--- 705,728 ----
  	 * have to wait until the next checkpoint.
  	 */
  	buffer_dirty = (int *)palloc(NBuffers * sizeof(int));
! 	
  	LWLockAcquire(BufMgrLock, LW_EXCLUSIVE);
  	num_buffer_dirty = StrategyDirtyBufferList(buffer_dirty, NBuffers);
  	LWLockRelease(BufMgrLock);
  
+ 	/*
+ 	 * If called by the background writer, we are usually asked to
+ 	 * only write out some percentage of dirty buffers now, to prevent
+ 	 * the IO storm at checkpoint time.
+ 	 */
+ 	if (percent > 0 && num_buffer_dirty > 10)
+ 	{
+ 		Assert(percent <= 100);
+ 		num_buffer_dirty = (num_buffer_dirty * percent) / 100;
+ 		if (maxpages > 0 && num_buffer_dirty > maxpages)
+ 			num_buffer_dirty = maxpages;
+ 	}
+ 
  	for (i = 0; i < num_buffer_dirty; i++)
  	{
  		Buffer		buffer;
***************
*** 854,859 ****
--- 868,875 ----
  
  	/* Pop the error context stack */
  	error_context_stack = errcontext.previous;
+ 
+ 	return num_buffer_dirty;
  }
  
  /*
***************
*** 984,991 ****
  void
  FlushBufferPool(void)
  {
! 	BufferSync();
  	smgrsync();
  }
  
  /*
--- 1000,1064 ----
  void
  FlushBufferPool(void)
  {
! 	BufferSync(-1, -1);
  	smgrsync();
+ }
+ 
+ void
+ BufferBackgroundWriter(void)
+ {
+ 	if (BgWriterPercent == 0)
+ 		return;
+ 
+ 	for (;;)
+ 	{
+ 		int n;
+ 
+ 		/*
+ 		 * Acquire a CheckpointLock to suspend background writing
+ 		 * while a real checkpoint is going on.
+ 		 */
+ 		while (!LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
+ 		{
+ 			if (InterruptPending)
+ 				return;
+ 			sleep(1);
+ 		}
+ 
+ 		/*
+ 		 * Call BufferSync() with instructions to keep just the
+ 		 * LRU heads clean.
+ 		 */
+ 		n = BufferSync(BgWriterPercent, BgWriterMaxpages);
+ 
+ 		/*
+ 		 * Release the CheckpointLock
+ 		 */
+ 		LWLockRelease(CheckpointLock);
+ 
+ 		/*
+ 		 * Whatever signal is sent to us, let's just die galantly. If
+ 		 * it wasn't meant that way, the postmaster will reincarnate us.
+ 		 */
+ 		if (InterruptPending)
+ 			return;
+ 
+ 		/*
+ 		 * If there was nothing to flush, sleep for 10 seconds. If there
+ 		 * was, pg_fsync() recently written files and nap.
+ 		 */
+ 		if (n > 0)
+ 		{
+ 			/*
+ 			 * TODO: This sync must be replaced with calls to
+ 			 *       pg_fdatasync() for recently written files.
+ 			 */
+ 			sync();
+ 			PG_DELAY(BgWriterDelay);
+ 		}
+ 		else
+ 			sleep(10);
+ 	}
  }
  
  /*
Index: src/backend/storage/buffer/freelist.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/storage/buffer/freelist.c,v
retrieving revision 1.34
diff -c -r1.34 freelist.c
*** src/backend/storage/buffer/freelist.c	2003/11/13 14:57:15	1.34
--- src/backend/storage/buffer/freelist.c	2003/11/13 19:45:32
***************
*** 190,197 ****
--- 190,217 ----
  		if (StrategyControl->stat_report + BufferStrategyStatInterval < now)
  		{
  			long	all_hit, b1_hit, t1_hit, t2_hit, b2_hit;
+ 			int		id, t1_clean, t2_clean;
  			ErrorContextCallback	*errcxtold;
  
+ 			id = StrategyControl->listHead[STRAT_LIST_T1];
+ 			t1_clean = 0;
+ 			while (id >= 0)
+ 			{
+ 				if (BufferDescriptors[StrategyCDB[id].buf_id].flags & BM_DIRTY)
+ 					break;
+ 				t1_clean++;
+ 				id = StrategyCDB[id].next;
+ 			}
+ 			id = StrategyControl->listHead[STRAT_LIST_T2];
+ 			t2_clean = 0;
+ 			while (id >= 0)
+ 			{
+ 				if (BufferDescriptors[StrategyCDB[id].buf_id].flags & BM_DIRTY)
+ 					break;
+ 				t2_clean++;
+ 				id = StrategyCDB[id].next;
+ 			}
+ 
  			if (StrategyControl->num_lookup == 0)
  			{
  				all_hit = b1_hit = t1_hit = t2_hit = b2_hit = 0;
***************
*** 215,220 ****
--- 235,242 ----
  					T1_TARGET, B1_LENGTH, T1_LENGTH, T2_LENGTH, B2_LENGTH);
  			elog(DEBUG1, "ARC total   =%4ld%% B1hit=%4ld%% T1hit=%4ld%% T2hit=%4ld%% B2hit=%4ld%%",
  					all_hit, b1_hit, t1_hit, t2_hit, b2_hit);
+ 			elog(DEBUG1, "ARC clean buffers at LRU       T1=   %5d T2=   %5d",
+ 					t1_clean, t2_clean);
  			error_context_stack = errcxtold;
  
  			StrategyControl->num_lookup = 0;
Index: src/backend/storage/lmgr/proc.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/storage/lmgr/proc.c,v
retrieving revision 1.136
diff -c -r1.136 proc.c
*** src/backend/storage/lmgr/proc.c	2003/10/16 20:59:35	1.136
--- src/backend/storage/lmgr/proc.c	2003/11/13 18:03:24
***************
*** 71,76 ****
--- 71,77 ----
  static PROC_HDR *ProcGlobal = NULL;
  
  static PGPROC *DummyProc = NULL;
+ static int	dummy_proc_type = -1;
  
  static bool waitingForLock = false;
  static bool waitingForSignal = false;
***************
*** 163,176 ****
  		 * processes, too.	This does not get linked into the freeProcs
  		 * list.
  		 */
! 		DummyProc = (PGPROC *) ShmemAlloc(sizeof(PGPROC));
  		if (!DummyProc)
  			ereport(FATAL,
  					(errcode(ERRCODE_OUT_OF_MEMORY),
  					 errmsg("out of shared memory")));
! 		MemSet(DummyProc, 0, sizeof(PGPROC));
! 		DummyProc->pid = 0;		/* marks DummyProc as not in use */
! 		PGSemaphoreCreate(&DummyProc->sem);
  
  		/* Create ProcStructLock spinlock, too */
  		ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
--- 164,180 ----
  		 * processes, too.	This does not get linked into the freeProcs
  		 * list.
  		 */
! 		DummyProc = (PGPROC *) ShmemAlloc(sizeof(PGPROC) * NUM_DUMMY_PROCS);
  		if (!DummyProc)
  			ereport(FATAL,
  					(errcode(ERRCODE_OUT_OF_MEMORY),
  					 errmsg("out of shared memory")));
! 		MemSet(DummyProc, 0, sizeof(PGPROC) * NUM_DUMMY_PROCS);
! 		for (i = 0; i < NUM_DUMMY_PROCS; i++)
! 		{
! 			DummyProc[i].pid = 0;		/* marks DummyProc as not in use */
! 			PGSemaphoreCreate(&(DummyProc[i].sem));
! 		}
  
  		/* Create ProcStructLock spinlock, too */
  		ProcStructLock = (slock_t *) ShmemAlloc(sizeof(slock_t));
***************
*** 270,277 ****
   * sema that are assigned are the extra ones created during InitProcGlobal.
   */
  void
! InitDummyProcess(void)
  {
  	/*
  	 * ProcGlobal should be set by a previous call to InitProcGlobal (we
  	 * inherit this by fork() from the postmaster).
--- 274,283 ----
   * sema that are assigned are the extra ones created during InitProcGlobal.
   */
  void
! InitDummyProcess(int proctype)
  {
+ 	PGPROC	*dummyproc;
+ 
  	/*
  	 * ProcGlobal should be set by a previous call to InitProcGlobal (we
  	 * inherit this by fork() from the postmaster).
***************
*** 282,293 ****
  	if (MyProc != NULL)
  		elog(ERROR, "you already exist");
  
  	/*
! 	 * DummyProc should not presently be in use by anyone else
  	 */
! 	if (DummyProc->pid != 0)
! 		elog(FATAL, "DummyProc is in use by PID %d", DummyProc->pid);
! 	MyProc = DummyProc;
  
  	/*
  	 * Initialize all fields of MyProc, except MyProc->sem which was set
--- 288,304 ----
  	if (MyProc != NULL)
  		elog(ERROR, "you already exist");
  
+ 	Assert(dummy_proc_type < 0);
+ 	dummy_proc_type = proctype;
+ 	dummyproc = &DummyProc[proctype];
+ 
  	/*
! 	 * dummyproc should not presently be in use by anyone else
  	 */
! 	if (dummyproc->pid != 0)
! 		elog(FATAL, "DummyProc[%d] is in use by PID %d",
! 				proctype, dummyproc->pid);
! 	MyProc = dummyproc;
  
  	/*
  	 * Initialize all fields of MyProc, except MyProc->sem which was set
***************
*** 310,316 ****
  	/*
  	 * Arrange to clean up at process exit.
  	 */
! 	on_shmem_exit(DummyProcKill, 0);
  
  	/*
  	 * We might be reusing a semaphore that belonged to a failed process.
--- 321,327 ----
  	/*
  	 * Arrange to clean up at process exit.
  	 */
! 	on_shmem_exit(DummyProcKill, proctype);
  
  	/*
  	 * We might be reusing a semaphore that belonged to a failed process.
***************
*** 446,453 ****
  static void
  DummyProcKill(void)
  {
! 	Assert(MyProc != NULL && MyProc == DummyProc);
  
  	/* Release any LW locks I am holding */
  	LWLockReleaseAll();
  
--- 457,470 ----
  static void
  DummyProcKill(void)
  {
! 	PGPROC	*dummyproc;
  
+ 	Assert(dummy_proc_type >= 0 && dummy_proc_type < NUM_DUMMY_PROCS);
+ 
+ 	dummyproc = &DummyProc[dummy_proc_type];
+ 
+ 	Assert(MyProc != NULL && MyProc == dummyproc);
+ 
  	/* Release any LW locks I am holding */
  	LWLockReleaseAll();
  
***************
*** 463,468 ****
--- 480,487 ----
  
  	/* PGPROC struct isn't mine anymore */
  	MyProc = NULL;
+ 
+ 	dummy_proc_type = -1;
  }
  
  
Index: src/backend/utils/misc/guc.c
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/utils/misc/guc.c,v
retrieving revision 1.169
diff -c -r1.169 guc.c
*** src/backend/utils/misc/guc.c	2003/11/13 14:57:15	1.169
--- src/backend/utils/misc/guc.c	2003/11/13 19:40:10
***************
*** 74,79 ****
--- 74,82 ----
  extern int	CommitSiblings;
  extern char *preload_libraries_string;
  extern int	BufferStrategyStatInterval;
+ extern int	BgWriterDelay;
+ extern int	BgWriterPercent;
+ extern int	BgWriterMaxpages;
  
  #ifdef HAVE_SYSLOG
  extern char *Syslog_facility;
***************
*** 1198,1203 ****
--- 1201,1233 ----
  		},
  		&BufferStrategyStatInterval,
  		0, 0, 600, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"bgwriter_delay", PGC_SIGHUP, RESOURCES,
+ 			gettext_noop("Background writer sleep time between rounds in milliseconds"),
+ 			NULL
+ 		},
+ 		&BgWriterDelay,
+ 		500, 10, 5000, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"bgwriter_percent", PGC_SIGHUP, RESOURCES,
+ 			gettext_noop("Background writer percentage of dirty buffers to flush per round"),
+ 			NULL
+ 		},
+ 		&BgWriterPercent,
+ 		0, 0, 100, NULL, NULL
+ 	},
+ 
+ 	{
+ 		{"bgwriter_maxpages", PGC_SIGHUP, RESOURCES,
+ 			gettext_noop("Background writer maximum number of pages to flush per round"),
+ 			NULL
+ 		},
+ 		&BgWriterMaxpages,
+ 		100, 1, 1000, NULL, NULL
  	},
  
  	/* End-of-list marker */
Index: src/backend/utils/misc/postgresql.conf.sample
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/backend/utils/misc/postgresql.conf.sample,v
retrieving revision 1.95
diff -c -r1.95 postgresql.conf.sample
*** src/backend/utils/misc/postgresql.conf.sample	2003/11/13 14:57:15	1.95
--- src/backend/utils/misc/postgresql.conf.sample	2003/11/13 21:20:03
***************
*** 60,65 ****
--- 60,70 ----
  #vacuum_mem = 8192		# min 1024, size in KB
  #buffer_strategy_status_interval = 0	# 0-600 seconds
  
+ # - Background writer -
+ #bgwriter_delay = 500		# 10-5000 milliseconds
+ #bgwriter_percent = 0		# 0-100% of dirty buffers
+ #bgwriter_maxpages = 100	# 1-1000 buffers max at once
+ 
  # - Free Space Map -
  
  #max_fsm_pages = 20000		# min max_fsm_relations*16, 6 bytes each
Index: src/include/bootstrap/bootstrap.h
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/bootstrap/bootstrap.h,v
retrieving revision 1.31
diff -c -r1.31 bootstrap.h
*** src/include/bootstrap/bootstrap.h	2003/08/04 02:40:10	1.31
--- src/include/bootstrap/bootstrap.h	2003/11/13 16:19:07
***************
*** 59,64 ****
  #define BS_XLOG_BOOTSTRAP	1
  #define BS_XLOG_STARTUP		2
  #define BS_XLOG_CHECKPOINT	3
! #define BS_XLOG_SHUTDOWN	4
  
  #endif   /* BOOTSTRAP_H */
--- 59,65 ----
  #define BS_XLOG_BOOTSTRAP	1
  #define BS_XLOG_STARTUP		2
  #define BS_XLOG_CHECKPOINT	3
! #define BS_XLOG_BGWRITER	4
! #define BS_XLOG_SHUTDOWN	5
  
  #endif   /* BOOTSTRAP_H */
Index: src/include/storage/bufmgr.h
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/storage/bufmgr.h,v
retrieving revision 1.70
diff -c -r1.70 bufmgr.h
*** src/include/storage/bufmgr.h	2003/08/10 19:48:08	1.70
--- src/include/storage/bufmgr.h	2003/11/13 17:12:15
***************
*** 37,42 ****
--- 37,47 ----
  extern DLLIMPORT Block *LocalBufferBlockPointers;
  extern long *LocalRefCount;
  
+ /* in postmaster.c ... they don't belong here */
+ extern int	BgWriterDelay;
+ extern int	BgWriterPercent;
+ extern int	BgWriterMaxpages;
+ 
  /* special pageno for bget */
  #define P_NEW	InvalidBlockNumber		/* grow the file to get a new page */
  
***************
*** 186,192 ****
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern void BufferSync(void);
  
  extern void InitLocalBuffer(void);
  
--- 191,198 ----
  extern void AbortBufferIO(void);
  
  extern void BufmgrCommit(void);
! extern int	BufferSync(int percent, int maxpages);
! extern void BufferBackgroundWriter(void);
  
  extern void InitLocalBuffer(void);
  
Index: src/include/storage/proc.h
===================================================================
RCS file: /home/pgsql/CvsRoot/pgsql-server/src/include/storage/proc.h,v
retrieving revision 1.64
diff -c -r1.64 proc.h
*** src/include/storage/proc.h	2003/08/04 02:40:15	1.64
--- src/include/storage/proc.h	2003/11/13 17:55:02
***************
*** 86,91 ****
--- 86,96 ----
  } PROC_HDR;
  
  
+ #define	DUMMY_PROC_DEFAULT	0
+ #define	DUMMY_PROC_BGWRITER	1
+ #define	NUM_DUMMY_PROCS		2
+ 
+ 
  /* configurable options */
  extern int	DeadlockTimeout;
  extern int	StatementTimeout;
***************
*** 97,103 ****
  extern int	ProcGlobalSemas(int maxBackends);
  extern void InitProcGlobal(int maxBackends);
  extern void InitProcess(void);
! extern void InitDummyProcess(void);
  extern void ProcReleaseLocks(bool isCommit);
  
  extern void ProcQueueInit(PROC_QUEUE *queue);
--- 102,108 ----
  extern int	ProcGlobalSemas(int maxBackends);
  extern void InitProcGlobal(int maxBackends);
  extern void InitProcess(void);
! extern void InitDummyProcess(int proctype);
  extern void ProcReleaseLocks(bool isCommit);
  
  extern void ProcQueueInit(PROC_QUEUE *queue);

#2 Kurt Roeckx
Q@ping.be
In reply to: Jan Wieck (#1)
Re: Background writer process

On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:

> For sure the sync() needs to be replaced by the discussed fsync() of
> recently written files. And I think the algorithm for how much and how often
> to flush can be significantly improved. But after all, this does not
> change the real checkpointing at all, and the general framework of having a
> separate process is what we probably want.

Why is the sync() needed at all? My understanding was that it
was only needed in case of a checkpoint.

Kurt

#3 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Kurt Roeckx (#2)
Re: Background writer process

Kurt Roeckx wrote:

> On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
>
> > For sure the sync() needs to be replaced by the discussed fsync() of
> > recently written files. And I think the algorithm for how much and how often
> > to flush can be significantly improved. But after all, this does not
> > change the real checkpointing at all, and the general framework of having a
> > separate process is what we probably want.
>
> Why is the sync() needed at all? My understanding was that it
> was only needed in case of a checkpoint.

He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

#4 Jan Wieck
JanWieck@Yahoo.com
In reply to: Bruce Momjian (#3)
Re: Background writer process

Bruce Momjian wrote:

> Kurt Roeckx wrote:
>
> > On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
> >
> > > For sure the sync() needs to be replaced by the discussed fsync() of
> > > recently written files. And I think the algorithm for how much and how often
> > > to flush can be significantly improved. But after all, this does not
> > > change the real checkpointing at all, and the general framework of having a
> > > separate process is what we probably want.
> >
> > Why is the sync() needed at all? My understanding was that it
> > was only needed in case of a checkpoint.
>
> He found that write() itself didn't encourage the kernel to write the
> buffers to disk fast enough. I think the final solution will be to use
> fsync or O_SYNC.

write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.

Jan

#5 Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jan Wieck (#4)
Re: Background writer process

Jan Wieck wrote:

> Bruce Momjian wrote:
>
> > Kurt Roeckx wrote:
> >
> > > On Thu, Nov 13, 2003 at 04:35:31PM -0500, Jan Wieck wrote:
> > >
> > > > For sure the sync() needs to be replaced by the discussed fsync() of
> > > > recently written files. And I think the algorithm for how much and how often
> > > > to flush can be significantly improved. But after all, this does not
> > > > change the real checkpointing at all, and the general framework of having a
> > > > separate process is what we probably want.
> > >
> > > Why is the sync() needed at all? My understanding was that it
> > > was only needed in case of a checkpoint.
> >
> > He found that write() itself didn't encourage the kernel to write the
> > buffers to disk fast enough. I think the final solution will be to use
> > fsync or O_SYNC.
>
> write() alone doesn't encourage the kernel to do any physical IO at all.
> As long as you have enough OS buffers, it does happy write caching until
> you checkpoint and sync(), and then the system freezes.

That's not completely true. Some kernels have trickle sync, meaning
they sync a little bit regularly rather than all at once, so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.

#6 Kurt Roeckx
Q@ping.be
In reply to: Bruce Momjian (#5)
Re: Background writer process

On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:

Jan Wieck wrote:

Bruce Momjian wrote:

He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.

write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.

That's not completely true. Some kernels do trickle sync, meaning
they sync a little bit regularly rather than all at once, so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.

So basically, on some kernels you want them to flush their dirty
buffers faster.

I have a feeling that how we ask the kernel not to keep data in memory
too long should depend more on the system, and that sync(), fsync() or
O_SYNC could be a fallback for when it's needed and there are no better
ways of doing it.

Maybe something like posix_fadvise() might be useful too on systems
that have it?

Kurt

#7Jan Wieck
JanWieck@Yahoo.com
In reply to: Kurt Roeckx (#6)
Re: Background writer process

Kurt Roeckx wrote:

On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:

Jan Wieck wrote:

Bruce Momjian wrote:

He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.

write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.

That's not completely true. Some kernels do trickle sync, meaning
they sync a little bit regularly rather than all at once, so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.

So basically, on some kernels you want them to flush their dirty
buffers faster.

I have a feeling that how we ask the kernel not to keep data in memory
too long should depend more on the system, and that sync(), fsync() or
O_SYNC could be a fallback for when it's needed and there are no better
ways of doing it.

Maybe something like posix_fadvise() might be useful too on systems
that have it?

That is all fine and, as I said, how often, how much and how forcefully we do
the IO can all be configurable and as flexible as people see fit. But
whether you use sync(), fsync(), fdatasync(), O_SYNC, O_DSYNC or
posix_fadvise(), somewhere you have to do the write(). And that write
has to be coordinated with the buffer cache replacement strategy so that
you write those buffers that are likely to be replaced soon, and don't
write those that the strategy expects to keep for longer anyway. Except
at a checkpoint, when you have to write whatever is dirty.

The patch I posted does this write() in coordination with the strategy
in a separate background process, so that the regular backends don't
have to write under normal circumstances (there are some places in DDL
statements that call BufferSync(), but those are exceptions IMHO). Can we agree
on this general outline? Or do we have any better proposals?
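
As a rough illustration of writing in coordination with the replacement
strategy, the selection step could look something like the sketch below.
This is not the code from the posted patch: BufDesc, pick_flush_count() and
flush_candidates() are hypothetical names, the array is assumed to be already
sorted by the strategy from "replaced soonest" to "kept longest", and rounding
the percentage up is an arbitrary choice. Clearing the dirty flag merely
stands in for the actual write().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical buffer descriptor; the real buffer headers look different. */
typedef struct {
    int id;
    int dirty;    /* nonzero if the buffer has unwritten changes */
} BufDesc;

/* Buffers to write in one round: percent of the dirty list, capped at max_bufs. */
size_t pick_flush_count(size_t ndirty, int percent, size_t max_bufs)
{
    size_t n;

    assert(percent >= 0 && percent <= 100);
    n = (ndirty * (size_t) percent + 99) / 100;    /* round up (assumption) */
    return n < max_bufs ? n : max_bufs;
}

/*
 * bufs[] is assumed to be in replacement order. The ids of the buffers
 * chosen for writing are stored in out[]; the return value is their count.
 * Clean buffers are skipped, and buffers beyond the limit stay dirty.
 */
size_t flush_candidates(BufDesc *bufs, size_t nbufs, int percent,
                        size_t max_bufs, int *out)
{
    size_t ndirty = 0, written = 0, limit, i;

    for (i = 0; i < nbufs; i++)
        if (bufs[i].dirty)
            ndirty++;

    limit = pick_flush_count(ndirty, percent, max_bufs);

    for (i = 0; i < nbufs && written < limit; i++) {
        if (!bufs[i].dirty)
            continue;
        out[written++] = bufs[i].id;
        bufs[i].dirty = 0;    /* stand-in for write() of this buffer */
    }
    return written;
}
```

A checkpoint would bypass the limit entirely and write every dirty buffer.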

Jan


#8Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Kurt Roeckx (#6)
Re: Background writer process

Kurt Roeckx wrote:

On Thu, Nov 13, 2003 at 05:39:32PM -0500, Bruce Momjian wrote:

Jan Wieck wrote:

Bruce Momjian wrote:

He found that write() itself didn't encourage the kernel to write the
buffers to disk fast enough. I think the final solution will be to use
fsync or O_SYNC.

write() alone doesn't encourage the kernel to do any physical IO at all.
As long as you have enough OS buffers, it does happy write caching until
you checkpoint and sync(), and then the system freezes.

That's not completely true. Some kernels do trickle sync, meaning
they sync a little bit regularly rather than all at once, so write() does
help get those shared buffers into the kernel for possible writing.
Also, it is possible the kernel will issue a sync() on its own.

So basically, on some kernels you want them to flush their dirty
buffers faster.

I have a feeling that how we ask the kernel not to keep data in memory
too long should depend more on the system, and that sync(), fsync() or
O_SYNC could be a fallback for when it's needed and there are no better
ways of doing it.

I think the final plan is to have a GUC variable that controls how the
kernel is _encouraged_ to write dirty buffers to disk.
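
A setting of this kind could be sketched as an enum plus a dispatch function.
Everything here is hypothetical, since no such variable has been defined in
this thread: the FlushMethod values and encourage_flush() are made-up names,
and the set of choices is only an assumption about what might be offered.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <unistd.h>

/* Hypothetical values for a "how to encourage the kernel" setting. */
typedef enum {
    FLUSH_NONE,        /* rely on the kernel's own write-back */
    FLUSH_FSYNC,       /* fsync() the file after writing */
    FLUSH_FDATASYNC    /* fdatasync() where the platform offers it */
} FlushMethod;

/* Apply the chosen method to one file. Returns 0 on success, -1 on error. */
int encourage_flush(int fd, FlushMethod method)
{
    assert(method >= FLUSH_NONE && method <= FLUSH_FDATASYNC);

    switch (method) {
    case FLUSH_NONE:
        return 0;
    case FLUSH_FSYNC:
        return fsync(fd);
    case FLUSH_FDATASYNC:
#if defined(_POSIX_SYNCHRONIZED_IO) && _POSIX_SYNCHRONIZED_IO > 0
        return fdatasync(fd);
#else
        return fsync(fd);    /* fall back where fdatasync() is missing */
#endif
    }
    return -1;
}
```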

#9Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Jan Wieck (#7)
Re: Background writer process

Jan Wieck wrote:

That is all fine and, as I said, how often, how much and how forcefully we do
the IO can all be configurable and as flexible as people see fit. But
whether you use sync(), fsync(), fdatasync(), O_SYNC, O_DSYNC or
posix_fadvise(), somewhere you have to do the write(). And that write
has to be coordinated with the buffer cache replacement strategy so that
you write those buffers that are likely to be replaced soon, and don't
write those that the strategy expects to keep for longer anyway. Except
at a checkpoint, when you have to write whatever is dirty.

The patch I posted does this write() in coordination with the strategy
in a separate background process, so that the regular backends don't
have to write under normal circumstances (there are some places in DDL
statements that call BufferSync(), but those are exceptions IMHO). Can we agree
on this general outline? Or do we have any better proposals?

Agreed. Background write() is a win on all OSes. It is just the
kernel to disk part we will have to have configurable, I think.

#10Shridhar Daithankar
shridhar_daithankar@myrealbox.com
In reply to: Jan Wieck (#1)
Re: Background writer process

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or may
not write it immediately, unless it is a hard sync.

Since postgresql can afford lazy writes for data files, I think this could
work.

Just a thought..

Shridhar

#11Jan Wieck
JanWieck@Yahoo.com
In reply to: Shridhar Daithankar (#10)
Re: Background writer process

Shridhar Daithankar wrote:

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or may
not write it immediately, unless it is a hard sync.

I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.

Since postgresql can afford lazy writes for data files, I think this could
work.

The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.

Jan


#12Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Shridhar Daithankar (#10)
Re: Background writer process

Shridhar Daithankar wrote:

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.

IMO, with fsync, we tell the kernel that it can write this buffer. It may or may
not write it immediately, unless it is a hard sync.

Since postgresql can afford lazy writes for data files, I think this could
work.

fsync() doesn't return until the data is on the disk. It doesn't
schedule the write then return, as far as I know. sync() does schedule
the writes, I think, which can be bad, but we delay a little to wait for
it to complete.
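
The difference can be made concrete with a small sketch. POSIX specifies that
fsync() does not return until the file's data has been transferred to the
storage device (or an error is reported), whereas for sync() the writing is
scheduled but not necessarily complete on return. The helper names
write_all() and write_file_durably() are hypothetical and exist only for this
illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Write len bytes, retrying on short writes and EINTR. Returns 0 or -1. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t) n;
    }
    return 0;
}

/* Create/truncate path, write buf, and wait for it to reach the device. */
int write_file_durably(const char *path, const char *buf, size_t len)
{
    int fd, rc;

    assert(path != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    rc = write_all(fd, buf, len);    /* data is now only in kernel buffers */

    /*
     * fsync() blocks until this one file is flushed. A plain sync() here
     * would cover every file, but may return before the writes complete.
     */
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```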

#13Tom Lane
tgl@sss.pgh.pa.us
In reply to: Bruce Momjian (#12)
Re: Background writer process

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Shridhar Daithankar wrote:

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.

sync() for WAL is a complete nonstarter, because it gives you no
guarantees at all about whether the write has occurred. I don't really
care what you say about speed; this is a correctness point.

regards, tom lane

#14Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Tom Lane (#13)
Re: Background writer process

Tom Lane wrote:

Bruce Momjian <pgman@candle.pha.pa.us> writes:

Shridhar Daithankar wrote:

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

I think we still need sync() for WAL because sometimes backends are
going to have to write their own buffers, and we don't want them using
fsync or it will be very slow.

sync() for WAL is a complete nonstarter, because it gives you no
guarantees at all about whether the write has occurred. I don't really
care what you say about speed; this is a correctness point.

Sorry, I meant sync() is needed for recycling WAL (checkpoint), not for
WAL writes. I assume that's what Shridhar meant, but now I am not sure.

#15Shridhar Daithankar
shridhar_daithankar@myrealbox.com
In reply to: Jan Wieck (#11)
Re: Background writer process

On Friday 14 November 2003 22:10, Jan Wieck wrote:

Shridhar Daithankar wrote:

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or
may not write it immediately, unless it is a hard sync.

I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.

Oops.. that's bad.

Since postgresql can afford lazy writes for data files, I think this
could work.

The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.

I just made another posting on patches for a thread crossing win32-devel.

Essentially I said

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC (not sure if the current code does
it; the hackery in xlog.c is not exactly trivial).
2. Open data files normally and fsync them only in the background writer process.

Now the BGWriter process will flush everything at the time of checkpointing. It
does not need to flush WAL because of O_SYNC (ideally, but an additional fsync
won't hurt). So it just flushes all the file descriptors touched since the last
checkpoint, which should not be much of a load because it is flushing those
files intermittently anyway.

It could also work nicely if only the background writer fsyncs the data files.
Backends can either wait or proceed to other business until the disk is
flushed. Backends certainly need to wait while committing, and syncing to disk
in the background process, as opposed to in the current process, should add only a
rather small delay.

In case of a commit, BGWriter could get away with the files touched in the transaction
+WAL, as opposed to all files touched since the last checkpoint+WAL in case of a
checkpoint. I don't know how difficult that would be.

What is different in the current BGWriter implementation? Use of sync()?

Shridhar

#16Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Shridhar Daithankar (#15)
Re: Background writer process

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if

Without grouping WAL writes, that does not fly. Iff, however, such grouping
is implemented, that should deliver optimal performance. I don't think flushing
WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k
to disk with one call takes nearly the same time. The WAL write would need to be
done as soon as either 256k fills up or a txn commits.
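
The grouping rule can be sketched as a small accumulator that buffers log
records and issues one write plus one fsync when either 256k has filled up or
a commit record arrives. This is only an illustration under stated
assumptions, not the xlog.c logic: WalGroup, wal_group_init(), wal_flush()
and wal_append() are hypothetical names, and a plain fsync() is used in place
of an O_SYNC|O_DIRECT open to keep the example portable.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define WAL_GROUP_BYTES (256 * 1024)    /* group size mentioned in the thread */

typedef struct {
    char          data[WAL_GROUP_BYTES];
    size_t        used;       /* bytes buffered but not yet written */
    int           fd;         /* log file opened by wal_group_init() */
    unsigned long flushes;    /* number of write+fsync rounds so far */
} WalGroup;

int wal_group_init(WalGroup *g, const char *path)
{
    g->used = 0;
    g->flushes = 0;
    g->fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    return g->fd < 0 ? -1 : 0;
}

/* Write everything buffered, then fsync once for the whole group. */
int wal_flush(WalGroup *g)
{
    size_t off = 0;

    while (off < g->used) {
        ssize_t n = write(g->fd, g->data + off, g->used - off);

        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        off += (size_t) n;
    }
    if (fsync(g->fd) != 0)
        return -1;
    g->used = 0;
    g->flushes++;
    return 0;
}

/* Buffer one record; flush when the group fills up or on a commit record. */
int wal_append(WalGroup *g, const void *rec, size_t len, int is_commit)
{
    assert(len <= WAL_GROUP_BYTES);

    /* Make room first if this record would overflow the group. */
    if (g->used + len > WAL_GROUP_BYTES && wal_flush(g) != 0)
        return -1;

    memcpy(g->data + g->used, rec, len);
    g->used += len;

    if (is_commit || g->used == WAL_GROUP_BYTES)
        return wal_flush(g);
    return 0;
}
```

With this shape, many small non-commit records cost no physical IO at all
until the next commit or until the group is full.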

Andreas

#17Shridhar Daithankar
shridhar_daithankar@persistent.co.in
In reply to: Zeugswetter Andreas SB SD (#16)
Re: Background writer process

Zeugswetter Andreas SB SD wrote:

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if

Without grouping WAL writes, that does not fly. Iff, however, such grouping
is implemented, that should deliver optimal performance. I don't think flushing
WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k
to disk with one call takes nearly the same time. The WAL write would need to be
done as soon as either 256k fills up or a txn commits.

Does that mean no special treatment for WAL files? If it works, great. There would be
a single class of files to take care of w.r.t. the sync issue. Even simpler.

Shridhar

#18Bruce Momjian
pgman@candle.pha.pa.us
In reply to: Shridhar Daithankar (#15)
Re: Background writer process

Shridhar Daithankar wrote:

On Friday 14 November 2003 22:10, Jan Wieck wrote:

Shridhar Daithankar wrote:

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or
may not write it immediately, unless it is a hard sync.

I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.

Oops.. that's bad.

Yes, one idea I had was to do an fsync on a new file _after_ issuing
sync, hoping that this will complete after all the sync buffers are
done.

Since postgresql can afford lazy writes for data files, I think this
could work.

The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.

I just made another posting on patches for a thread crossing win32-devel.

Essentially I said

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if current code does
it. The hackery in xlog.c is not exactly trivial.)

We write WAL, then fsync, so if we write multiple blocks, we can write
them and fsync once, rather than O_SYNC every write.

2. Open data files normally and fsync them only in background writer process.

Now BGWriter process will flush everything at the time of checkpointing. It
does not need to flush WAL because of O_SYNC(ideally but an additional fsync
won't hurt). So it just flushes all the file descriptors touched since last
checkpoint, which should not be much of a load because it is flushing those
files intermittently anyways.

It could also work nicely if only the background writer fsyncs the data files.
Backends can either wait or proceed to other business until the disk is
flushed. Backends certainly need to wait while committing, and syncing to disk
in the background process, as opposed to in the current process, should add only a
rather small delay.

In case of commit, BGWriter could get away with files touched in transaction
+WAL as opposed to all files touched since last checkpoint+WAL in case of
checkpoint. I don't know how difficult that would be.

What is different in current BGwriter implementation? Use of sync()?

Well, basically we are still discussing how to do this. Right now the
background writer patch uses sync(), but the final version will use fsync
or O_SYNC, or maybe nothing.

The open item is whether a background process can keep the dirty
buffers cleaned fast enough to keep up with the maximum number of
backends. We might need to use multiple processes or threads to do
this. We certainly will have a background writer in 7.5 --- the big
question is whether _all_ writes will go through it. It certainly would
be nice if they could, and Tom thinks they can, so we are still exploring
this.

If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer; that will be
slower.

Another open issue is _if_ the background writer can't keep up with the
normal backends, do we allow normal backends to write dirty buffers, and
do they use fsync(), or can we record the file in a shared area and have
the background writer do the fsync. This is the issue of whether one
process can fsync all dirty buffers for the file or just the buffers it
wrote.
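
The "record the file in a shared area" variant could be sketched as a small
table that backends add to and the background writer drains. This is a
single-process illustration only: PendingFsync, pending_add() and
pending_drain() are hypothetical names, the table would really live in shared
memory under a lock, and file descriptors stand in for whatever file
identifier the processes would actually share. It also simply assumes that
one process's fsync() covers buffers written by another, which is exactly the
open question raised above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_PENDING 64

/* Files written since the last drain; locking is omitted in this sketch. */
typedef struct {
    int    fds[MAX_PENDING];
    size_t count;
} PendingFsync;

/* Called by a backend after write(). Duplicates collapse into one entry. */
int pending_add(PendingFsync *p, int fd)
{
    size_t i;

    assert(fd >= 0);
    for (i = 0; i < p->count; i++)
        if (p->fds[i] == fd)
            return 0;
    if (p->count == MAX_PENDING)
        return -1;    /* table full: caller would have to fsync itself */
    p->fds[p->count++] = fd;
    return 0;
}

/* Called by the background writer: fsync each file once, then clear. */
size_t pending_drain(PendingFsync *p)
{
    size_t i, failures = 0;

    for (i = 0; i < p->count; i++)
        if (fsync(p->fds[i]) != 0)
            failures++;
    p->count = 0;
    return failures;    /* number of files whose fsync() failed */
}
```

However many buffers of a file were written between two drains, the file is
fsynced only once, which is where the saving over a per-backend fsync would
come from.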

I think these are the basics of the current discussion.

#19Shridhar Daithankar
shridhar_daithankar@myrealbox.com
In reply to: Bruce Momjian (#18)
Re: Background writer process

Bruce Momjian wrote:

Shridhar Daithankar wrote:

On Friday 14 November 2003 22:10, Jan Wieck wrote:

Shridhar Daithankar wrote:

On Friday 14 November 2003 03:05, Jan Wieck wrote:

For sure the sync() needs to be replaced by the discussed fsync() of
recently written files. And I think the algorithm how much and how often
to flush can be significantly improved. But after all, this does not
change the real checkpointing at all, and the general framework having a
separate process is what we probably want.

Would having fsync for regular data files and sync for WAL segments be a
comfortable compromise? Or is this going to use fsync for all of them?

IMO, with fsync, we tell the kernel that it can write this buffer. It may or
may not write it immediately, unless it is a hard sync.

I think it's more the other way around. On some systems sync() might
return before all buffers are flushed to disk, while fsync() does not.

Oops.. that's bad.

Yes, one idea I had was to do an fsync on a new file _after_ issuing
sync, hoping that this will complete after all the sync buffers are
done.

Since postgresql can afford lazy writes for data files, I think this
could work.

The whole point of a checkpoint is to know for certain that a specific
change is in the datafile, so that it is safe to throw away older WAL
segments.

I just made another posting on patches for a thread crossing win32-devel.

Essentially I said

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if current code does
it. The hackery in xlog.c is not exactly trivial.)

We write WAL, then fsync, so if we write multiple blocks, we can write
them and fsync once, rather than O_SYNC every write.

2. Open data files normally and fsync them only in background writer process.

Now BGWriter process will flush everything at the time of checkpointing. It
does not need to flush WAL because of O_SYNC(ideally but an additional fsync
won't hurt). So it just flushes all the file descriptors touched since last
checkpoint, which should not be much of a load because it is flushing those
files intermittently anyways.

It could also work nicely if only the background writer fsyncs the data files.
Backends can either wait or proceed to other business until the disk is
flushed. Backends certainly need to wait while committing, and syncing to disk
in the background process, as opposed to in the current process, should add only a
rather small delay.

In case of commit, BGWriter could get away with files touched in transaction
+WAL as opposed to all files touched since last checkpoint+WAL in case of
checkpoint. I don't know how difficult that would be.

What is different in current BGwriter implementation? Use of sync()?

Well, basically we are still discussing how to do this. Right now the
background writer patch uses sync(), but the final version will use fsync
or O_SYNC, or maybe nothing.

The open item is whether a background process can keep the dirty
buffers cleaned fast enough to keep up with the maximum number of
backends. We might need to use multiple processes or threads to do
this. We certainly will have a background writer in 7.5 --- the big
question is whether _all_ writes will go through it. It certainly would
be nice if they could, and Tom thinks they can, so we are still exploring
this.

Given that fsync is blocking, the background writer has to scale up in terms of
processes/threads and load w.r.t. disk flushing.

I would vote for threads for the simple reason that, in BGWriter, threads are
needed only to flush the files. Get the fd, fsync it and get the next one. No need to
make the entire process thread safe.

Furthermore, BGWriter has to detect the disk limit. If adding threads does not
improve fsyncing speed, it should stop adding them and wait. There is nothing to
do when the disk is saturated.

If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer; that will be
slower.

Certainly. However, a file opened with O_SYNC would not require a separate fsync. I
suggested it only for WAL. But with WAL block grouping as suggested in another
post, using fsync for all files might be a good idea.

Just a thought.

Shridhar

#20Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Shridhar Daithankar (#19)
Re: Background writer process

If the background writer uses fsync, it can write and allow the buffer
to be reused and fsync later, while if we use O_SYNC, we have to wait
for the O_SYNC write to happen before reusing the buffer;
that will be slower.

You can forget O_SYNC for datafiles for now. There would simply be too much to
do currently to allow decent performance, like scatter/gather IO, ...
Imho the reasonable target should be to write from all backends but sync (fsync)
from the background writer only. (Tune the OS if it actually waits until the
pg-invoked sync, i.e. 5 minutes by default.)

Andreas

#21Zeugswetter Andreas SB SD
ZeugswetterA@spardat.at
In reply to: Zeugswetter Andreas SB SD (#20)
Re: Background writer process

1. Open WAL files with O_SYNC|O_DIRECT or O_SYNC(Not sure if

Without grouping WAL writes, that does not fly. Iff, however, such grouping
is implemented, that should deliver optimal performance. I don't think flushing
WAL to the OS early (before a tx commits) is necessary, since writing 8k or 256k
to disk with one call takes nearly the same time. The WAL write would need to be
done as soon as either 256k fills up or a txn commits.

Does that mean no special treatment for WAL files? If it works, great. There would be
a single class of files to take care of w.r.t. the sync issue. Even simpler.

No, WAL needs special handling. Either leave it as is with write + f[data]sync,
or implement O_SYNC|O_DIRECT with grouping of writes (the current O_SYNC implementation
is only good for small (<8kb) transactions).

Andreas