Separating bgwriter and checkpointer

Started by Simon Riggs over 14 years ago, 39 messages

simon@2ndQuadrant.com

over 14 years ago

1 attachment(s)

As discussed previously...

Currently the bgwriter process performs background writing,
checkpointing and some other duties. This means that we can't perform
the final checkpoint fsync without stopping background writing, so
there is a negative performance effect from doing both things in one
process.

Additionally, our aim in 9.2 is to replace polling loops with latches
for power reduction. The complexity of the bgwriter loops is high, and
it seems unlikely that we could come up with a clean approach using
latches.
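
For readers unfamiliar with the distinction: the polling style is the
one visible in BgWriterNap() in the attached patch, which sleeps in
1-second slices and rechecks its flags after each slice. The block
below is a minimal sketch of the alternative, assuming a plain POSIX
self-pipe plus poll(). It is not PostgreSQL's latch API, and every
name in it (wake_fd, wakeup_init, wakeup_set, wakeup_wait) is
hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical names throughout; this is NOT PostgreSQL's latch API. */
static int	wake_fd[2] = {-1, -1};	/* [0] = read end, [1] = write end */

/* Create the pipe and make both ends non-blocking.  Returns 0 on success. */
static int
wakeup_init(void)
{
	int			i;

	if (pipe(wake_fd) != 0)
		return -1;
	for (i = 0; i < 2; i++)
	{
		int			fl = fcntl(wake_fd[i], F_GETFL);

		if (fl == -1 || fcntl(wake_fd[i], F_SETFL, fl | O_NONBLOCK) == -1)
			return -1;
	}
	return 0;
}

/* Request a wakeup.  Only uses write(), so it is safe in a signal handler. */
static void
wakeup_set(void)
{
	int			save_errno = errno;
	ssize_t		rc = write(wake_fd[1], "x", 1);

	(void) rc;					/* a full pipe already guarantees a wakeup */
	errno = save_errno;
}

/* Sleep for up to timeout_ms.  Returns 1 if woken early, 0 on timeout. */
static int
wakeup_wait(int timeout_ms)
{
	struct pollfd pfd;
	char		buf[64];
	int			rc;

	assert(wake_fd[0] >= 0);	/* wakeup_init() must have succeeded */
	pfd.fd = wake_fd[0];
	pfd.events = POLLIN;
	pfd.revents = 0;

	/* If a signal interrupts poll(), retry; a pending byte ends the wait. */
	do
		rc = poll(&pfd, 1, timeout_ms);
	while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return 0;

	/* Drain the pipe so that the next wait blocks again. */
	while (read(wake_fd[0], buf, sizeof(buf)) > 0)
		;
	return 1;
}
```

With this shape a process can block for the whole timeout and still
react at once when a signal handler calls wakeup_set(), instead of
waking every second to recheck flags. That is the power saving being
aimed at, sketched here under the assumptions stated above.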

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

Checkpointer does the important things, "new bgwriter" just does
background writing and so is much less important than before.

The current patch has a bug at shutdown that I've not located yet, but
it seems likely to be a simple error. That is mainly because, for
personal reasons, I've not been able to work on the patch recently. I
expect to be able to fix that later in the CF.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

bgwriter_split.v1.patch (application/octet-stream)

*** a/src/backend/bootstrap/bootstrap.c
--- b/src/backend/bootstrap/bootstrap.c
***************
*** 315,320 **** AuxiliaryProcessMain(int argc, char *argv[])
--- 315,323 ----
  			case BgWriterProcess:
  				statmsg = "writer process";
  				break;
+ 			case CheckpointerProcess:
+ 				statmsg = "checkpointer process";
+ 				break;
  			case WalWriterProcess:
  				statmsg = "wal writer process";
  				break;
***************
*** 419,424 **** AuxiliaryProcessMain(int argc, char *argv[])
--- 422,432 ----
  			BackgroundWriterMain();
  			proc_exit(1);		/* should never return */
  
+ 		case CheckpointerProcess:
+ 			/* don't set signals, checkpointer has its own agenda */
+ 			CheckpointerMain();
+ 			proc_exit(1);		/* should never return */
+ 
  		case WalWriterProcess:
  			/* don't set signals, walwriter has its own agenda */
  			InitXLOGAccess();
*** a/src/backend/postmaster/Makefile
--- b/src/backend/postmaster/Makefile
***************
*** 13,18 **** top_builddir = ../../..
  include $(top_builddir)/src/Makefile.global
  
  OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
! 	syslogger.o walwriter.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 13,18 ----
  include $(top_builddir)/src/Makefile.global
  
  OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
! 	syslogger.o walwriter.o checkpointer.o
  
  include $(top_srcdir)/src/backend/common.mk
*** a/src/backend/postmaster/bgwriter.c
--- b/src/backend/postmaster/bgwriter.c
***************
*** 10,29 ****
   * still empowered to issue writes if the bgwriter fails to maintain enough
   * clean shared buffers.
   *
!  * The bgwriter is also charged with handling all checkpoints.	It will
!  * automatically dispatch a checkpoint after a certain amount of time has
!  * elapsed since the last one, and it can be signaled to perform requested
!  * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
!  * so many WAL segments is implemented by having backends signal the bgwriter
!  * when they fill WAL segments; the bgwriter itself doesn't watch for the
!  * condition.)
   *
   * The bgwriter is started by the postmaster as soon as the startup subprocess
   * finishes, or as soon as recovery begins if we are doing archive recovery.
   * It remains alive until the postmaster commands it to terminate.
!  * Normal termination is by SIGUSR2, which instructs the bgwriter to execute
!  * a shutdown checkpoint and then exit(0).	(All backends must be stopped
!  * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
   * backend, the bgwriter will simply abort and exit on SIGQUIT.
   *
   * If the bgwriter exits unexpectedly, the postmaster treats that the same
--- 10,22 ----
   * still empowered to issue writes if the bgwriter fails to maintain enough
   * clean shared buffers.
   *
!  * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
   *
   * The bgwriter is started by the postmaster as soon as the startup subprocess
   * finishes, or as soon as recovery begins if we are doing archive recovery.
   * It remains alive until the postmaster commands it to terminate.
!  * Normal termination is by SIGUSR2, which instructs the bgwriter to exit(0).
!  * Emergency termination is by SIGQUIT; like any
   * backend, the bgwriter will simply abort and exit on SIGQUIT.
   *
   * If the bgwriter exits unexpectedly, the postmaster treats that the same
***************
*** 54,60 ****
  #include "miscadmin.h"
  #include "pgstat.h"
  #include "postmaster/bgwriter.h"
- #include "replication/syncrep.h"
  #include "storage/bufmgr.h"
  #include "storage/ipc.h"
  #include "storage/lwlock.h"
--- 47,52 ----
***************
*** 67,162 ****
  #include "utils/resowner.h"
  
  
- /*----------
-  * Shared memory area for communication between bgwriter and backends
-  *
-  * The ckpt counters allow backends to watch for completion of a checkpoint
-  * request they send.  Here's how it works:
-  *	* At start of a checkpoint, bgwriter reads (and clears) the request flags
-  *	  and increments ckpt_started, while holding ckpt_lck.
-  *	* On completion of a checkpoint, bgwriter sets ckpt_done to
-  *	  equal ckpt_started.
-  *	* On failure of a checkpoint, bgwriter increments ckpt_failed
-  *	  and sets ckpt_done to equal ckpt_started.
-  *
-  * The algorithm for backends is:
-  *	1. Record current values of ckpt_failed and ckpt_started, and
-  *	   set request flags, while holding ckpt_lck.
-  *	2. Send signal to request checkpoint.
-  *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
-  *	   begun since you started this algorithm (although *not* that it was
-  *	   specifically initiated by your signal), and that it is using your flags.
-  *	4. Record new value of ckpt_started.
-  *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
-  *	   arithmetic here in case counters wrap around.)  Now you know a
-  *	   checkpoint has started and completed, but not whether it was
-  *	   successful.
-  *	6. If ckpt_failed is different from the originally saved value,
-  *	   assume request failed; otherwise it was definitely successful.
-  *
-  * ckpt_flags holds the OR of the checkpoint request flags sent by all
-  * requesting backends since the last checkpoint start.  The flags are
-  * chosen so that OR'ing is the correct way to combine multiple requests.
-  *
-  * num_backend_writes is used to count the number of buffer writes performed
-  * by non-bgwriter processes.  This counter should be wide enough that it
-  * can't overflow during a single bgwriter cycle.  num_backend_fsync
-  * counts the subset of those writes that also had to do their own fsync,
-  * because the background writer failed to absorb their request.
-  *
-  * The requests array holds fsync requests sent by backends and not yet
-  * absorbed by the bgwriter.
-  *
-  * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
-  * the requests fields are protected by BgWriterCommLock.
-  *----------
-  */
- typedef struct
- {
- 	RelFileNodeBackend rnode;
- 	ForkNumber	forknum;
- 	BlockNumber segno;			/* see md.c for special values */
- 	/* might add a real request-type field later; not needed yet */
- } BgWriterRequest;
- 
- typedef struct
- {
- 	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
- 
- 	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
- 
- 	int			ckpt_started;	/* advances when checkpoint starts */
- 	int			ckpt_done;		/* advances when checkpoint done */
- 	int			ckpt_failed;	/* advances when checkpoint fails */
- 
- 	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
- 
- 	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
- 	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
- 
- 	int			num_requests;	/* current # of requests */
- 	int			max_requests;	/* allocated array size */
- 	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
- } BgWriterShmemStruct;
- 
- static BgWriterShmemStruct *BgWriterShmem;
- 
- /* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
- #define WRITES_PER_ABSORB		1000
- 
  /*
   * GUC parameters
   */
  int			BgWriterDelay = 200;
- int			CheckPointTimeout = 300;
- int			CheckPointWarning = 30;
- double		CheckPointCompletionTarget = 0.5;
  
  /*
   * Flags set by interrupt handlers for later service in the main loop.
   */
  static volatile sig_atomic_t got_SIGHUP = false;
- static volatile sig_atomic_t checkpoint_requested = false;
  static volatile sig_atomic_t shutdown_requested = false;
  
  /*
--- 59,73 ----
***************
*** 164,192 **** static volatile sig_atomic_t shutdown_requested = false;
   */
  static bool am_bg_writer = false;
  
- static bool ckpt_active = false;
- 
- /* these values are valid when ckpt_active is true: */
- static pg_time_t ckpt_start_time;
- static XLogRecPtr ckpt_start_recptr;
- static double ckpt_cached_elapsed;
- 
- static pg_time_t last_checkpoint_time;
- static pg_time_t last_xlog_switch_time;
- 
  /* Prototypes for private functions */
  
- static void CheckArchiveTimeout(void);
  static void BgWriterNap(void);
- static bool IsCheckpointOnSchedule(double progress);
- static bool ImmediateCheckpointRequested(void);
- static bool CompactBgwriterRequestQueue(void);
  
  /* Signal handlers */
  
  static void bg_quickdie(SIGNAL_ARGS);
  static void BgSigHupHandler(SIGNAL_ARGS);
- static void ReqCheckpointHandler(SIGNAL_ARGS);
  static void ReqShutdownHandler(SIGNAL_ARGS);
  
  
--- 75,88 ----
***************
*** 202,208 **** BackgroundWriterMain(void)
  	sigjmp_buf	local_sigjmp_buf;
  	MemoryContext bgwriter_context;
  
- 	BgWriterShmem->bgwriter_pid = MyProcPid;
  	am_bg_writer = true;
  
  	/*
--- 98,103 ----
***************
*** 228,235 **** BackgroundWriterMain(void)
  	 * process to participate in ProcSignal signalling.
  	 */
  	pqsignal(SIGHUP, BgSigHupHandler);	/* set flag to read config file */
! 	pqsignal(SIGINT, ReqCheckpointHandler);		/* request checkpoint */
! 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
  	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
  	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
--- 123,130 ----
  	 * process to participate in ProcSignal signalling.
  	 */
  	pqsignal(SIGHUP, BgSigHupHandler);	/* set flag to read config file */
! 	pqsignal(SIGINT, SIG_IGN);			/* as of 9.2 no longer requests checkpoint */
! 	pqsignal(SIGTERM, SIG_IGN); 		/* ignore SIGTERM */
  	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
  	pqsignal(SIGALRM, SIG_IGN);
  	pqsignal(SIGPIPE, SIG_IGN);
***************
*** 249,259 **** BackgroundWriterMain(void)
  	sigdelset(&BlockSig, SIGQUIT);
  
  	/*
- 	 * Initialize so that first time-driven event happens at the correct time.
- 	 */
- 	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
- 
- 	/*
  	 * Create a resource owner to keep track of our resources (currently only
  	 * buffer pins).
  	 */
--- 144,149 ----
***************
*** 305,324 **** BackgroundWriterMain(void)
  		AtEOXact_Files();
  		AtEOXact_HashTables(false);
  
- 		/* Warn any waiting backends that the checkpoint failed. */
- 		if (ckpt_active)
- 		{
- 			/* use volatile pointer to prevent code rearrangement */
- 			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
- 
- 			SpinLockAcquire(&bgs->ckpt_lck);
- 			bgs->ckpt_failed++;
- 			bgs->ckpt_done = bgs->ckpt_started;
- 			SpinLockRelease(&bgs->ckpt_lck);
- 
- 			ckpt_active = false;
- 		}
- 
  		/*
  		 * Now return to normal top-level context and clear ErrorContext for
  		 * next time.
--- 195,200 ----
***************
*** 361,379 **** BackgroundWriterMain(void)
  	if (RecoveryInProgress())
  		ThisTimeLineID = GetRecoveryTargetTLI();
  
- 	/* Do this once before starting the loop, then just at SIGHUP time. */
- 	SyncRepUpdateSyncStandbysDefined();
- 
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
- 		bool		do_checkpoint = false;
- 		int			flags = 0;
- 		pg_time_t	now;
- 		int			elapsed_secs;
- 
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
--- 237,247 ----
***************
*** 381,403 **** BackgroundWriterMain(void)
  		if (!PostmasterIsAlive())
  			exit(1);
  
- 		/*
- 		 * Process any requests or signals received recently.
- 		 */
- 		AbsorbFsyncRequests();
- 
  		if (got_SIGHUP)
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  			/* update global shmem state for sync rep */
- 			SyncRepUpdateSyncStandbysDefined();
- 		}
- 		if (checkpoint_requested)
- 		{
- 			checkpoint_requested = false;
- 			do_checkpoint = true;
- 			BgWriterStats.m_requested_checkpoints++;
  		}
  		if (shutdown_requested)
  		{
--- 249,259 ----
***************
*** 406,547 **** BackgroundWriterMain(void)
  			 * control back to the sigsetjmp block above
  			 */
  			ExitOnAnyError = true;
- 			/* Close down the database */
- 			ShutdownXLOG(0, 0);
  			/* Normal exit from the bgwriter is here */
  			proc_exit(0);		/* done */
  		}
  
  		/*
! 		 * Force a checkpoint if too much time has elapsed since the last one.
! 		 * Note that we count a timed checkpoint in stats only when this
! 		 * occurs without an external request, but we set the CAUSE_TIME flag
! 		 * bit even if there is also an external request.
  		 */
! 		now = (pg_time_t) time(NULL);
! 		elapsed_secs = now - last_checkpoint_time;
! 		if (elapsed_secs >= CheckPointTimeout)
! 		{
! 			if (!do_checkpoint)
! 				BgWriterStats.m_timed_checkpoints++;
! 			do_checkpoint = true;
! 			flags |= CHECKPOINT_CAUSE_TIME;
! 		}
! 
! 		/*
! 		 * Do a checkpoint if requested, otherwise do one cycle of
! 		 * dirty-buffer writing.
! 		 */
! 		if (do_checkpoint)
! 		{
! 			bool		ckpt_performed = false;
! 			bool		do_restartpoint;
! 
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 
! 			/*
! 			 * Check if we should perform a checkpoint or a restartpoint. As a
! 			 * side-effect, RecoveryInProgress() initializes TimeLineID if
! 			 * it's not set yet.
! 			 */
! 			do_restartpoint = RecoveryInProgress();
! 
! 			/*
! 			 * Atomically fetch the request flags to figure out what kind of a
! 			 * checkpoint we should perform, and increase the started-counter
! 			 * to acknowledge that we've started a new checkpoint.
! 			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			flags |= bgs->ckpt_flags;
! 			bgs->ckpt_flags = 0;
! 			bgs->ckpt_started++;
! 			SpinLockRelease(&bgs->ckpt_lck);
! 
! 			/*
! 			 * The end-of-recovery checkpoint is a real checkpoint that's
! 			 * performed while we're still in recovery.
! 			 */
! 			if (flags & CHECKPOINT_END_OF_RECOVERY)
! 				do_restartpoint = false;
! 
! 			/*
! 			 * We will warn if (a) too soon since last checkpoint (whatever
! 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 			 * since the last checkpoint start.  Note in particular that this
! 			 * implementation will not generate warnings caused by
! 			 * CheckPointTimeout < CheckPointWarning.
! 			 */
! 			if (!do_restartpoint &&
! 				(flags & CHECKPOINT_CAUSE_XLOG) &&
! 				elapsed_secs < CheckPointWarning)
! 				ereport(LOG,
! 						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
! 				"checkpoints are occurring too frequently (%d seconds apart)",
! 									   elapsed_secs,
! 									   elapsed_secs),
! 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! 
! 			/*
! 			 * Initialize bgwriter-private variables used during checkpoint.
! 			 */
! 			ckpt_active = true;
! 			if (!do_restartpoint)
! 				ckpt_start_recptr = GetInsertRecPtr();
! 			ckpt_start_time = now;
! 			ckpt_cached_elapsed = 0;
! 
! 			/*
! 			 * Do the checkpoint.
! 			 */
! 			if (!do_restartpoint)
! 			{
! 				CreateCheckPoint(flags);
! 				ckpt_performed = true;
! 			}
! 			else
! 				ckpt_performed = CreateRestartPoint(flags);
! 
! 			/*
! 			 * After any checkpoint, close all smgr files.	This is so we
! 			 * won't hang onto smgr references to deleted files indefinitely.
! 			 */
! 			smgrcloseall();
! 
! 			/*
! 			 * Indicate checkpoint completion to any waiting backends.
! 			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			bgs->ckpt_done = bgs->ckpt_started;
! 			SpinLockRelease(&bgs->ckpt_lck);
! 
! 			if (ckpt_performed)
! 			{
! 				/*
! 				 * Note we record the checkpoint start time not end time as
! 				 * last_checkpoint_time.  This is so that time-driven
! 				 * checkpoints happen at a predictable spacing.
! 				 */
! 				last_checkpoint_time = now;
! 			}
! 			else
! 			{
! 				/*
! 				 * We were not able to perform the restartpoint (checkpoints
! 				 * throw an ERROR in case of error).  Most likely because we
! 				 * have not received any new checkpoint WAL records since the
! 				 * last restartpoint. Try again in 15 s.
! 				 */
! 				last_checkpoint_time = now - CheckPointTimeout + 15;
! 			}
! 
! 			ckpt_active = false;
! 		}
! 		else
! 			BgBufferSync();
! 
! 		/* Check for archive_timeout and switch xlog files if necessary. */
! 		CheckArchiveTimeout();
  
  		/* Nap for the configured time. */
  		BgWriterNap();
--- 262,275 ----
  			 * control back to the sigsetjmp block above
  			 */
  			ExitOnAnyError = true;
  			/* Normal exit from the bgwriter is here */
  			proc_exit(0);		/* done */
  		}
  
  		/*
! 		 * Do one cycle of dirty-buffer writing.
  		 */
! 		BgBufferSync();
  
  		/* Nap for the configured time. */
  		BgWriterNap();
***************
*** 549,609 **** BackgroundWriterMain(void)
  }
  
  /*
-  * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
-  *
-  * This will switch to a new WAL file and force an archive file write
-  * if any activity is recorded in the current WAL file, including just
-  * a single checkpoint record.
-  */
- static void
- CheckArchiveTimeout(void)
- {
- 	pg_time_t	now;
- 	pg_time_t	last_time;
- 
- 	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
- 		return;
- 
- 	now = (pg_time_t) time(NULL);
- 
- 	/* First we do a quick check using possibly-stale local state. */
- 	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
- 		return;
- 
- 	/*
- 	 * Update local state ... note that last_xlog_switch_time is the last time
- 	 * a switch was performed *or requested*.
- 	 */
- 	last_time = GetLastSegSwitchTime();
- 
- 	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
- 
- 	/* Now we can do the real check */
- 	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
- 	{
- 		XLogRecPtr	switchpoint;
- 
- 		/* OK, it's time to switch */
- 		switchpoint = RequestXLogSwitch();
- 
- 		/*
- 		 * If the returned pointer points exactly to a segment boundary,
- 		 * assume nothing happened.
- 		 */
- 		if ((switchpoint.xrecoff % XLogSegSize) != 0)
- 			ereport(DEBUG1,
- 				(errmsg("transaction log switch forced (archive_timeout=%d)",
- 						XLogArchiveTimeout)));
- 
- 		/*
- 		 * Update state in any case, so we don't retry constantly when the
- 		 * system is idle.
- 		 */
- 		last_xlog_switch_time = now;
- 	}
- }
- 
- /*
   * BgWriterNap -- Nap for the configured time or until a signal is received.
   */
  static void
--- 277,282 ----
***************
*** 624,808 **** BgWriterNap(void)
  	 * respond reasonably promptly when someone signals us, break down the
  	 * sleep into 1-second increments, and check for interrupts after each
  	 * nap.
- 	 *
- 	 * We absorb pending requests after each short sleep.
  	 */
! 	if (bgwriter_lru_maxpages > 0 || ckpt_active)
  		udelay = BgWriterDelay * 1000L;
- 	else if (XLogArchiveTimeout > 0)
- 		udelay = 1000000L;		/* One second */
  	else
  		udelay = 10000000L;		/* Ten seconds */
  
  	while (udelay > 999999L)
  	{
! 		if (got_SIGHUP || shutdown_requested ||
! 		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
- 		AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
! 	if (!(got_SIGHUP || shutdown_requested ||
! 	  (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)))
  		pg_usleep(udelay);
  }
  
- /*
-  * Returns true if an immediate checkpoint request is pending.	(Note that
-  * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
-  * there is one pending behind it.)
-  */
- static bool
- ImmediateCheckpointRequested(void)
- {
- 	if (checkpoint_requested)
- 	{
- 		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
- 
- 		/*
- 		 * We don't need to acquire the ckpt_lck in this case because we're
- 		 * only looking at a single flag bit.
- 		 */
- 		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
- 			return true;
- 	}
- 	return false;
- }
- 
- /*
-  * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
-  *
-  * This function is called after each page write performed by BufferSync().
-  * It is responsible for keeping the bgwriter's normal activities in
-  * progress during a long checkpoint, and for throttling BufferSync()'s
-  * write rate to hit checkpoint_completion_target.
-  *
-  * The checkpoint request flags should be passed in; currently the only one
-  * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
-  *
-  * 'progress' is an estimate of how much of the work has been done, as a
-  * fraction between 0.0 meaning none, and 1.0 meaning all done.
-  */
- void
- CheckpointWriteDelay(int flags, double progress)
- {
- 	static int	absorb_counter = WRITES_PER_ABSORB;
- 
- 	/* Do nothing if checkpoint is being executed by non-bgwriter process */
- 	if (!am_bg_writer)
- 		return;
- 
- 	/*
- 	 * Perform the usual bgwriter duties and take a nap, unless we're behind
- 	 * schedule, in which case we just try to catch up as quickly as possible.
- 	 */
- 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
- 		!shutdown_requested &&
- 		!ImmediateCheckpointRequested() &&
- 		IsCheckpointOnSchedule(progress))
- 	{
- 		if (got_SIGHUP)
- 		{
- 			got_SIGHUP = false;
- 			ProcessConfigFile(PGC_SIGHUP);
- 			/* update global shmem state for sync rep */
- 			SyncRepUpdateSyncStandbysDefined();
- 		}
- 
- 		AbsorbFsyncRequests();
- 		absorb_counter = WRITES_PER_ABSORB;
- 
- 		BgBufferSync();
- 		CheckArchiveTimeout();
- 		BgWriterNap();
- 	}
- 	else if (--absorb_counter <= 0)
- 	{
- 		/*
- 		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
- 		 * operations even when we don't sleep, to prevent overflow of the
- 		 * fsync request queue.
- 		 */
- 		AbsorbFsyncRequests();
- 		absorb_counter = WRITES_PER_ABSORB;
- 	}
- }
- 
- /*
-  * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
-  *		 in time?
-  *
-  * Compares the current progress against the time/segments elapsed since last
-  * checkpoint, and returns true if the progress we've made this far is greater
-  * than the elapsed time/segments.
-  */
- static bool
- IsCheckpointOnSchedule(double progress)
- {
- 	XLogRecPtr	recptr;
- 	struct timeval now;
- 	double		elapsed_xlogs,
- 				elapsed_time;
- 
- 	Assert(ckpt_active);
- 
- 	/* Scale progress according to checkpoint_completion_target. */
- 	progress *= CheckPointCompletionTarget;
- 
- 	/*
- 	 * Check against the cached value first. Only do the more expensive
- 	 * calculations once we reach the target previously calculated. Since
- 	 * neither time or WAL insert pointer moves backwards, a freshly
- 	 * calculated value can only be greater than or equal to the cached value.
- 	 */
- 	if (progress < ckpt_cached_elapsed)
- 		return false;
- 
- 	/*
- 	 * Check progress against WAL segments written and checkpoint_segments.
- 	 *
- 	 * We compare the current WAL insert location against the location
- 	 * computed before calling CreateCheckPoint. The code in XLogInsert that
- 	 * actually triggers a checkpoint when checkpoint_segments is exceeded
- 	 * compares against RedoRecptr, so this is not completely accurate.
- 	 * However, it's good enough for our purposes, we're only calculating an
- 	 * estimate anyway.
- 	 */
- 	if (!RecoveryInProgress())
- 	{
- 		recptr = GetInsertRecPtr();
- 		elapsed_xlogs =
- 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
- 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
- 			CheckPointSegments;
- 
- 		if (progress < elapsed_xlogs)
- 		{
- 			ckpt_cached_elapsed = elapsed_xlogs;
- 			return false;
- 		}
- 	}
- 
- 	/*
- 	 * Check progress against time elapsed and checkpoint_timeout.
- 	 */
- 	gettimeofday(&now, NULL);
- 	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
- 					now.tv_usec / 1000000.0) / CheckPointTimeout;
- 
- 	if (progress < elapsed_time)
- 	{
- 		ckpt_cached_elapsed = elapsed_time;
- 		return false;
- 	}
- 
- 	/* It looks like we're on schedule. */
- 	return true;
- }
- 
- 
  /* --------------------------------
   *		signal handler routines
   * --------------------------------
--- 297,320 ----
  	 * respond reasonably promptly when someone signals us, break down the
  	 * sleep into 1-second increments, and check for interrupts after each
  	 * nap.
  	 */
! 	if (bgwriter_lru_maxpages > 0)
  		udelay = BgWriterDelay * 1000L;
  	else
  		udelay = 10000000L;		/* Ten seconds */
  
  	while (udelay > 999999L)
  	{
! 		if (got_SIGHUP || shutdown_requested)
  			break;
  		pg_usleep(1000000L);
  		udelay -= 1000000L;
  	}
  
! 	if (!(got_SIGHUP || shutdown_requested))
  		pg_usleep(udelay);
  }
  
  /* --------------------------------
   *		signal handler routines
   * --------------------------------
***************
*** 847,1287 **** BgSigHupHandler(SIGNAL_ARGS)
  	got_SIGHUP = true;
  }
  
- /* SIGINT: set flag to run a normal checkpoint right away */
- static void
- ReqCheckpointHandler(SIGNAL_ARGS)
- {
- 	checkpoint_requested = true;
- }
- 
  /* SIGUSR2: set flag to run a shutdown checkpoint and exit */
  static void
  ReqShutdownHandler(SIGNAL_ARGS)
  {
  	shutdown_requested = true;
  }
- 
- 
- /* --------------------------------
-  *		communication with backends
-  * --------------------------------
-  */
- 
- /*
-  * BgWriterShmemSize
-  *		Compute space needed for bgwriter-related shared memory
-  */
- Size
- BgWriterShmemSize(void)
- {
- 	Size		size;
- 
- 	/*
- 	 * Currently, the size of the requests[] array is arbitrarily set equal to
- 	 * NBuffers.  This may prove too large or small ...
- 	 */
- 	size = offsetof(BgWriterShmemStruct, requests);
- 	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
- 
- 	return size;
- }
- 
- /*
-  * BgWriterShmemInit
-  *		Allocate and initialize bgwriter-related shared memory
-  */
- void
- BgWriterShmemInit(void)
- {
- 	bool		found;
- 
- 	BgWriterShmem = (BgWriterShmemStruct *)
- 		ShmemInitStruct("Background Writer Data",
- 						BgWriterShmemSize(),
- 						&found);
- 
- 	if (!found)
- 	{
- 		/* First time through, so initialize */
- 		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
- 		SpinLockInit(&BgWriterShmem->ckpt_lck);
- 		BgWriterShmem->max_requests = NBuffers;
- 	}
- }
- 
- /*
-  * RequestCheckpoint
-  *		Called in backend processes to request a checkpoint
-  *
-  * flags is a bitwise OR of the following:
-  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
-  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
-  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
-  *		ignoring checkpoint_completion_target parameter.
-  *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
-  *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
-  *		CHECKPOINT_END_OF_RECOVERY).
-  *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
-  *		just signal bgwriter to do it, and return).
-  *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
-  *		(This affects logging, and in particular enables CheckPointWarning.)
-  */
- void
- RequestCheckpoint(int flags)
- {
- 	/* use volatile pointer to prevent code rearrangement */
- 	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
- 	int			ntries;
- 	int			old_failed,
- 				old_started;
- 
- 	/*
- 	 * If in a standalone backend, just do it ourselves.
- 	 */
- 	if (!IsPostmasterEnvironment)
- 	{
- 		/*
- 		 * There's no point in doing slow checkpoints in a standalone backend,
- 		 * because there's no other backends the checkpoint could disrupt.
- 		 */
- 		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
- 
- 		/*
- 		 * After any checkpoint, close all smgr files.	This is so we won't
- 		 * hang onto smgr references to deleted files indefinitely.
- 		 */
- 		smgrcloseall();
- 
- 		return;
- 	}
- 
- 	/*
- 	 * Atomically set the request flags, and take a snapshot of the counters.
- 	 * When we see ckpt_started > old_started, we know the flags we set here
- 	 * have been seen by bgwriter.
- 	 *
- 	 * Note that we OR the flags with any existing flags, to avoid overriding
- 	 * a "stronger" request by another backend.  The flag senses must be
- 	 * chosen to make this work!
- 	 */
- 	SpinLockAcquire(&bgs->ckpt_lck);
- 
- 	old_failed = bgs->ckpt_failed;
- 	old_started = bgs->ckpt_started;
- 	bgs->ckpt_flags |= flags;
- 
- 	SpinLockRelease(&bgs->ckpt_lck);
- 
- 	/*
- 	 * Send signal to request checkpoint.  It's possible that the bgwriter
- 	 * hasn't started yet, or is in process of restarting, so we will retry a
- 	 * few times if needed.  Also, if not told to wait for the checkpoint to
- 	 * occur, we consider failure to send the signal to be nonfatal and merely
- 	 * LOG it.
- 	 */
- 	for (ntries = 0;; ntries++)
- 	{
- 		if (BgWriterShmem->bgwriter_pid == 0)
- 		{
- 			if (ntries >= 20)	/* max wait 2.0 sec */
- 			{
- 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- 				"could not request checkpoint because bgwriter not running");
- 				break;
- 			}
- 		}
- 		else if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
- 		{
- 			if (ntries >= 20)	/* max wait 2.0 sec */
- 			{
- 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
- 					 "could not signal for checkpoint: %m");
- 				break;
- 			}
- 		}
- 		else
- 			break;				/* signal sent successfully */
- 
- 		CHECK_FOR_INTERRUPTS();
- 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
- 	}
- 
- 	/*
- 	 * If requested, wait for completion.  We detect completion according to
- 	 * the algorithm given above.
- 	 */
- 	if (flags & CHECKPOINT_WAIT)
- 	{
- 		int			new_started,
- 					new_failed;
- 
- 		/* Wait for a new checkpoint to start. */
- 		for (;;)
- 		{
- 			SpinLockAcquire(&bgs->ckpt_lck);
- 			new_started = bgs->ckpt_started;
- 			SpinLockRelease(&bgs->ckpt_lck);
- 
- 			if (new_started != old_started)
- 				break;
- 
- 			CHECK_FOR_INTERRUPTS();
- 			pg_usleep(100000L);
- 		}
- 
- 		/*
- 		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
- 		 */
- 		for (;;)
- 		{
- 			int			new_done;
- 
- 			SpinLockAcquire(&bgs->ckpt_lck);
- 			new_done = bgs->ckpt_done;
- 			new_failed = bgs->ckpt_failed;
- 			SpinLockRelease(&bgs->ckpt_lck);
- 
- 			if (new_done - new_started >= 0)
- 				break;
- 
- 			CHECK_FOR_INTERRUPTS();
- 			pg_usleep(100000L);
- 		}
- 
- 		if (new_failed != old_failed)
- 			ereport(ERROR,
- 					(errmsg("checkpoint request failed"),
- 					 errhint("Consult recent messages in the server log for details.")));
- 	}
- }
- 
- /*
-  * ForwardFsyncRequest
-  *		Forward a file-fsync request from a backend to the bgwriter
-  *
-  * Whenever a backend is compelled to write directly to a relation
-  * (which should be seldom, if the bgwriter is getting its job done),
-  * the backend calls this routine to pass over knowledge that the relation
-  * is dirty and must be fsync'd before next checkpoint.  We also use this
-  * opportunity to count such writes for statistical purposes.
-  *
-  * segno specifies which segment (not block!) of the relation needs to be
-  * fsync'd.  (Since the valid range is much less than BlockNumber, we can
-  * use high values for special flags; that's all internal to md.c, which
-  * see for details.)
-  *
-  * To avoid holding the lock for longer than necessary, we normally write
-  * to the requests[] queue without checking for duplicates.  The bgwriter
-  * will have to eliminate dups internally anyway.  However, if we discover
-  * that the queue is full, we make a pass over the entire queue to compact
-  * it.	This is somewhat expensive, but the alternative is for the backend
-  * to perform its own fsync, which is far more expensive in practice.  It
-  * is theoretically possible a backend fsync might still be necessary, if
-  * the queue is full and contains no duplicate entries.  In that case, we
-  * let the backend know by returning false.
-  */
- bool
- ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
- 					BlockNumber segno)
- {
- 	BgWriterRequest *request;
- 
- 	if (!IsUnderPostmaster)
- 		return false;			/* probably shouldn't even get here */
- 
- 	if (am_bg_writer)
- 		elog(ERROR, "ForwardFsyncRequest must not be called in bgwriter");
- 
- 	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
- 
- 	/* Count all backend writes regardless of if they fit in the queue */
- 	BgWriterShmem->num_backend_writes++;
- 
- 	/*
- 	 * If the background writer isn't running or the request queue is full,
- 	 * the backend will have to perform its own fsync request.	But before
- 	 * forcing that to happen, we can try to compact the background writer
- 	 * request queue.
- 	 */
- 	if (BgWriterShmem->bgwriter_pid == 0 ||
- 		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
- 		 && !CompactBgwriterRequestQueue()))
- 	{
- 		/*
- 		 * Count the subset of writes where backends have to do their own
- 		 * fsync
- 		 */
- 		BgWriterShmem->num_backend_fsync++;
- 		LWLockRelease(BgWriterCommLock);
- 		return false;
- 	}
- 	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
- 	request->rnode = rnode;
- 	request->forknum = forknum;
- 	request->segno = segno;
- 	LWLockRelease(BgWriterCommLock);
- 	return true;
- }
- 
- /*
-  * CompactBgwriterRequestQueue
-  *		Remove duplicates from the request queue to avoid backend fsyncs.
-  *
-  * Although a full fsync request queue is not common, it can lead to severe
-  * performance problems when it does happen.  So far, this situation has
-  * only been observed to occur when the system is under heavy write load,
-  * and especially during the "sync" phase of a checkpoint.	Without this
-  * logic, each backend begins doing an fsync for every block written, which
-  * gets very expensive and can slow down the whole system.
-  *
-  * Trying to do this every time the queue is full could lose if there
-  * aren't any removable entries.  But should be vanishingly rare in
-  * practice: there's one queue entry per shared buffer.
-  */
- static bool
- CompactBgwriterRequestQueue()
- {
- 	struct BgWriterSlotMapping
- 	{
- 		BgWriterRequest request;
- 		int			slot;
- 	};
- 
- 	int			n,
- 				preserve_count;
- 	int			num_skipped = 0;
- 	HASHCTL		ctl;
- 	HTAB	   *htab;
- 	bool	   *skip_slot;
- 
- 	/* must hold BgWriterCommLock in exclusive mode */
- 	Assert(LWLockHeldByMe(BgWriterCommLock));
- 
- 	/* Initialize temporary hash table */
- 	MemSet(&ctl, 0, sizeof(ctl));
- 	ctl.keysize = sizeof(BgWriterRequest);
- 	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
- 	ctl.hash = tag_hash;
- 	htab = hash_create("CompactBgwriterRequestQueue",
- 					   BgWriterShmem->num_requests,
- 					   &ctl,
- 					   HASH_ELEM | HASH_FUNCTION);
- 
- 	/* Initialize skip_slot array */
- 	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
- 
- 	/*
- 	 * The basic idea here is that a request can be skipped if it's followed
- 	 * by a later, identical request.  It might seem more sensible to work
- 	 * backwards from the end of the queue and check whether a request is
- 	 * *preceded* by an earlier, identical request, in the hopes of doing less
- 	 * copying.  But that might change the semantics, if there's an
- 	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
- 	 * we do it this way.  It would be possible to be even smarter if we made
- 	 * the code below understand the specific semantics of such requests (it
- 	 * could blow away preceding entries that would end up being canceled
- 	 * anyhow), but it's not clear that the extra complexity would buy us
- 	 * anything.
- 	 */
- 	for (n = 0; n < BgWriterShmem->num_requests; ++n)
- 	{
- 		BgWriterRequest *request;
- 		struct BgWriterSlotMapping *slotmap;
- 		bool		found;
- 
- 		request = &BgWriterShmem->requests[n];
- 		slotmap = hash_search(htab, request, HASH_ENTER, &found);
- 		if (found)
- 		{
- 			skip_slot[slotmap->slot] = true;
- 			++num_skipped;
- 		}
- 		slotmap->slot = n;
- 	}
- 
- 	/* Done with the hash table. */
- 	hash_destroy(htab);
- 
- 	/* If no duplicates, we're out of luck. */
- 	if (!num_skipped)
- 	{
- 		pfree(skip_slot);
- 		return false;
- 	}
- 
- 	/* We found some duplicates; remove them. */
- 	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
- 	{
- 		if (skip_slot[n])
- 			continue;
- 		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
- 	}
- 	ereport(DEBUG1,
- 	   (errmsg("compacted fsync request queue from %d entries to %d entries",
- 			   BgWriterShmem->num_requests, preserve_count)));
- 	BgWriterShmem->num_requests = preserve_count;
- 
- 	/* Cleanup. */
- 	pfree(skip_slot);
- 	return true;
- }
- 
- /*
-  * AbsorbFsyncRequests
-  *		Retrieve queued fsync requests and pass them to local smgr.
-  *
-  * This is exported because it must be called during CreateCheckPoint;
-  * we have to be sure we have accepted all pending requests just before
-  * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
-  * non-bgwriter processes, do nothing if not bgwriter.
-  */
- void
- AbsorbFsyncRequests(void)
- {
- 	BgWriterRequest *requests = NULL;
- 	BgWriterRequest *request;
- 	int			n;
- 
- 	if (!am_bg_writer)
- 		return;
- 
- 	/*
- 	 * We have to PANIC if we fail to absorb all the pending requests (eg,
- 	 * because our hashtable runs out of memory).  This is because the system
- 	 * cannot run safely if we are unable to fsync what we have been told to
- 	 * fsync.  Fortunately, the hashtable is so small that the problem is
- 	 * quite unlikely to arise in practice.
- 	 */
- 	START_CRIT_SECTION();
- 
- 	/*
- 	 * We try to avoid holding the lock for a long time by copying the request
- 	 * array.
- 	 */
- 	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
- 
- 	/* Transfer write count into pending pgstats message */
- 	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
- 	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
- 
- 	BgWriterShmem->num_backend_writes = 0;
- 	BgWriterShmem->num_backend_fsync = 0;
- 
- 	n = BgWriterShmem->num_requests;
- 	if (n > 0)
- 	{
- 		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
- 		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
- 	}
- 	BgWriterShmem->num_requests = 0;
- 
- 	LWLockRelease(BgWriterCommLock);
- 
- 	for (request = requests; n > 0; request++, n--)
- 		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
- 
- 	if (requests)
- 		pfree(requests);
- 
- 	END_CRIT_SECTION();
- }
--- 359,367 ----
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 206,211 **** bool		restart_after_crash = true;
--- 206,212 ----
  /* PIDs of special child processes; 0 when not running */
  static pid_t StartupPID = 0,
  			BgWriterPID = 0,
+ 			CheckpointerPID = 0,
  			WalWriterPID = 0,
  			WalReceiverPID = 0,
  			AutoVacPID = 0,
***************
*** 277,283 **** typedef enum
  	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
  	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
  	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
! 	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
  	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
  								 * finish */
  	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
--- 278,284 ----
  	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
  	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
  	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
! 	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown ckpt */
  	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
  								 * finish */
  	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
***************
*** 462,467 **** static void ShmemBackendArrayRemove(Backend *bn);
--- 463,469 ----
  
  #define StartupDataBase()		StartChildProcess(StartupProcess)
  #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+ #define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
  #define StartWalWriter()		StartChildProcess(WalWriterProcess)
  #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
  
***************
*** 1029,1036 **** PostmasterMain(int argc, char *argv[])
  	 * CAUTION: when changing this list, check for side-effects on the signal
  	 * handling setup of child processes.  See tcop/postgres.c,
  	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
! 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, and
! 	 * postmaster/syslogger.c.
  	 */
  	pqinitmask();
  	PG_SETMASK(&BlockSig);
--- 1031,1038 ----
  	 * CAUTION: when changing this list, check for side-effects on the signal
  	 * handling setup of child processes.  See tcop/postgres.c,
  	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
! 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
! 	 * postmaster/syslogger.c and postmaster/checkpointer.c.
  	 */
  	pqinitmask();
  	PG_SETMASK(&BlockSig);
***************
*** 1367,1376 **** ServerLoop(void)
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 &&
! 			(pmState == PM_RUN || pmState == PM_RECOVERY ||
! 			 pmState == PM_HOT_STANDBY))
! 			BgWriterPID = StartBackgroundWriter();
  
  		/*
  		 * Likewise, if we have lost the walwriter process, try to start a new
--- 1369,1382 ----
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
! 			 pmState == PM_HOT_STANDBY)
! 		{
! 			if (BgWriterPID == 0)
! 				BgWriterPID = StartBackgroundWriter();
! 			if (CheckpointerPID == 0)
! 				CheckpointerPID = StartCheckpointer();
! 		}
  
  		/*
  		 * Likewise, if we have lost the walwriter process, try to start a new
***************
*** 2048,2053 **** SIGHUP_handler(SIGNAL_ARGS)
--- 2054,2061 ----
  			signal_child(StartupPID, SIGHUP);
  		if (BgWriterPID != 0)
  			signal_child(BgWriterPID, SIGHUP);
+ 		if (CheckpointerPID != 0)
+ 			signal_child(CheckpointerPID, SIGHUP);
  		if (WalWriterPID != 0)
  			signal_child(WalWriterPID, SIGHUP);
  		if (WalReceiverPID != 0)
***************
*** 2162,2168 **** pmdie(SIGNAL_ARGS)
  				signal_child(WalReceiverPID, SIGTERM);
  			if (pmState == PM_RECOVERY)
  			{
! 				/* only bgwriter is active in this state */
  				pmState = PM_WAIT_BACKENDS;
  			}
  			else if (pmState == PM_RUN ||
--- 2170,2176 ----
  				signal_child(WalReceiverPID, SIGTERM);
  			if (pmState == PM_RECOVERY)
  			{
! 				/* only checkpointer is active in this state */
  				pmState = PM_WAIT_BACKENDS;
  			}
  			else if (pmState == PM_RUN ||
***************
*** 2207,2212 **** pmdie(SIGNAL_ARGS)
--- 2215,2222 ----
  				signal_child(StartupPID, SIGQUIT);
  			if (BgWriterPID != 0)
  				signal_child(BgWriterPID, SIGQUIT);
+ 			if (CheckpointerPID != 0)
+ 				signal_child(CheckpointerPID, SIGQUIT);
  			if (WalWriterPID != 0)
  				signal_child(WalWriterPID, SIGQUIT);
  			if (WalReceiverPID != 0)
***************
*** 2337,2348 **** reaper(SIGNAL_ARGS)
  			}
  
  			/*
! 			 * Crank up the background writer, if we didn't do that already
  			 * when we entered consistent recovery state.  It doesn't matter
  			 * if this fails, we'll just try again later.
  			 */
  			if (BgWriterPID == 0)
  				BgWriterPID = StartBackgroundWriter();
  
  			/*
  			 * Likewise, start other special children as needed.  In a restart
--- 2347,2360 ----
  			}
  
  			/*
! 			 * Crank up background tasks, if we didn't do that already
  			 * when we entered consistent recovery state.  It doesn't matter
  			 * if this fails, we'll just try again later.
  			 */
  			if (BgWriterPID == 0)
  				BgWriterPID = StartBackgroundWriter();
+ 			if (CheckpointerPID == 0)
+ 				CheckpointerPID = StartCheckpointer();
  
  			/*
  			 * Likewise, start other special children as needed.  In a restart
***************
*** 2370,2379 **** reaper(SIGNAL_ARGS)
  		if (pid == BgWriterPID)
  		{
  			BgWriterPID = 0;
  			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
  			{
  				/*
! 				 * OK, we saw normal exit of the bgwriter after it's been told
  				 * to shut down.  We expect that it wrote a shutdown
  				 * checkpoint.	(If for some reason it didn't, recovery will
  				 * occur on next postmaster start.)
--- 2382,2403 ----
  		if (pid == BgWriterPID)
  		{
  			BgWriterPID = 0;
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 				HandleChildCrash(pid, exitstatus,
+ 								 _("background writer process"));
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the checkpointer?
+ 		 */
+ 		if (pid == CheckpointerPID)
+ 		{
+ 			CheckpointerPID = 0;
  			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
  			{
  				/*
! 				 * OK, we saw normal exit of the checkpointer after it's been told
  				 * to shut down.  We expect that it wrote a shutdown
  				 * checkpoint.	(If for some reason it didn't, recovery will
  				 * occur on next postmaster start.)
***************
*** 2410,2420 **** reaper(SIGNAL_ARGS)
  			else
  			{
  				/*
! 				 * Any unexpected exit of the bgwriter (including FATAL exit)
  				 * is treated as a crash.
  				 */
  				HandleChildCrash(pid, exitstatus,
! 								 _("background writer process"));
  			}
  
  			continue;
--- 2434,2444 ----
  			else
  			{
  				/*
! 				 * Any unexpected exit of the checkpointer (including FATAL exit)
  				 * is treated as a crash.
  				 */
  				HandleChildCrash(pid, exitstatus,
! 								 _("checkpointer process"));
  			}
  
  			continue;
***************
*** 2598,2605 **** CleanupBackend(int pid,
  }
  
  /*
!  * HandleChildCrash -- cleanup after failed backend, bgwriter, walwriter,
!  * or autovacuum.
   *
   * The objectives here are to clean up our local state about the child
   * process, and to signal all other remaining children to quickdie.
--- 2622,2629 ----
  }
  
  /*
!  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
!  * walwriter or autovacuum.
   *
   * The objectives here are to clean up our local state about the child
   * process, and to signal all other remaining children to quickdie.
***************
*** 2692,2697 **** HandleChildCrash(int pid, int exitstatus, const char *procname)
--- 2716,2733 ----
  		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
  	}
  
+ 	/* Take care of the checkpointer too */
+ 	if (pid == CheckpointerPID)
+ 		CheckpointerPID = 0;
+ 	else if (CheckpointerPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 								 (int) CheckpointerPID)));
+ 		signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
+ 	}
+ 
  	/* Take care of the walwriter too */
  	if (pid == WalWriterPID)
  		WalWriterPID = 0;
***************
*** 2871,2879 **** PostmasterStateMachine(void)
  	{
  		/*
  		 * PM_WAIT_BACKENDS state ends when we have no regular backends
! 		 * (including autovac workers) and no walwriter or autovac launcher.
! 		 * If we are doing crash recovery then we expect the bgwriter to exit
! 		 * too, otherwise not.	The archiver, stats, and syslogger processes
  		 * are disregarded since they are not connected to shared memory; we
  		 * also disregard dead_end children here. Walsenders are also
  		 * disregarded, they will be terminated later after writing the
--- 2907,2916 ----
  	{
  		/*
  		 * PM_WAIT_BACKENDS state ends when we have no regular backends
! 		 * (including autovac workers) and no walwriter, autovac launcher
! 		 * or bgwriter.  If we are doing crash recovery then we expect the
! 		 * checkpointer to exit as well, otherwise not.
! 		 * The archiver, stats, and syslogger processes
  		 * are disregarded since they are not connected to shared memory; we
  		 * also disregard dead_end children here. Walsenders are also
  		 * disregarded, they will be terminated later after writing the
***************
*** 2882,2888 **** PostmasterStateMachine(void)
  		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
  			StartupPID == 0 &&
  			WalReceiverPID == 0 &&
! 			(BgWriterPID == 0 || !FatalError) &&
  			WalWriterPID == 0 &&
  			AutoVacPID == 0)
  		{
--- 2919,2926 ----
  		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
  			StartupPID == 0 &&
  			WalReceiverPID == 0 &&
! 			BgWriterPID == 0 &&
! 			(CheckpointerPID == 0 || !FatalError) &&
  			WalWriterPID == 0 &&
  			AutoVacPID == 0)
  		{
***************
*** 2904,2925 **** PostmasterStateMachine(void)
  				/*
  				 * If we get here, we are proceeding with normal shutdown. All
  				 * the regular children are gone, and it's time to tell the
! 				 * bgwriter to do a shutdown checkpoint.
  				 */
  				Assert(Shutdown > NoShutdown);
! 				/* Start the bgwriter if not running */
! 				if (BgWriterPID == 0)
! 					BgWriterPID = StartBackgroundWriter();
  				/* And tell it to shut down */
! 				if (BgWriterPID != 0)
  				{
! 					signal_child(BgWriterPID, SIGUSR2);
  					pmState = PM_SHUTDOWN;
  				}
  				else
  				{
  					/*
! 					 * If we failed to fork a bgwriter, just shut down. Any
  					 * required cleanup will happen at next restart. We set
  					 * FatalError so that an "abnormal shutdown" message gets
  					 * logged when we exit.
--- 2942,2963 ----
  				/*
  				 * If we get here, we are proceeding with normal shutdown. All
  				 * the regular children are gone, and it's time to tell the
! 				 * checkpointer to do a shutdown checkpoint.
  				 */
  				Assert(Shutdown > NoShutdown);
! 				/* Start the checkpointer if not running */
! 				if (CheckpointerPID == 0)
! 					CheckpointerPID = StartCheckpointer();
  				/* And tell it to shut down */
! 				if (CheckpointerPID != 0)
  				{
! 					signal_child(CheckpointerPID, SIGUSR2);
  					pmState = PM_SHUTDOWN;
  				}
  				else
  				{
  					/*
! 					 * If we failed to fork a checkpointer, just shut down. Any
  					 * required cleanup will happen at next restart. We set
  					 * FatalError so that an "abnormal shutdown" message gets
  					 * logged when we exit.
***************
*** 2978,2983 **** PostmasterStateMachine(void)
--- 3016,3022 ----
  			Assert(StartupPID == 0);
  			Assert(WalReceiverPID == 0);
  			Assert(BgWriterPID == 0);
+ 			Assert(CheckpointerPID == 0);
  			Assert(WalWriterPID == 0);
  			Assert(AutoVacPID == 0);
  			/* syslogger is not considered here */
***************
*** 4157,4162 **** sigusr1_handler(SIGNAL_ARGS)
--- 4196,4203 ----
  		 */
  		Assert(BgWriterPID == 0);
  		BgWriterPID = StartBackgroundWriter();
+ 		Assert(CheckpointerPID == 0);
+ 		CheckpointerPID = StartCheckpointer();
  
  		pmState = PM_RECOVERY;
  	}
***************
*** 4443,4448 **** StartChildProcess(AuxProcType type)
--- 4484,4493 ----
  				ereport(LOG,
  				   (errmsg("could not fork background writer process: %m")));
  				break;
+ 			case CheckpointerProcess:
+ 				ereport(LOG,
+ 				   (errmsg("could not fork checkpointer process: %m")));
+ 				break;
  			case WalWriterProcess:
  				ereport(LOG,
  						(errmsg("could not fork WAL writer process: %m")));
*** a/src/backend/storage/buffer/bufmgr.c
--- b/src/backend/storage/buffer/bufmgr.c
***************
*** 1278,1288 **** BufferSync(int flags)
  					break;
  
  				/*
! 				 * Perform normal bgwriter duties and sleep to throttle our
! 				 * I/O rate.
  				 */
! 				CheckpointWriteDelay(flags,
! 									 (double) num_written / num_to_write);
  			}
  		}
  
--- 1278,1286 ----
  					break;
  
  				/*
! 				 * Sleep to throttle our I/O rate.
  				 */
! 				CheckpointWriteDelay(flags, num_written, num_to_write);
  			}
  		}
  
*** a/src/backend/storage/smgr/md.c
--- b/src/backend/storage/smgr/md.c
***************
*** 38,44 ****
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
!  * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
   * fsync request from the queue if an identical, subsequent request is found.
   * See comments there before making changes here.
   */
--- 38,44 ----
  /*
   * Special values for the segno arg to RememberFsyncRequest.
   *
!  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
   * fsync request from the queue if an identical, subsequent request is found.
   * See comments there before making changes here.
   */
***************
*** 77,83 ****
   *	Inactive segments are those that once contained data but are currently
   *	not needed because of an mdtruncate() operation.  The reason for leaving
   *	them present at size zero, rather than unlinking them, is that other
!  *	backends and/or the bgwriter might be holding open file references to
   *	such segments.	If the relation expands again after mdtruncate(), such
   *	that a deactivated segment becomes active again, it is important that
   *	such file references still be valid --- else data might get written
--- 77,83 ----
   *	Inactive segments are those that once contained data but are currently
   *	not needed because of an mdtruncate() operation.  The reason for leaving
   *	them present at size zero, rather than unlinking them, is that other
!  *	backends and/or the checkpointer might be holding open file references to
   *	such segments.	If the relation expands again after mdtruncate(), such
   *	that a deactivated segment becomes active again, it is important that
   *	such file references still be valid --- else data might get written
***************
*** 111,117 **** static MemoryContext MdCxt;		/* context for all md.c allocations */
  
  
  /*
!  * In some contexts (currently, standalone backends and the bgwriter process)
   * we keep track of pending fsync operations: we need to remember all relation
   * segments that have been written since the last checkpoint, so that we can
   * fsync them down to disk before completing the next checkpoint.  This hash
--- 111,117 ----
  
  
  /*
!  * In some contexts (currently, standalone backends and the checkpointer process)
   * we keep track of pending fsync operations: we need to remember all relation
   * segments that have been written since the last checkpoint, so that we can
   * fsync them down to disk before completing the next checkpoint.  This hash
***************
*** 123,129 **** static MemoryContext MdCxt;		/* context for all md.c allocations */
   * a hash table, because we don't expect there to be any duplicate requests.
   *
   * (Regular backends do not track pending operations locally, but forward
!  * them to the bgwriter.)
   */
  typedef struct
  {
--- 123,129 ----
   * a hash table, because we don't expect there to be any duplicate requests.
   *
   * (Regular backends do not track pending operations locally, but forward
!  * them to the checkpointer.)
   */
  typedef struct
  {
***************
*** 194,200 **** mdinit(void)
  	 * Create pending-operations hashtable if we need it.  Currently, we need
  	 * it if we are standalone (not under a postmaster) OR if we are a
  	 * bootstrap-mode subprocess of a postmaster (that is, a startup or
! 	 * bgwriter process).
  	 */
  	if (!IsUnderPostmaster || IsBootstrapProcessingMode())
  	{
--- 194,200 ----
  	 * Create pending-operations hashtable if we need it.  Currently, we need
  	 * it if we are standalone (not under a postmaster) OR if we are a
  	 * bootstrap-mode subprocess of a postmaster (that is, a startup or
! 	 * checkpointer process).
  	 */
  	if (!IsUnderPostmaster || IsBootstrapProcessingMode())
  	{
***************
*** 214,223 **** mdinit(void)
  }
  
  /*
!  * In archive recovery, we rely on bgwriter to do fsyncs, but we will have
   * already created the pendingOpsTable during initialization of the startup
   * process.  Calling this function drops the local pendingOpsTable so that
!  * subsequent requests will be forwarded to bgwriter.
   */
  void
  SetForwardFsyncRequests(void)
--- 214,223 ----
  }
  
  /*
!  * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
   * already created the pendingOpsTable during initialization of the startup
   * process.  Calling this function drops the local pendingOpsTable so that
!  * subsequent requests will be forwarded to checkpointer.
   */
  void
  SetForwardFsyncRequests(void)
***************
*** 765,773 **** mdnblocks(SMgrRelation reln, ForkNumber forknum)
  	 * NOTE: this assumption could only be wrong if another backend has
  	 * truncated the relation.	We rely on higher code levels to handle that
  	 * scenario by closing and re-opening the md fd, which is handled via
! 	 * relcache flush.	(Since the bgwriter doesn't participate in relcache
  	 * flush, it could have segment chain entries for inactive segments;
! 	 * that's OK because the bgwriter never needs to compute relation size.)
  	 */
  	while (v->mdfd_chain != NULL)
  	{
--- 765,773 ----
  	 * NOTE: this assumption could only be wrong if another backend has
  	 * truncated the relation.	We rely on higher code levels to handle that
  	 * scenario by closing and re-opening the md fd, which is handled via
! 	 * relcache flush.	(Since the checkpointer doesn't participate in relcache
  	 * flush, it could have segment chain entries for inactive segments;
! 	 * that's OK because the checkpointer never needs to compute relation size.)
  	 */
  	while (v->mdfd_chain != NULL)
  	{
***************
*** 957,963 **** mdsync(void)
  		elog(ERROR, "cannot sync without a pendingOpsTable");
  
  	/*
! 	 * If we are in the bgwriter, the sync had better include all fsync
  	 * requests that were queued by backends up to this point.	The tightest
  	 * race condition that could occur is that a buffer that must be written
  	 * and fsync'd for the checkpoint could have been dumped by a backend just
--- 957,963 ----
  		elog(ERROR, "cannot sync without a pendingOpsTable");
  
  	/*
! 	 * If we are in the checkpointer, the sync had better include all fsync
  	 * requests that were queued by backends up to this point.	The tightest
  	 * race condition that could occur is that a buffer that must be written
  	 * and fsync'd for the checkpoint could have been dumped by a backend just
***************
*** 1033,1039 **** mdsync(void)
  			int			failures;
  
  			/*
! 			 * If in bgwriter, we want to absorb pending requests every so
  			 * often to prevent overflow of the fsync request queue.  It is
  			 * unspecified whether newly-added entries will be visited by
  			 * hash_seq_search, but we don't care since we don't need to
--- 1033,1039 ----
  			int			failures;
  
  			/*
! 			 * If in checkpointer, we want to absorb pending requests every so
  			 * often to prevent overflow of the fsync request queue.  It is
  			 * unspecified whether newly-added entries will be visited by
  			 * hash_seq_search, but we don't care since we don't need to
***************
*** 1070,1078 **** mdsync(void)
  				 * say "but an unreferenced SMgrRelation is still a leak!" Not
  				 * really, because the only case in which a checkpoint is done
  				 * by a process that isn't about to shut down is in the
! 				 * bgwriter, and it will periodically do smgrcloseall(). This
  				 * fact justifies our not closing the reln in the success path
! 				 * either, which is a good thing since in non-bgwriter cases
  				 * we couldn't safely do that.)  Furthermore, in many cases
  				 * the relation will have been dirtied through this same smgr
  				 * relation, and so we can save a file open/close cycle.
--- 1070,1078 ----
  				 * say "but an unreferenced SMgrRelation is still a leak!" Not
  				 * really, because the only case in which a checkpoint is done
  				 * by a process that isn't about to shut down is in the
! 				 * checkpointer, and it will periodically do smgrcloseall(). This
  				 * fact justifies our not closing the reln in the success path
! 				 * either, which is a good thing since in non-checkpointer cases
  				 * we couldn't safely do that.)  Furthermore, in many cases
  				 * the relation will have been dirtied through this same smgr
  				 * relation, and so we can save a file open/close cycle.
***************
*** 1301,1307 **** register_unlink(RelFileNodeBackend rnode)
  	else
  	{
  		/*
! 		 * Notify the bgwriter about it.  If we fail to queue the request
  		 * message, we have to sleep and try again, because we can't simply
  		 * delete the file now.  Ugly, but hopefully won't happen often.
  		 *
--- 1301,1307 ----
  	else
  	{
  		/*
! 		 * Notify the checkpointer about it.  If we fail to queue the request
  		 * message, we have to sleep and try again, because we can't simply
  		 * delete the file now.  Ugly, but hopefully won't happen often.
  		 *
***************
*** 1315,1324 **** register_unlink(RelFileNodeBackend rnode)
  }
  
  /*
!  * RememberFsyncRequest() -- callback from bgwriter side of fsync request
   *
   * We stuff most fsync requests into the local hash table for execution
!  * during the bgwriter's next checkpoint.  UNLINK requests go into a
   * separate linked list, however, because they get processed separately.
   *
   * The range of possible segment numbers is way less than the range of
--- 1315,1324 ----
  }
  
  /*
!  * RememberFsyncRequest() -- callback from checkpointer side of fsync request
   *
   * We stuff most fsync requests into the local hash table for execution
!  * during the checkpointer's next checkpoint.  UNLINK requests go into a
   * separate linked list, however, because they get processed separately.
   *
   * The range of possible segment numbers is way less than the range of
***************
*** 1460,1479 **** ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
  	else if (IsUnderPostmaster)
  	{
  		/*
! 		 * Notify the bgwriter about it.  If we fail to queue the revoke
  		 * message, we have to sleep and try again ... ugly, but hopefully
  		 * won't happen often.
  		 *
  		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
  		 * error would leave the no-longer-used file still present on disk,
! 		 * which would be bad, so I'm inclined to assume that the bgwriter
  		 * will always empty the queue soon.
  		 */
  		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
  			pg_usleep(10000L);	/* 10 msec seems a good number */
  
  		/*
! 		 * Note we don't wait for the bgwriter to actually absorb the revoke
  		 * message; see mdsync() for the implications.
  		 */
  	}
--- 1460,1479 ----
  	else if (IsUnderPostmaster)
  	{
  		/*
! 		 * Notify the checkpointer about it.  If we fail to queue the revoke
  		 * message, we have to sleep and try again ... ugly, but hopefully
  		 * won't happen often.
  		 *
  		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
  		 * error would leave the no-longer-used file still present on disk,
! 		 * which would be bad, so I'm inclined to assume that the checkpointer
  		 * will always empty the queue soon.
  		 */
  		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
  			pg_usleep(10000L);	/* 10 msec seems a good number */
  
  		/*
! 		 * Note we don't wait for the checkpointer to actually absorb the revoke
  		 * message; see mdsync() for the implications.
  		 */
  	}
*** a/src/include/access/xlog_internal.h
--- b/src/include/access/xlog_internal.h
***************
*** 256,262 **** typedef struct RmgrData
  extern const RmgrData RmgrTable[];
  
  /*
!  * Exported to support xlog switching from bgwriter
   */
  extern pg_time_t GetLastSegSwitchTime(void);
  extern XLogRecPtr RequestXLogSwitch(void);
--- 256,262 ----
  extern const RmgrData RmgrTable[];
  
  /*
!  * Exported to support xlog switching from checkpointer
   */
  extern pg_time_t GetLastSegSwitchTime(void);
  extern XLogRecPtr RequestXLogSwitch(void);
*** a/src/include/bootstrap/bootstrap.h
--- b/src/include/bootstrap/bootstrap.h
***************
*** 22,27 **** typedef enum
--- 22,28 ----
  	BootstrapProcess,
  	StartupProcess,
  	BgWriterProcess,
+ 	CheckpointerProcess,
  	WalWriterProcess,
  	WalReceiverProcess,
  
*** a/src/include/postmaster/bgwriter.h
--- b/src/include/postmaster/bgwriter.h
***************
*** 23,31 **** extern int	CheckPointWarning;
  extern double CheckPointCompletionTarget;
  
  extern void BackgroundWriterMain(void);
  
  extern void RequestCheckpoint(int flags);
! extern void CheckpointWriteDelay(int flags, double progress);
  
  extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  					BlockNumber segno);
--- 23,32 ----
  extern double CheckPointCompletionTarget;
  
  extern void BackgroundWriterMain(void);
+ extern void CheckpointerMain(void);
  
  extern void RequestCheckpoint(int flags);
! extern void CheckpointWriteDelay(int flags, int num_written, int num_to_write);
  
  extern bool ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
  					BlockNumber segno);
*** a/src/include/storage/proc.h
--- b/src/include/storage/proc.h
***************
*** 190,200 **** extern PROC_HDR *ProcGlobal;
   * We set aside some extra PGPROC structures for auxiliary processes,
   * ie things that aren't full-fledged backends but need shmem access.
   *
!  * Background writer and WAL writer run during normal operation. Startup
!  * process and WAL receiver also consume 2 slots, but WAL writer is
!  * launched only after startup has exited, so we only need 3 slots.
   */
! #define NUM_AUXILIARY_PROCS		3
  
  
  /* configurable options */
--- 190,200 ----
   * We set aside some extra PGPROC structures for auxiliary processes,
   * ie things that aren't full-fledged backends but need shmem access.
   *
!  * Background writer, checkpointer and WAL writer run during normal operation.
!  * Startup process and WAL receiver also consume 2 slots, but WAL writer is
!  * launched only after startup has exited, so we only need 4 slots.
   */
! #define NUM_AUXILIARY_PROCS		4
  
  
  /* configurable options */
*** a/src/include/storage/procsignal.h
--- b/src/include/storage/procsignal.h
***************
*** 19,25 ****
  
  /*
   * Reasons for signalling a Postgres child process (a backend or an auxiliary
!  * process, like bgwriter).  We can cope with concurrent signals for different
   * reasons.  However, if the same reason is signaled multiple times in quick
   * succession, the process is likely to observe only one notification of it.
   * This is okay for the present uses.
--- 19,25 ----
  
  /*
   * Reasons for signalling a Postgres child process (a backend or an auxiliary
!  * process, like checkpointer).  We can cope with concurrent signals for different
   * reasons.  However, if the same reason is signaled multiple times in quick
   * succession, the process is likely to observe only one notification of it.
   * This is okay for the present uses.

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Simon Riggs (#1)

Re: Separating bgwriter and checkpointer

On Fri, Sep 16, 2011 at 7:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

I like this idea to simplify the code. How much performance gain can we
expect by this patch?

The current patch has a bug at shutdown I've not located yet, but it seems
likely to be a simple error. That is mainly because for personal reasons
I've not been able to work on the patch recently. I expect to be able
to fix that later in the CF.

You seem to have forgotten to include checkpointer.c and .h in the patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Fujii Masao (#2)

Re: Separating bgwriter and checkpointer

On Fri, Sep 16, 2011 at 2:38 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Sep 16, 2011 at 7:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

I like this idea to simplify the code. How much performance gain can we
expect by this patch?

On heavily I/O bound systems, this is likely to make a noticeable
difference, since bgwriter reduces I/O in user processes.

The overhead of sending signals between processes is much less than I
had previously thought, so I expect no problems there, even on highly
loaded systems.

The current patch has a bug at shutdown I've not located yet, but it seems
likely to be a simple error. That is mainly because for personal reasons
I've not been able to work on the patch recently. I expect to be able
to fix that later in the CF.

You seem to have forgotten to include checkpointer.c and .h in the patch.

I confirm this error. I'll repost full patch later in the week when I
have more time.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Simon Riggs (#3)

Re: Separating bgwriter and checkpointer

On 20.09.2011 10:48, Simon Riggs wrote:

On Fri, Sep 16, 2011 at 2:38 AM, Fujii Masao<masao.fujii@gmail.com> wrote:

On Fri, Sep 16, 2011 at 7:53 AM, Simon Riggs<simon@2ndquadrant.com> wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

I like this idea to simplify the code. How much performance gain can we
expect by this patch?

On heavily I/O bound systems, this is likely to make a noticeable
difference, since bgwriter reduces I/O in user processes.

Hmm. If the system is I/O bound, it doesn't matter which process
performs the I/O. It's still the same amount of I/O in total, and in an
I/O bound system, that's what determines the overall throughput.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Heikki Linnakangas (#4)

Re: Separating bgwriter and checkpointer

On Tue, Sep 20, 2011 at 9:06 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 20.09.2011 10:48, Simon Riggs wrote:

On Fri, Sep 16, 2011 at 2:38 AM, Fujii Masao<masao.fujii@gmail.com>
wrote:

On Fri, Sep 16, 2011 at 7:53 AM, Simon Riggs<simon@2ndquadrant.com>
wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

I like this idea to simplify the code. How much performance gain can we
expect by this patch?

On heavily I/O bound systems, this is likely to make a noticeable
difference, since bgwriter reduces I/O in user processes.

Hmm. If the system is I/O bound, it doesn't matter which process performs
the I/O. It's still the same amount of I/O in total, and in an I/O bound
system, that's what determines the overall throughput.

That's true, but not relevant.

The bgwriter avoids I/O, if it is operating correctly. This patch
ensures it continues to operate even during heavy checkpoints. So it
helps avoid extra I/O during a period of very high I/O activity.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Simon Riggs (#5)

Re: Separating bgwriter and checkpointer

On 20.09.2011 11:18, Simon Riggs wrote:

The bgwriter avoids I/O, if it is operating correctly. This patch
ensures it continues to operate even during heavy checkpoints. So it
helps avoid extra I/O during a period of very high I/O activity.

I don't see what difference it makes which process does the I/O. If a
write() by checkpointer process blocks, any write()s by the separate
bgwriter process at that time will block too. If the I/O is not
saturated, and the checkpoint write()s don't block, then even without
this patch, the bgwriter process can handle its usual bgwriter duties
during checkpoint just fine. (And if the I/O is not saturated, it's not
an I/O bound system anyway.)

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Heikki Linnakangas (#6)

Re: Separating bgwriter and checkpointer

On Tue, Sep 20, 2011 at 10:03 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 20.09.2011 11:18, Simon Riggs wrote:

The bgwriter avoids I/O, if it is operating correctly. This patch
ensures it continues to operate even during heavy checkpoints. So it
helps avoid extra I/O during a period of very high I/O activity.

I don't see what difference it makes which process does the I/O. If a
write() by checkpointer process blocks, any write()s by the separate
bgwriter process at that time will block too. If the I/O is not saturated,
and the checkpoint write()s don't block, then even without this patch, the
bgwriter process can handle its usual bgwriter duties during checkpoint just
fine. (And if the I/O is not saturated, it's not an I/O bound system
anyway.)

Whatever value you assign to the bgwriter, this patch makes sure
it is still delivered during heavy fsyncs.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Greg Stark

stark@mit.edu

over 14 years ago

In reply to: Simon Riggs (#7)

Re: Separating bgwriter and checkpointer

On Tue, Sep 20, 2011 at 11:03 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

I don't see what difference it makes which process does the I/O. If a
write() by checkpointer process blocks, any write()s by the separate
bgwriter process at that time will block too. If the I/O is not saturated,
and the checkpoint write()s don't block, then even without this patch, the
bgwriter process can handle its usual bgwriter duties during checkpoint just
fine. (And if the I/O is not saturated, it's not an I/O bound system
anyway.)

Whatever value you assign to the bgwriter, then this patch makes sure
that happens during heavy fsyncs.

I think his point is that it doesn't, because if the heavy fsyncs make
the system I/O bound, then the bgwriter will just block while issuing
writes instead of fsyncs.

I'm not actually convinced. Writes will only block if the kernel
decides to block. We don't really know how the kernel makes this
decision but it's entirely possible that having pending physical i/o
issued due to an fsync doesn't influence the decision if there is
still a reasonable number of dirty pages in the buffer cache. In a
sense, "I/O bound" means different things for write and fsync. Or to
put it another way fsync is latency sensitive but write is only
bandwidth sensitive.

All that said my question is which way is the code more legible and
easier to follow?
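
To make the write-versus-fsync distinction above concrete, here is a minimal sketch in C that times a run of buffered write() calls separately from the fsync() that follows them. The function name `time_write_then_fsync`, the scratch-file location under /tmp, and the 8 kB block size are assumptions chosen for illustration; none of this comes from the patch. On many systems the write() loop finishes quickly because the kernel only copies into its cache, while fsync() has to wait for physical I/O, but the actual numbers depend entirely on the kernel and storage in use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#define BLOCK_SIZE 8192			/* assumption: one 8 kB page per write */

/* Read the monotonic clock, in milliseconds. */
static double
now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6;
}

/*
 * Write nblocks blocks to a scratch file, then fsync it once.
 * Elapsed times for the two phases are stored in *write_ms and *fsync_ms.
 * Returns 0 on success, -1 on any I/O error.
 */
int
time_write_then_fsync(int nblocks, double *write_ms, double *fsync_ms)
{
	char		path[] = "/tmp/write_vs_fsync_XXXXXX";	/* hypothetical location */
	char		block[BLOCK_SIZE];
	double		t0, t1, t2;
	int			fd;

	assert(nblocks > 0 && write_ms != NULL && fsync_ms != NULL);

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);				/* file disappears once fd is closed */
	memset(block, 'x', sizeof(block));

	t0 = now_ms();
	for (int i = 0; i < nblocks; i++)
	{
		/* write() normally returns once the data is in the OS cache */
		if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
		{
			close(fd);
			return -1;
		}
	}
	t1 = now_ms();

	/* fsync() returns only after the data has been handed to storage */
	if (fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	t2 = now_ms();

	close(fd);
	*write_ms = t1 - t0;
	*fsync_ms = t2 - t1;
	return 0;
}
```

Calling `time_write_then_fsync(1024, &w, &f)` and printing `w` and `f` gives a rough feel for how the two phases compare on a given machine; it says nothing about how the kernel decides when a write() must block.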

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Greg Stark (#8)

Re: Separating bgwriter and checkpointer

On 20.09.2011 16:29, Greg Stark wrote:

On Tue, Sep 20, 2011 at 11:03 AM, Simon Riggs<simon@2ndquadrant.com> wrote:

I don't see what difference it makes which process does the I/O. If a
write() by checkpointer process blocks, any write()s by the separate
bgwriter process at that time will block too. If the I/O is not saturated,
and the checkpoint write()s don't block, then even without this patch, the
bgwriter process can handle its usual bgwriter duties during checkpoint just
fine. (And if the I/O is not saturated, it's not an I/O bound system
anyway.)

Whatever value you assign to the bgwriter, then this patch makes sure
that happens during heavy fsyncs.

I think his point is that it doesn't, because if the heavy fsyncs make
the system I/O bound, then the bgwriter will just block while issuing
writes instead of fsyncs.

I'm not actually convinced. Writes will only block if the kernel
decides to block. We don't really know how the kernel makes this
decision but it's entirely possible that having pending physical i/o
issued due to an fsync doesn't influence the decision if there is
still a reasonable number of dirty pages in the buffer cache. In a
sense, "I/O bound" means different things for write and fsync. Or to
put it another way fsync is latency sensitive but write is only
bandwidth sensitive.

Yeah, I was thinking of write()s, not fsyncs. I agree this might have
some effect during fsync phase.

All that said my question is which way is the code more legible and
easier to follow?

Hear hear. If we're going to give the bgwriter more responsibilities,
this might make sense even if it has no effect on performance.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10

Magnus Hagander

magnus@hagander.net

over 14 years ago

In reply to: Heikki Linnakangas (#9)

Re: Separating bgwriter and checkpointer

On Tue, Sep 20, 2011 at 15:35, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 20.09.2011 16:29, Greg Stark wrote:

On Tue, Sep 20, 2011 at 11:03 AM, Simon Riggs<simon@2ndquadrant.com>
wrote:

I don't see what difference it makes which process does the I/O. If a
write() by checkpointer process blocks, any write()s by the separate
bgwriter process at that time will block too. If the I/O is not
saturated,
and the checkpoint write()s don't block, then even without this patch,
the
bgwriter process can handle its usual bgwriter duties during checkpoint
just
fine. (And if the I/O is not saturated, it's not an I/O bound system
anyway.)

Whatever value you assign to the bgwriter, then this patch makes sure
that happens during heavy fsyncs.

I think his point is that it doesn't, because if the heavy fsyncs make
the system I/O bound, then the bgwriter will just block while issuing
writes instead of fsyncs.

I'm not actually convinced. Writes will only block if the kernel
decides to block. We don't really know how the kernel makes this
decision but it's entirely possible that having pending physical i/o
issued due to an fsync doesn't influence the decision if there is
still a reasonable number of dirty pages in the buffer cache. In a
sense, "I/O bound" means different things for write and fsync. Or to
put it another way fsync is latency sensitive but write is only
bandwidth sensitive.

Yeah, I was thinking of write()s, not fsyncs. I agree this might have some
effect during fsync phase.

All that said my question is which way is the code more legible and
easier to follow?

Hear hear. If we're going to give the bgwriter more responsibilities, this
might make sense even if it has no effect on performance.

Isn't there also the advantage that work split across two different
processes can use two different CPU cores? Or is that unlikely ever
to come into play here?

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#11

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Magnus Hagander (#10)

Re: Separating bgwriter and checkpointer

On 20.09.2011 16:49, Magnus Hagander wrote:

Isn't there also the advantage that work split across two different
processes can use two different CPU cores? Or is that unlikely ever
to come into play here?

You would need one helluva I/O system to saturate even a single CPU,
just by doing write+fsync.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12

Cédric Villemain

cedric.villemain.debian@gmail.com

over 14 years ago

In reply to: Heikki Linnakangas (#11)

Re: Separating bgwriter and checkpointer

2011/9/20 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:

On 20.09.2011 16:49, Magnus Hagander wrote:

Isn't there also the advantage that work split across two different
processes can use two different CPU cores? Or is that unlikely ever
to come into play here?

You would need one helluva I/O system to saturate even a single CPU, just by
doing write+fsync.

Magnus's point is valid. Linux can throttle I/O per node and
per process/task.
Since 2.6.37 (or 2.6.32?), I believe, it has become more tempting to
use per-cgroup I/O-per-second limits, and there is some promising work
on better per-process I/O bandwidth throttling.

IMO, splitting the type of I/O workload per process gives
administrators more control over the I/O limits they want to set
(and it may help the kernel choose a better strategy?)

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

#13

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 14 years ago

In reply to: Cédric Villemain (#12)

Re: Separating bgwriter and checkpointer

On 20.09.2011 17:31, Cédric Villemain wrote:

2011/9/20 Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>:

On 20.09.2011 16:49, Magnus Hagander wrote:

Isn't there also the advantage that work split across two different
processes can use two different CPU cores? Or is that unlikely ever
to come into play here?

You would need one helluva I/O system to saturate even a single CPU, just by
doing write+fsync.

Magnus's point is valid. Linux can throttle I/O per node and
per process/task.
Since 2.6.37 (or 2.6.32?), I believe, it has become more tempting to
use per-cgroup I/O-per-second limits, and there is some promising work
on better per-process I/O bandwidth throttling.

IMO, splitting the type of I/O workload per process gives
administrators more control over the I/O limits they want to set
(and it may help the kernel choose a better strategy?)

That is a separate issue from being able to use different CPU cores. But
cool! I didn't know Linux can do that nowadays. That could be highly
useful, if you can put e.g autovacuum on a different cgroup from regular
backends.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#14

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Heikki Linnakangas (#9)

Re: Separating bgwriter and checkpointer

On Tue, Sep 20, 2011 at 9:35 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

All that said my question is which way is the code more legible and
easier to follow?

Hear hear. If we're going to give the bgwriter more responsibilities, this
might make sense even if it has no effect on performance.

I agree. I don't think this change needs to be justified on
performance grounds; there are enough collateral benefits to make it
worthwhile. If the checkpoint process handles all the stuff with
highly variable latency (i.e. fsyncs), then the background writer work
will happen more regularly and predictably. The code will also be
simpler, which I think will open up opportunities for additional
optimizations such as (perhaps) making the background writer only wake
up when there are dirty buffers to write, which ties in to
longstanding concerns about power consumption.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
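
The idea of a background writer that sleeps until there is work, rather than polling on a fixed delay, can be sketched with the classic self-pipe trick. The code below is only an illustration under stated assumptions: `DemoLatch`, `demo_latch_init`, `demo_latch_set` and `demo_latch_wait` are hypothetical names invented here, and this is not PostgreSQL's latch code nor anything from the patch. A producer writes one byte to a pipe to signal "there is something to do"; the waiting process blocks in poll() until that byte arrives or a timeout expires.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wakeup primitive built on a pipe (self-pipe trick). */
typedef struct
{
	int			rfd;			/* read end, polled by the sleeper */
	int			wfd;			/* write end, used to signal */
} DemoLatch;

/* Create the pipe and make both ends non-blocking. Returns 0 or -1. */
int
demo_latch_init(DemoLatch *latch)
{
	int			fds[2];

	assert(latch != NULL);
	if (pipe(fds) != 0)
		return -1;
	for (int i = 0; i < 2; i++)
	{
		int			flags = fcntl(fds[i], F_GETFL);

		if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
		{
			close(fds[0]);
			close(fds[1]);
			return -1;
		}
	}
	latch->rfd = fds[0];
	latch->wfd = fds[1];
	return 0;
}

/* Signal the sleeper. Repeated calls before a wait collapse into one wakeup. */
void
demo_latch_set(DemoLatch *latch)
{
	char		byte = 0;
	ssize_t		rc;

	/* If the pipe is full, a wakeup is already pending, so EAGAIN is harmless. */
	rc = write(latch->wfd, &byte, 1);
	(void) rc;
}

/*
 * Sleep until the latch is set or timeout_ms elapses.
 * Returns 1 if the latch was set, 0 on timeout, -1 on error.
 */
int
demo_latch_wait(DemoLatch *latch, int timeout_ms)
{
	struct pollfd pfd = {.fd = latch->rfd, .events = POLLIN};
	char		buf[64];
	int			rc;

	do
	{
		rc = poll(&pfd, 1, timeout_ms);
	} while (rc < 0 && errno == EINTR);

	if (rc <= 0)
		return rc;

	/* Drain every pending byte so the next wait really sleeps. */
	while (read(latch->rfd, buf, sizeof(buf)) > 0)
		;
	return 1;
}
```

In this sketch a writer-style loop would call `demo_latch_wait()` with a long timeout and do its work only when the call returns 1, so an idle system produces no periodic wakeups beyond the timeout itself.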

#15

Marti Raudsepp

marti@juffo.org

over 14 years ago

In reply to: Simon Riggs (#1)

Re: Separating bgwriter and checkpointer

On Fri, Sep 16, 2011 at 01:53, Simon Riggs <simon@2ndquadrant.com> wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

While you're already splitting up bgwriter, could there be any benefit
to spawning a separate bgwriter process for each tablespace?

If your database has one tablespace on a fast I/O system and another
on a slow one, the slow tablespace would also bog down background
writing for the fast tablespace. But I don't know whether that's
really a problem or not.

Regards,
Marti

#16

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Marti Raudsepp (#15)

Re: Separating bgwriter and checkpointer

On Tue, Sep 20, 2011 at 11:01 AM, Marti Raudsepp <marti@juffo.org> wrote:

On Fri, Sep 16, 2011 at 01:53, Simon Riggs <simon@2ndquadrant.com> wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

While you're already splitting up bgwriter, could there be any benefit
to spawning a separate bgwriter process for each tablespace?

If your database has one tablespace on a fast I/O system and another
on a slow one, the slow tablespace would also bog down background
writing for the fast tablespace. But I don't know whether that's
really a problem or not.

I doubt it. Most of the time the writes are going to be absorbed by
the OS write cache anyway.

I think there's probably more performance to be squeezed out of the
background writer, but maybe not that exact thing, and in any case it
seems like material for a separate patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17

Greg Smith

greg@2ndQuadrant.com

over 14 years ago

In reply to: Heikki Linnakangas (#9)

Re: Separating bgwriter and checkpointer

On 09/20/2011 09:35 AM, Heikki Linnakangas wrote:

Yeah, I was thinking of write()s, not fsyncs. I agree this might have
some effect during fsync phase.

Right; that's where the most serious problems seem to pop up now
anyway. All the testing I did earlier this year suggested Linux at least
is happy to do a granular fsync, and it can also use things like
barriers when appropriate to schedule I/O. The hope here is that the
background writer work to clean ahead of the strategy point is helpful
to backends, and that should keep going even during the sync
phase--which currently doesn't pause for anything else once it's
started. The cleaner writes should all queue up into RAM in a lazy way
rather than block the true I/O, which is being driven by sync calls.

There is some risk here that the cleaner writes happen faster than the
true rate at which backends really need buffers, since it has a
predictive component it can be wrong about. Those could in theory
result in the write cache filling faster than it would in the current
environment, such that writes truly block that would have been cached in
the current code. If you're that close to the edge though, backends
should really benefit from the cleaner--that same write done by a client
would turn into a serious stall. From that perspective, when things
have completely filled the write cache, any writes the cleaner can get
out of the way in advance of when a backend needs it should be the
biggest win most of the time.

--
Greg Smith 2ndQuadrant US greg@2ndQuadrant.com Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us

#18

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Fujii Masao (#2)

1 attachment(s)

Re: Separating bgwriter and checkpointer

On Fri, Sep 16, 2011 at 2:38 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Sep 16, 2011 at 7:53 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

This patch splits bgwriter into 2 processes: checkpointer and
bgwriter, seeking to avoid contentious changes. Additional changes are
expected in this release to build upon these changes for both new
processes, though this patch stands on its own as both a performance
vehicle and in some ways a refactoring to simplify the code.

I like this idea to simplify the code. How much performance gain can we
expect by this patch?

The current patch has a bug at shutdown I've not located yet, but it seems
likely to be a simple error. That is mainly because for personal reasons
I've not been able to work on the patch recently. I expect to be able
to fix that later in the CF.

You seem to have forgotten to include checkpointer.c and .h in the patch.

Original patch included here.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

bgwriter_split.v1a.patch (application/octet-stream)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index e3dd472..f8027c1 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -315,6 +315,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case BgWriterProcess:
 				statmsg = "writer process";
 				break;
+			case CheckpointerProcess:
+				statmsg = "checkpointer process";
+				break;
 			case WalWriterProcess:
 				statmsg = "wal writer process";
 				break;
@@ -419,6 +422,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			BackgroundWriterMain();
 			proc_exit(1);		/* should never return */
 
+		case CheckpointerProcess:
+			/* don't set signals, checkpointer has its own agenda */
+			CheckpointerMain();
+			proc_exit(1);		/* should never return */
+
 		case WalWriterProcess:
 			/* don't set signals, walwriter has its own agenda */
 			InitXLOGAccess();
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 0767e97..e7414d2 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
-	syslogger.o walwriter.o
+	syslogger.o walwriter.o checkpointer.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 2d0b639..e0f3167 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -10,20 +10,13 @@
  * still empowered to issue writes if the bgwriter fails to maintain enough
  * clean shared buffers.
  *
- * The bgwriter is also charged with handling all checkpoints.	It will
- * automatically dispatch a checkpoint after a certain amount of time has
- * elapsed since the last one, and it can be signaled to perform requested
- * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
- * so many WAL segments is implemented by having backends signal the bgwriter
- * when they fill WAL segments; the bgwriter itself doesn't watch for the
- * condition.)
+ * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
  *
  * The bgwriter is started by the postmaster as soon as the startup subprocess
  * finishes, or as soon as recovery begins if we are doing archive recovery.
  * It remains alive until the postmaster commands it to terminate.
- * Normal termination is by SIGUSR2, which instructs the bgwriter to execute
- * a shutdown checkpoint and then exit(0).	(All backends must be stopped
- * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
+ * Normal termination is by SIGUSR2, which instructs the bgwriter to exit(0).
+ * Emergency termination is by SIGQUIT; like any
  * backend, the bgwriter will simply abort and exit on SIGQUIT.
  *
  * If the bgwriter exits unexpectedly, the postmaster treats that the same
@@ -54,7 +47,6 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
-#include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -67,96 +59,15 @@
 #include "utils/resowner.h"
 
 
-/*----------
- * Shared memory area for communication between bgwriter and backends
- *
- * The ckpt counters allow backends to watch for completion of a checkpoint
- * request they send.  Here's how it works:
- *	* At start of a checkpoint, bgwriter reads (and clears) the request flags
- *	  and increments ckpt_started, while holding ckpt_lck.
- *	* On completion of a checkpoint, bgwriter sets ckpt_done to
- *	  equal ckpt_started.
- *	* On failure of a checkpoint, bgwriter increments ckpt_failed
- *	  and sets ckpt_done to equal ckpt_started.
- *
- * The algorithm for backends is:
- *	1. Record current values of ckpt_failed and ckpt_started, and
- *	   set request flags, while holding ckpt_lck.
- *	2. Send signal to request checkpoint.
- *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
- *	   begun since you started this algorithm (although *not* that it was
- *	   specifically initiated by your signal), and that it is using your flags.
- *	4. Record new value of ckpt_started.
- *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
- *	   arithmetic here in case counters wrap around.)  Now you know a
- *	   checkpoint has started and completed, but not whether it was
- *	   successful.
- *	6. If ckpt_failed is different from the originally saved value,
- *	   assume request failed; otherwise it was definitely successful.
- *
- * ckpt_flags holds the OR of the checkpoint request flags sent by all
- * requesting backends since the last checkpoint start.  The flags are
- * chosen so that OR'ing is the correct way to combine multiple requests.
- *
- * num_backend_writes is used to count the number of buffer writes performed
- * by non-bgwriter processes.  This counter should be wide enough that it
- * can't overflow during a single bgwriter cycle.  num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the background writer failed to absorb their request.
- *
- * The requests array holds fsync requests sent by backends and not yet
- * absorbed by the bgwriter.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by BgWriterCommLock.
- *----------
- */
-typedef struct
-{
-	RelFileNodeBackend rnode;
-	ForkNumber	forknum;
-	BlockNumber segno;			/* see md.c for special values */
-	/* might add a real request-type field later; not needed yet */
-} BgWriterRequest;
-
-typedef struct
-{
-	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
-
-	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
-
-	int			ckpt_started;	/* advances when checkpoint starts */
-	int			ckpt_done;		/* advances when checkpoint done */
-	int			ckpt_failed;	/* advances when checkpoint fails */
-
-	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
-
-	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
-	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
-
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
-	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
-} BgWriterShmemStruct;
-
-static BgWriterShmemStruct *BgWriterShmem;
-
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
-
 /*
  * GUC parameters
  */
 int			BgWriterDelay = 200;
-int			CheckPointTimeout = 300;
-int			CheckPointWarning = 30;
-double		CheckPointCompletionTarget = 0.5;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
 static volatile sig_atomic_t shutdown_requested = false;
 
 /*
@@ -164,29 +75,14 @@ static volatile sig_atomic_t shutdown_requested = false;
  */
 static bool am_bg_writer = false;
 
-static bool ckpt_active = false;
-
-/* these values are valid when ckpt_active is true: */
-static pg_time_t ckpt_start_time;
-static XLogRecPtr ckpt_start_recptr;
-static double ckpt_cached_elapsed;
-
-static pg_time_t last_checkpoint_time;
-static pg_time_t last_xlog_switch_time;
-
 /* Prototypes for private functions */
 
-static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
-static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
 static void bg_quickdie(SIGNAL_ARGS);
 static void BgSigHupHandler(SIGNAL_ARGS);
-static void ReqCheckpointHandler(SIGNAL_ARGS);
 static void ReqShutdownHandler(SIGNAL_ARGS);
 
 
@@ -202,7 +98,6 @@ BackgroundWriterMain(void)
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext bgwriter_context;
 
-	BgWriterShmem->bgwriter_pid = MyProcPid;
 	am_bg_writer = true;
 
 	/*
@@ -228,8 +123,8 @@ BackgroundWriterMain(void)
 	 * process to participate in ProcSignal signalling.
 	 */
 	pqsignal(SIGHUP, BgSigHupHandler);	/* set flag to read config file */
-	pqsignal(SIGINT, ReqCheckpointHandler);		/* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);			/* as of 9.2 no longer requests checkpoint */
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
@@ -249,11 +144,6 @@ BackgroundWriterMain(void)
 	sigdelset(&BlockSig, SIGQUIT);
 
 	/*
-	 * Initialize so that first time-driven event happens at the correct time.
-	 */
-	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
-
-	/*
 	 * Create a resource owner to keep track of our resources (currently only
 	 * buffer pins).
 	 */
@@ -305,20 +195,6 @@ BackgroundWriterMain(void)
 		AtEOXact_Files();
 		AtEOXact_HashTables(false);
 
-		/* Warn any waiting backends that the checkpoint failed. */
-		if (ckpt_active)
-		{
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_failed++;
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			ckpt_active = false;
-		}
-
 		/*
 		 * Now return to normal top-level context and clear ErrorContext for
 		 * next time.
@@ -361,19 +237,11 @@ BackgroundWriterMain(void)
 	if (RecoveryInProgress())
 		ThisTimeLineID = GetRecoveryTargetTLI();
 
-	/* Do this once before starting the loop, then just at SIGHUP time. */
-	SyncRepUpdateSyncStandbysDefined();
-
 	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		do_checkpoint = false;
-		int			flags = 0;
-		pg_time_t	now;
-		int			elapsed_secs;
-
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
@@ -381,23 +249,11 @@ BackgroundWriterMain(void)
 		if (!PostmasterIsAlive())
 			exit(1);
 
-		/*
-		 * Process any requests or signals received recently.
-		 */
-		AbsorbFsyncRequests();
-
 		if (got_SIGHUP)
 		{
 			got_SIGHUP = false;
 			ProcessConfigFile(PGC_SIGHUP);
 			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-		if (checkpoint_requested)
-		{
-			checkpoint_requested = false;
-			do_checkpoint = true;
-			BgWriterStats.m_requested_checkpoints++;
 		}
 		if (shutdown_requested)
 		{
@@ -406,142 +262,14 @@ BackgroundWriterMain(void)
 			 * control back to the sigsetjmp block above
 			 */
 			ExitOnAnyError = true;
-			/* Close down the database */
-			ShutdownXLOG(0, 0);
 			/* Normal exit from the bgwriter is here */
 			proc_exit(0);		/* done */
 		}
 
 		/*
-		 * Force a checkpoint if too much time has elapsed since the last one.
-		 * Note that we count a timed checkpoint in stats only when this
-		 * occurs without an external request, but we set the CAUSE_TIME flag
-		 * bit even if there is also an external request.
+		 * Do one cycle of dirty-buffer writing.
 		 */
-		now = (pg_time_t) time(NULL);
-		elapsed_secs = now - last_checkpoint_time;
-		if (elapsed_secs >= CheckPointTimeout)
-		{
-			if (!do_checkpoint)
-				BgWriterStats.m_timed_checkpoints++;
-			do_checkpoint = true;
-			flags |= CHECKPOINT_CAUSE_TIME;
-		}
-
-		/*
-		 * Do a checkpoint if requested, otherwise do one cycle of
-		 * dirty-buffer writing.
-		 */
-		if (do_checkpoint)
-		{
-			bool		ckpt_performed = false;
-			bool		do_restartpoint;
-
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			/*
-			 * Check if we should perform a checkpoint or a restartpoint. As a
-			 * side-effect, RecoveryInProgress() initializes TimeLineID if
-			 * it's not set yet.
-			 */
-			do_restartpoint = RecoveryInProgress();
-
-			/*
-			 * Atomically fetch the request flags to figure out what kind of a
-			 * checkpoint we should perform, and increase the started-counter
-			 * to acknowledge that we've started a new checkpoint.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			flags |= bgs->ckpt_flags;
-			bgs->ckpt_flags = 0;
-			bgs->ckpt_started++;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
-			/*
-			 * We will warn if (a) too soon since last checkpoint (whatever
-			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
-			 * since the last checkpoint start.  Note in particular that this
-			 * implementation will not generate warnings caused by
-			 * CheckPointTimeout < CheckPointWarning.
-			 */
-			if (!do_restartpoint &&
-				(flags & CHECKPOINT_CAUSE_XLOG) &&
-				elapsed_secs < CheckPointWarning)
-				ereport(LOG,
-						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
-				"checkpoints are occurring too frequently (%d seconds apart)",
-									   elapsed_secs,
-									   elapsed_secs),
-						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
-
-			/*
-			 * Initialize bgwriter-private variables used during checkpoint.
-			 */
-			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
-			ckpt_start_time = now;
-			ckpt_cached_elapsed = 0;
-
-			/*
-			 * Do the checkpoint.
-			 */
-			if (!do_restartpoint)
-			{
-				CreateCheckPoint(flags);
-				ckpt_performed = true;
-			}
-			else
-				ckpt_performed = CreateRestartPoint(flags);
-
-			/*
-			 * After any checkpoint, close all smgr files.	This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
-			 */
-			smgrcloseall();
-
-			/*
-			 * Indicate checkpoint completion to any waiting backends.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (ckpt_performed)
-			{
-				/*
-				 * Note we record the checkpoint start time not end time as
-				 * last_checkpoint_time.  This is so that time-driven
-				 * checkpoints happen at a predictable spacing.
-				 */
-				last_checkpoint_time = now;
-			}
-			else
-			{
-				/*
-				 * We were not able to perform the restartpoint (checkpoints
-				 * throw an ERROR in case of error).  Most likely because we
-				 * have not received any new checkpoint WAL records since the
-				 * last restartpoint. Try again in 15 s.
-				 */
-				last_checkpoint_time = now - CheckPointTimeout + 15;
-			}
-
-			ckpt_active = false;
-		}
-		else
-			BgBufferSync();
-
-		/* Check for archive_timeout and switch xlog files if necessary. */
-		CheckArchiveTimeout();
+		BgBufferSync();
 
 		/* Nap for the configured time. */
 		BgWriterNap();
@@ -549,61 +277,6 @@ BackgroundWriterMain(void)
 }
 
 /*
- * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
- *
- * This will switch to a new WAL file and force an archive file write
- * if any activity is recorded in the current WAL file, including just
- * a single checkpoint record.
- */
-static void
-CheckArchiveTimeout(void)
-{
-	pg_time_t	now;
-	pg_time_t	last_time;
-
-	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
-		return;
-
-	now = (pg_time_t) time(NULL);
-
-	/* First we do a quick check using possibly-stale local state. */
-	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
-		return;
-
-	/*
-	 * Update local state ... note that last_xlog_switch_time is the last time
-	 * a switch was performed *or requested*.
-	 */
-	last_time = GetLastSegSwitchTime();
-
-	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
-
-	/* Now we can do the real check */
-	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
-	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
-		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
-		 */
-		if ((switchpoint.xrecoff % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
-
-		/*
-		 * Update state in any case, so we don't retry constantly when the
-		 * system is idle.
-		 */
-		last_xlog_switch_time = now;
-	}
-}
-
-/*
  * BgWriterNap -- Nap for the configured time or until a signal is received.
  */
 static void
@@ -624,185 +297,24 @@ BgWriterNap(void)
 	 * respond reasonably promptly when someone signals us, break down the
 	 * sleep into 1-second increments, and check for interrupts after each
 	 * nap.
-	 *
-	 * We absorb pending requests after each short sleep.
 	 */
-	if (bgwriter_lru_maxpages > 0 || ckpt_active)
+	if (bgwriter_lru_maxpages > 0)
 		udelay = BgWriterDelay * 1000L;
-	else if (XLogArchiveTimeout > 0)
-		udelay = 1000000L;		/* One second */
 	else
 		udelay = 10000000L;		/* Ten seconds */
 
 	while (udelay > 999999L)
 	{
-		if (got_SIGHUP || shutdown_requested ||
-		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
+		if (got_SIGHUP || shutdown_requested)
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
 		udelay -= 1000000L;
 	}
 
-	if (!(got_SIGHUP || shutdown_requested ||
-	  (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)))
+	if (!(got_SIGHUP || shutdown_requested))
 		pg_usleep(udelay);
 }
 
-/*
- * Returns true if an immediate checkpoint request is pending.	(Note that
- * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
- * there is one pending behind it.)
- */
-static bool
-ImmediateCheckpointRequested(void)
-{
-	if (checkpoint_requested)
-	{
-		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-		/*
-		 * We don't need to acquire the ckpt_lck in this case because we're
-		 * only looking at a single flag bit.
-		 */
-		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
-			return true;
-	}
-	return false;
-}
-
-/*
- * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
- *
- * This function is called after each page write performed by BufferSync().
- * It is responsible for keeping the bgwriter's normal activities in
- * progress during a long checkpoint, and for throttling BufferSync()'s
- * write rate to hit checkpoint_completion_target.
- *
- * The checkpoint request flags should be passed in; currently the only one
- * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
- *
- * 'progress' is an estimate of how much of the work has been done, as a
- * fraction between 0.0 meaning none, and 1.0 meaning all done.
- */
-void
-CheckpointWriteDelay(int flags, double progress)
-{
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
-	/* Do nothing if checkpoint is being executed by non-bgwriter process */
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * Perform the usual bgwriter duties and take a nap, unless we're behind
-	 * schedule, in which case we just try to catch up as quickly as possible.
-	 */
-	if (!(flags & CHECKPOINT_IMMEDIATE) &&
-		!shutdown_requested &&
-		!ImmediateCheckpointRequested() &&
-		IsCheckpointOnSchedule(progress))
-	{
-		if (got_SIGHUP)
-		{
-			got_SIGHUP = false;
-			ProcessConfigFile(PGC_SIGHUP);
-			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-
-		BgBufferSync();
-		CheckArchiveTimeout();
-		BgWriterNap();
-	}
-	else if (--absorb_counter <= 0)
-	{
-		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
-		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-	}
-}
-
-/*
- * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- *		 in time?
- *
- * Compares the current progress against the time/segments elapsed since last
- * checkpoint, and returns true if the progress we've made this far is greater
- * than the elapsed time/segments.
- */
-static bool
-IsCheckpointOnSchedule(double progress)
-{
-	XLogRecPtr	recptr;
-	struct timeval now;
-	double		elapsed_xlogs,
-				elapsed_time;
-
-	Assert(ckpt_active);
-
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
-
-	/*
-	 * Check against the cached value first. Only do the more expensive
-	 * calculations once we reach the target previously calculated. Since
-	 * neither time or WAL insert pointer moves backwards, a freshly
-	 * calculated value can only be greater than or equal to the cached value.
-	 */
-	if (progress < ckpt_cached_elapsed)
-		return false;
-
-	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
-	 *
-	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
-	 */
-	if (!RecoveryInProgress())
-	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
-	}
-
-	/*
-	 * Check progress against time elapsed and checkpoint_timeout.
-	 */
-	gettimeofday(&now, NULL);
-	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
-					now.tv_usec / 1000000.0) / CheckPointTimeout;
-
-	if (progress < elapsed_time)
-	{
-		ckpt_cached_elapsed = elapsed_time;
-		return false;
-	}
-
-	/* It looks like we're on schedule. */
-	return true;
-}
-
-
 /* --------------------------------
  *		signal handler routines
  * --------------------------------
@@ -847,441 +359,9 @@ BgSigHupHandler(SIGNAL_ARGS)
 	got_SIGHUP = true;
 }
 
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	checkpoint_requested = true;
-}
-
-/* SIGUSR2: set flag to run a shutdown checkpoint and exit */
+/* SIGUSR2: set flag to exit; checkpointer does the shutdown checkpoint */
 static void
 ReqShutdownHandler(SIGNAL_ARGS)
 {
 	shutdown_requested = true;
 }
-
-
-/* --------------------------------
- *		communication with backends
- * --------------------------------
- */
-
-/*
- * BgWriterShmemSize
- *		Compute space needed for bgwriter-related shared memory
- */
-Size
-BgWriterShmemSize(void)
-{
-	Size		size;
-
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
-	size = offsetof(BgWriterShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
-
-	return size;
-}
-
-/*
- * BgWriterShmemInit
- *		Allocate and initialize bgwriter-related shared memory
- */
-void
-BgWriterShmemInit(void)
-{
-	bool		found;
-
-	BgWriterShmem = (BgWriterShmemStruct *)
-		ShmemInitStruct("Background Writer Data",
-						BgWriterShmemSize(),
-						&found);
-
-	if (!found)
-	{
-		/* First time through, so initialize */
-		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
-		SpinLockInit(&BgWriterShmem->ckpt_lck);
-		BgWriterShmem->max_requests = NBuffers;
-	}
-}
-
-/*
- * RequestCheckpoint
- *		Called in backend processes to request a checkpoint
- *
- * flags is a bitwise OR of the following:
- *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
- *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
- *		ignoring checkpoint_completion_target parameter.
- *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
- *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
- *		just signal bgwriter to do it, and return).
- *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
- *		(This affects logging, and in particular enables CheckPointWarning.)
- */
-void
-RequestCheckpoint(int flags)
-{
-	/* use volatile pointer to prevent code rearrangement */
-	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-	int			ntries;
-	int			old_failed,
-				old_started;
-
-	/*
-	 * If in a standalone backend, just do it ourselves.
-	 */
-	if (!IsPostmasterEnvironment)
-	{
-		/*
-		 * There's no point in doing slow checkpoints in a standalone backend,
-		 * because there's no other backends the checkpoint could disrupt.
-		 */
-		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
-
-		/*
-		 * After any checkpoint, close all smgr files.	This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
-
-		return;
-	}
-
-	/*
-	 * Atomically set the request flags, and take a snapshot of the counters.
-	 * When we see ckpt_started > old_started, we know the flags we set here
-	 * have been seen by bgwriter.
-	 *
-	 * Note that we OR the flags with any existing flags, to avoid overriding
-	 * a "stronger" request by another backend.  The flag senses must be
-	 * chosen to make this work!
-	 */
-	SpinLockAcquire(&bgs->ckpt_lck);
-
-	old_failed = bgs->ckpt_failed;
-	old_started = bgs->ckpt_started;
-	bgs->ckpt_flags |= flags;
-
-	SpinLockRelease(&bgs->ckpt_lck);
-
-	/*
-	 * Send signal to request checkpoint.  It's possible that the bgwriter
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  Also, if not told to wait for the checkpoint to
-	 * occur, we consider failure to send the signal to be nonfatal and merely
-	 * LOG it.
-	 */
-	for (ntries = 0;; ntries++)
-	{
-		if (BgWriterShmem->bgwriter_pid == 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-				"could not request checkpoint because bgwriter not running");
-				break;
-			}
-		}
-		else if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
-				break;
-			}
-		}
-		else
-			break;				/* signal sent successfully */
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-	}
-
-	/*
-	 * If requested, wait for completion.  We detect completion according to
-	 * the algorithm given above.
-	 */
-	if (flags & CHECKPOINT_WAIT)
-	{
-		int			new_started,
-					new_failed;
-
-		/* Wait for a new checkpoint to start. */
-		for (;;)
-		{
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_started = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_started != old_started)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		/*
-		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
-		 */
-		for (;;)
-		{
-			int			new_done;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_done = bgs->ckpt_done;
-			new_failed = bgs->ckpt_failed;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_done - new_started >= 0)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		if (new_failed != old_failed)
-			ereport(ERROR,
-					(errmsg("checkpoint request failed"),
-					 errhint("Consult recent messages in the server log for details.")));
-	}
-}
-
-/*
- * ForwardFsyncRequest
- *		Forward a file-fsync request from a backend to the bgwriter
- *
- * Whenever a backend is compelled to write directly to a relation
- * (which should be seldom, if the bgwriter is getting its job done),
- * the backend calls this routine to pass over knowledge that the relation
- * is dirty and must be fsync'd before next checkpoint.  We also use this
- * opportunity to count such writes for statistical purposes.
- *
- * segno specifies which segment (not block!) of the relation needs to be
- * fsync'd.  (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
- *
- * To avoid holding the lock for longer than necessary, we normally write
- * to the requests[] queue without checking for duplicates.  The bgwriter
- * will have to eliminate dups internally anyway.  However, if we discover
- * that the queue is full, we make a pass over the entire queue to compact
- * it.	This is somewhat expensive, but the alternative is for the backend
- * to perform its own fsync, which is far more expensive in practice.  It
- * is theoretically possible a backend fsync might still be necessary, if
- * the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
- */
-bool
-ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
-					BlockNumber segno)
-{
-	BgWriterRequest *request;
-
-	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
-
-	if (am_bg_writer)
-		elog(ERROR, "ForwardFsyncRequest must not be called in bgwriter");
-
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Count all backend writes regardless of if they fit in the queue */
-	BgWriterShmem->num_backend_writes++;
-
-	/*
-	 * If the background writer isn't running or the request queue is full,
-	 * the backend will have to perform its own fsync request.	But before
-	 * forcing that to happen, we can try to compact the background writer
-	 * request queue.
-	 */
-	if (BgWriterShmem->bgwriter_pid == 0 ||
-		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
-		 && !CompactBgwriterRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		BgWriterShmem->num_backend_fsync++;
-		LWLockRelease(BgWriterCommLock);
-		return false;
-	}
-	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-	LWLockRelease(BgWriterCommLock);
-	return true;
-}
-
-/*
- * CompactBgwriterRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.	Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
- *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
- */
-static bool
-CompactBgwriterRequestQueue()
-{
-	struct BgWriterSlotMapping
-	{
-		BgWriterRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold BgWriterCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(BgWriterCommLock));
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(BgWriterRequest);
-	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
-	ctl.hash = tag_hash;
-	htab = hash_create("CompactBgwriterRequestQueue",
-					   BgWriterShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_FUNCTION);
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		BgWriterRequest *request;
-		struct BgWriterSlotMapping *slotmap;
-		bool		found;
-
-		request = &BgWriterShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			skip_slot[slotmap->slot] = true;
-			++num_skipped;
-		}
-		slotmap->slot = n;
-	}
-
-	/* Done with the hash table. */
-	hash_destroy(htab);
-
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
-	{
-		pfree(skip_slot);
-		return false;
-	}
-
-	/* We found some duplicates; remove them. */
-	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		if (skip_slot[n])
-			continue;
-		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
-	}
-	ereport(DEBUG1,
-	   (errmsg("compacted fsync request queue from %d entries to %d entries",
-			   BgWriterShmem->num_requests, preserve_count)));
-	BgWriterShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
-}
-
-/*
- * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
- *
- * This is exported because it must be called during CreateCheckPoint;
- * we have to be sure we have accepted all pending requests just before
- * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
- * non-bgwriter processes, do nothing if not bgwriter.
- */
-void
-AbsorbFsyncRequests(void)
-{
-	BgWriterRequest *requests = NULL;
-	BgWriterRequest *request;
-	int			n;
-
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * We have to PANIC if we fail to absorb all the pending requests (eg,
-	 * because our hashtable runs out of memory).  This is because the system
-	 * cannot run safely if we are unable to fsync what we have been told to
-	 * fsync.  Fortunately, the hashtable is so small that the problem is
-	 * quite unlikely to arise in practice.
-	 */
-	START_CRIT_SECTION();
-
-	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array.
-	 */
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Transfer write count into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
-
-	BgWriterShmem->num_backend_writes = 0;
-	BgWriterShmem->num_backend_fsync = 0;
-
-	n = BgWriterShmem->num_requests;
-	if (n > 0)
-	{
-		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
-		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
-	}
-	BgWriterShmem->num_requests = 0;
-
-	LWLockRelease(BgWriterCommLock);
-
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
-
-	if (requests)
-		pfree(requests);
-
-	END_CRIT_SECTION();
-}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
new file mode 100644
index 0000000..0ba83bb
--- /dev/null
+++ b/src/backend/postmaster/checkpointer.c
@@ -0,0 +1,1229 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.c
+ *
+ * The checkpointer is new as of Postgres 9.2.	It handles all checkpoints.
+ * Checkpoints are automatically dispatched after a certain amount of time has
+ * elapsed since the last one, and it can be signaled to perform requested
+ * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
+ * so many WAL segments is implemented by having backends signal when they
+ * fill WAL segments; the checkpointer itself doesn't watch for the
+ * condition.)
+ *
+ * The checkpointer is started by the postmaster as soon as the startup
+ * subprocess finishes, or as soon as recovery begins if we are doing archive
+ * recovery.  It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGUSR2, which instructs the checkpointer to
+ * execute a shutdown checkpoint and then exit(0).	(All backends must be
+ * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
+ * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ *
+ * If the checkpointer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.  (Even if
+ * shared memory isn't corrupted, we have lost information about which
+ * files need to be fsync'd for the next checkpoint, and so a system
+ * restart needs to be forced.)
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/checkpointer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+/*----------
+ * Shared memory area for communication between checkpointer and backends
+ *
+ * The ckpt counters allow backends to watch for completion of a checkpoint
+ * request they send.  Here's how it works:
+ *	* At start of a checkpoint, checkpointer reads (and clears) the request flags
+ *	  and increments ckpt_started, while holding ckpt_lck.
+ *	* On completion of a checkpoint, checkpointer sets ckpt_done to
+ *	  equal ckpt_started.
+ *	* On failure of a checkpoint, checkpointer increments ckpt_failed
+ *	  and sets ckpt_done to equal ckpt_started.
+ *
+ * The algorithm for backends is:
+ *	1. Record current values of ckpt_failed and ckpt_started, and
+ *	   set request flags, while holding ckpt_lck.
+ *	2. Send signal to request checkpoint.
+ *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
+ *	   begun since you started this algorithm (although *not* that it was
+ *	   specifically initiated by your signal), and that it is using your flags.
+ *	4. Record new value of ckpt_started.
+ *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
+ *	   arithmetic here in case counters wrap around.)  Now you know a
+ *	   checkpoint has started and completed, but not whether it was
+ *	   successful.
+ *	6. If ckpt_failed is different from the originally saved value,
+ *	   assume request failed; otherwise it was definitely successful.
+ *
+ * ckpt_flags holds the OR of the checkpoint request flags sent by all
+ * requesting backends since the last checkpoint start.  The flags are
+ * chosen so that OR'ing is the correct way to combine multiple requests.
+ *
+ * num_backend_writes is used to count the number of buffer writes performed
+ * by non-bgwriter processes.  This counter should be wide enough that it
+ * can't overflow during a single bgwriter cycle.  num_backend_fsync
+ * counts the subset of those writes that also had to do their own fsync,
+ * because the checkpointer failed to absorb their request.
+ *
+ * The requests array holds fsync requests sent by backends and not yet
+ * absorbed by the checkpointer.
+ *
+ * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * the requests fields are protected by BgWriterCommLock.
+ *----------
+ */
+typedef struct
+{
+	RelFileNodeBackend rnode;
+	ForkNumber	forknum;
+	BlockNumber segno;			/* see md.c for special values */
+	/* might add a real request-type field later; not needed yet */
+} BgWriterRequest;
+
+typedef struct
+{
+	pid_t		checkpointer_pid;	/* PID of checkpointer (0 if not started) */
+
+	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
+
+	int			ckpt_started;	/* advances when checkpoint starts */
+	int			ckpt_done;		/* advances when checkpoint done */
+	int			ckpt_failed;	/* advances when checkpoint fails */
+
+	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
+
+	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
+	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
+
+	int			num_requests;	/* current # of requests */
+	int			max_requests;	/* allocated array size */
+	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
+} BgWriterShmemStruct;
+
+static BgWriterShmemStruct *BgWriterShmem;
+
+/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
+#define WRITES_PER_ABSORB		1000
+
+/*
+ * GUC parameters
+ */
+int			CheckPointTimeout = 300;
+int			CheckPointWarning = 30;
+double		CheckPointCompletionTarget = 0.5;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t checkpoint_requested = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/*
+ * Private state
+ */
+static bool am_checkpointer = false;
+
+static bool ckpt_active = false;
+
+/* these values are valid when ckpt_active is true: */
+static pg_time_t ckpt_start_time;
+static XLogRecPtr ckpt_start_recptr;
+static double ckpt_cached_elapsed;
+
+static pg_time_t last_checkpoint_time;
+static pg_time_t last_xlog_switch_time;
+
+/* Prototypes for private functions */
+
+static void CheckArchiveTimeout(void);
+static bool IsCheckpointOnSchedule(double progress);
+static bool ImmediateCheckpointRequested(void);
+static bool CompactCheckpointerRequestQueue(void);
+
+/* Signal handlers */
+
+static void chkpt_quickdie(SIGNAL_ARGS);
+static void ChkptSigHupHandler(SIGNAL_ARGS);
+static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for checkpointer process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CheckpointerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext checkpointer_context;
+
+	BgWriterShmem->checkpointer_pid = MyProcPid;
+	am_checkpointer = true;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.	(checkpointer probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us
+	 *
+	 * Note: we deliberately ignore SIGTERM, because during a standard Unix
+	 * system shutdown cycle, init will SIGTERM all processes at once.	We
+	 * want to wait for the backends to exit, whereupon the postmaster will
+	 * tell us it's okay to shut down (via SIGUSR2).
+	 *
+	 * SIGUSR1 is presently unused; keep it spare in case someday we want this
+	 * process to participate in ProcSignal signalling.
+	 */
+	pqsignal(SIGHUP, ChkptSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, ReqCheckpointHandler);	/* request checkpoint */
+	pqsignal(SIGTERM, SIG_IGN);				/* ignore SIGTERM */
+	pqsignal(SIGQUIT, chkpt_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
+	pqsignal(SIGUSR2, ReqShutdownHandler);		/* request shutdown */
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+	/*
+	 * Initialize so that first time-driven event happens at the correct time.
+	 */
+	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Create a resource owner to keep track of our resources (currently only
+	 * buffer pins).
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Checkpointer");
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  Formerly this code just ran in
+	 * TopMemoryContext, but resetting that would be a really bad idea.
+	 */
+	checkpointer_context = AllocSetContextCreate(TopMemoryContext,
+											 "Checkpointer",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(checkpointer_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().	We don't have very many resources to worry
+		 * about in checkpointer, but we do have LWLocks, buffers, and temp files.
+		 */
+		LWLockReleaseAll();
+		AbortBufferIO();
+		UnlockBuffers();
+		/* buffer pins are released here: */
+		ResourceOwnerRelease(CurrentResourceOwner,
+							 RESOURCE_RELEASE_BEFORE_LOCKS,
+							 false, true);
+		/* we needn't bother with the other ResourceOwnerRelease phases */
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+		AtEOXact_HashTables(false);
+
+		/* Warn any waiting backends that the checkpoint failed. */
+		if (ckpt_active)
+		{
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_failed++;
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			ckpt_active = false;
+		}
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(checkpointer_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(checkpointer_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  A write error is likely
+		 * to be repeated, and we don't want to be filling the error logs as
+		 * fast as we can.
+		 */
+		pg_usleep(1000000L);
+
+		/*
+		 * Close all open files after any error.  This is helpful on Windows,
+		 * where holding deleted files open causes various strange errors.
+		 * It's not clear we need it elsewhere, but shouldn't hurt.
+		 */
+		smgrcloseall();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	/*
+	 * Use the recovery target timeline ID during recovery
+	 */
+	if (RecoveryInProgress())
+		ThisTimeLineID = GetRecoveryTargetTLI();
+
+	/* Do this once before starting the loop, then just at SIGHUP time. */
+	SyncRepUpdateSyncStandbysDefined();
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		bool		do_checkpoint = false;
+		int			flags = 0;
+		pg_time_t	now;
+		int			elapsed_secs;
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (!PostmasterIsAlive())
+			exit(1);
+
+		/*
+		 * Process any requests or signals received recently.
+		 */
+		AbsorbFsyncRequests();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+		if (checkpoint_requested)
+		{
+			checkpoint_requested = false;
+			do_checkpoint = true;
+			BgWriterStats.m_requested_checkpoints++;
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Close down the database */
+			ShutdownXLOG(0, 0);
+			/* Normal exit from the checkpointer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Force a checkpoint if too much time has elapsed since the last one.
+		 * Note that we count a timed checkpoint in stats only when this
+		 * occurs without an external request, but we set the CAUSE_TIME flag
+		 * bit even if there is also an external request.
+		 */
+		now = (pg_time_t) time(NULL);
+		elapsed_secs = now - last_checkpoint_time;
+		if (elapsed_secs >= CheckPointTimeout)
+		{
+			if (!do_checkpoint)
+				BgWriterStats.m_timed_checkpoints++;
+			do_checkpoint = true;
+			flags |= CHECKPOINT_CAUSE_TIME;
+		}
+
+		/*
+		 * Do a checkpoint if requested.
+		 */
+		if (do_checkpoint)
+		{
+			bool		ckpt_performed = false;
+			bool		do_restartpoint;
+
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			/*
+			 * Check if we should perform a checkpoint or a restartpoint. As a
+			 * side-effect, RecoveryInProgress() initializes TimeLineID if
+			 * it's not set yet.
+			 */
+			do_restartpoint = RecoveryInProgress();
+
+			/*
+			 * Atomically fetch the request flags to figure out what kind of a
+			 * checkpoint we should perform, and increase the started-counter
+			 * to acknowledge that we've started a new checkpoint.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			flags |= bgs->ckpt_flags;
+			bgs->ckpt_flags = 0;
+			bgs->ckpt_started++;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			/*
+			 * The end-of-recovery checkpoint is a real checkpoint that's
+			 * performed while we're still in recovery.
+			 */
+			if (flags & CHECKPOINT_END_OF_RECOVERY)
+				do_restartpoint = false;
+
+			/*
+			 * We will warn if (a) too soon since last checkpoint (whatever
+			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
+			 * since the last checkpoint start.  Note in particular that this
+			 * implementation will not generate warnings caused by
+			 * CheckPointTimeout < CheckPointWarning.
+			 */
+			if (!do_restartpoint &&
+				(flags & CHECKPOINT_CAUSE_XLOG) &&
+				elapsed_secs < CheckPointWarning)
+				ereport(LOG,
+						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
+				"checkpoints are occurring too frequently (%d seconds apart)",
+									   elapsed_secs,
+									   elapsed_secs),
+						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+
+			/*
+			 * Initialize checkpointer-private variables used during checkpoint.
+			 */
+			ckpt_active = true;
+			if (!do_restartpoint)
+				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_time = now;
+			ckpt_cached_elapsed = 0;
+
+			/*
+			 * Do the checkpoint.
+			 */
+			if (!do_restartpoint)
+			{
+				CreateCheckPoint(flags);
+				ckpt_performed = true;
+			}
+			else
+				ckpt_performed = CreateRestartPoint(flags);
+
+			/*
+			 * After any checkpoint, close all smgr files.	This is so we
+			 * won't hang onto smgr references to deleted files indefinitely.
+			 */
+			smgrcloseall();
+
+			/*
+			 * Indicate checkpoint completion to any waiting backends.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (ckpt_performed)
+			{
+				/*
+				 * Note we record the checkpoint start time not end time as
+				 * last_checkpoint_time.  This is so that time-driven
+				 * checkpoints happen at a predictable spacing.
+				 */
+				last_checkpoint_time = now;
+			}
+			else
+			{
+				/*
+				 * We were not able to perform the restartpoint (checkpoints
+				 * throw an ERROR in case of error).  Most likely because we
+				 * have not received any new checkpoint WAL records since the
+				 * last restartpoint. Try again in 15 s.
+				 */
+				last_checkpoint_time = now - CheckPointTimeout + 15;
+			}
+
+			ckpt_active = false;
+		}
+
+		/* Check for archive_timeout and switch xlog files if necessary. */
+		CheckArchiveTimeout();
+	}
+}
+
+/*
+ * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
+ *
+ * This will switch to a new WAL file and force an archive file write
+ * if any activity is recorded in the current WAL file, including just
+ * a single checkpoint record.
+ */
+static void
+CheckArchiveTimeout(void)
+{
+	pg_time_t	now;
+	pg_time_t	last_time;
+
+	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
+		return;
+
+	now = (pg_time_t) time(NULL);
+
+	/* First we do a quick check using possibly-stale local state. */
+	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
+		return;
+
+	/*
+	 * Update local state ... note that last_xlog_switch_time is the last time
+	 * a switch was performed *or requested*.
+	 */
+	last_time = GetLastSegSwitchTime();
+
+	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
+
+	/* Now we can do the real check */
+	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
+	{
+		XLogRecPtr	switchpoint;
+
+		/* OK, it's time to switch */
+		switchpoint = RequestXLogSwitch();
+
+		/*
+		 * If the returned pointer points exactly to a segment boundary,
+		 * assume nothing happened.
+		 */
+		if ((switchpoint.xrecoff % XLogSegSize) != 0)
+			ereport(DEBUG1,
+				(errmsg("transaction log switch forced (archive_timeout=%d)",
+						XLogArchiveTimeout)));
+
+		/*
+		 * Update state in any case, so we don't retry constantly when the
+		 * system is idle.
+		 */
+		last_xlog_switch_time = now;
+	}
+}
+
+/*
+ * Returns true if an immediate checkpoint request is pending.	(Note that
+ * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
+ * there is one pending behind it.)
+ */
+static bool
+ImmediateCheckpointRequested(void)
+{
+	if (checkpoint_requested)
+	{
+		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+		/*
+		 * We don't need to acquire the ckpt_lck in this case because we're
+		 * only looking at a single flag bit.
+		 */
+		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * CheckpointWriteDelay -- control rate of checkpoint
+ *
+ * This function is called after each page write performed by BufferSync().
+ * It is responsible for throttling BufferSync()'s write rate to hit
+ * checkpoint_completion_target.
+ *
+ * The checkpoint request flags should be passed in; currently the only one
+ * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
+ *
+ * 'progress' is an estimate of how much of the work has been done, as a
+ * fraction between 0.0 meaning none, and 1.0 meaning all done.
+ */
+void
+CheckpointWriteDelay(int flags, double progress)
+{
+	static int	absorb_counter = WRITES_PER_ABSORB;
+
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * Perform the usual duties and take a nap, unless we're behind
+	 * schedule, in which case we just try to catch up as quickly as possible.
+	 */
+	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!shutdown_requested &&
+		!ImmediateCheckpointRequested() &&
+		IsCheckpointOnSchedule(progress))
+	{
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+
+		CheckArchiveTimeout();
+
+		/*
+		 * Checkpoint sleep used to be connected to bgwriter_delay at 200ms.
+		 * That resulted in more frequent wakeups if not much work to do.
+		 * Checkpointer and bgwriter are no longer related so take the Big Sleep.
+		 */
+		pg_usleep(500000L);
+	}
+	else if (--absorb_counter <= 0)
+	{
+		/*
+		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+		 * operations even when we don't sleep, to prevent overflow of the
+		 * fsync request queue.
+		 */
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+	}
+}
+
+/*
+ * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+ *		 in time?
+ *
+ * Compares the current progress against the time/segments elapsed since last
+ * checkpoint, and returns true if the progress we've made so far is greater
+ * than the elapsed time/segments.
+ */
+static bool
+IsCheckpointOnSchedule(double progress)
+{
+	XLogRecPtr	recptr;
+	struct timeval now;
+	double		elapsed_xlogs,
+				elapsed_time;
+
+	Assert(ckpt_active);
+
+	/* Scale progress according to checkpoint_completion_target. */
+	progress *= CheckPointCompletionTarget;
+
+	/*
+	 * Check against the cached value first. Only do the more expensive
+	 * calculations once we reach the target previously calculated. Since
+	 * neither time nor the WAL insert pointer moves backwards, a freshly
+	 * calculated value can only be greater than or equal to the cached value.
+	 */
+	if (progress < ckpt_cached_elapsed)
+		return false;
+
+	/*
+	 * Check progress against WAL segments written and checkpoint_segments.
+	 *
+	 * We compare the current WAL insert location against the location
+	 * computed before calling CreateCheckPoint. The code in XLogInsert that
+	 * actually triggers a checkpoint when checkpoint_segments is exceeded
+	 * compares against RedoRecptr, so this is not completely accurate.
+	 * However, it's good enough for our purposes; we're only calculating an
+	 * estimate anyway.
+	 */
+	if (!RecoveryInProgress())
+	{
+		recptr = GetInsertRecPtr();
+		elapsed_xlogs =
+			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+			CheckPointSegments;
+
+		if (progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+			return false;
+		}
+	}
+
+	/*
+	 * Check progress against time elapsed and checkpoint_timeout.
+	 */
+	gettimeofday(&now, NULL);
+	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
+					now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+	if (progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+		return false;
+	}
+
+	/* It looks like we're on schedule. */
+	return true;
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * chkpt_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+chkpt_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).	This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ChkptSigHupHandler(SIGNAL_ARGS)
+{
+	got_SIGHUP = true;
+}
+
+/* SIGINT: set flag to run a normal checkpoint right away */
+static void
+ReqCheckpointHandler(SIGNAL_ARGS)
+{
+	checkpoint_requested = true;
+}
+
+/* SIGUSR2: set flag to run a shutdown checkpoint and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	shutdown_requested = true;
+}
+
+
+/* --------------------------------
+ *		communication with backends
+ * --------------------------------
+ */
+
+/*
+ * BgWriterShmemSize
+ *		Compute space needed for bgwriter-related shared memory
+ */
+Size
+BgWriterShmemSize(void)
+{
+	Size		size;
+
+	/*
+	 * Currently, the size of the requests[] array is arbitrarily set equal to
+	 * NBuffers.  This may prove too large or small ...
+	 */
+	size = offsetof(BgWriterShmemStruct, requests);
+	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
+
+	return size;
+}
+
+/*
+ * BgWriterShmemInit
+ *		Allocate and initialize bgwriter-related shared memory
+ */
+void
+BgWriterShmemInit(void)
+{
+	bool		found;
+
+	BgWriterShmem = (BgWriterShmemStruct *)
+		ShmemInitStruct("Background Writer Data",
+						BgWriterShmemSize(),
+						&found);
+
+	if (!found)
+	{
+		/* First time through, so initialize */
+		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
+		SpinLockInit(&BgWriterShmem->ckpt_lck);
+		BgWriterShmem->max_requests = NBuffers;
+	}
+}
+
+/*
+ * RequestCheckpoint
+ *		Called in backend processes to request a checkpoint
+ *
+ * flags is a bitwise OR of the following:
+ *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
+ *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
+ *		ignoring checkpoint_completion_target parameter.
+ *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
+ *		CHECKPOINT_END_OF_RECOVERY).
+ *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
+ *		just signal checkpointer to do it, and return).
+ *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
+ *		(This affects logging, and in particular enables CheckPointWarning.)
+ */
+void
+RequestCheckpoint(int flags)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+	int			ntries;
+	int			old_failed,
+				old_started;
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		/*
+		 * There's no point in doing slow checkpoints in a standalone backend,
+		 * because there are no other backends the checkpoint could disrupt.
+		 */
+		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
+
+		/*
+		 * After any checkpoint, close all smgr files.	This is so we won't
+		 * hang onto smgr references to deleted files indefinitely.
+		 */
+		smgrcloseall();
+
+		return;
+	}
+
+	/*
+	 * Atomically set the request flags, and take a snapshot of the counters.
+	 * When we see ckpt_started > old_started, we know the flags we set here
+	 * have been seen by the checkpointer.
+	 *
+	 * Note that we OR the flags with any existing flags, to avoid overriding
+	 * a "stronger" request by another backend.  The flag senses must be
+	 * chosen to make this work!
+	 */
+	SpinLockAcquire(&bgs->ckpt_lck);
+
+	old_failed = bgs->ckpt_failed;
+	old_started = bgs->ckpt_started;
+	bgs->ckpt_flags |= flags;
+
+	SpinLockRelease(&bgs->ckpt_lck);
+
+	/*
+	 * Send signal to request checkpoint.  It's possible that the checkpointer
+	 * hasn't started yet, or is in process of restarting, so we will retry a
+	 * few times if needed.  Also, if not told to wait for the checkpoint to
+	 * occur, we consider failure to send the signal to be nonfatal and merely
+	 * LOG it.
+	 */
+	for (ntries = 0;; ntries++)
+	{
+		if (BgWriterShmem->checkpointer_pid == 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+				"could not request checkpoint because checkpointer not running");
+				break;
+			}
+		}
+		else if (kill(BgWriterShmem->checkpointer_pid, SIGINT) != 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+					 "could not signal for checkpoint: %m");
+				break;
+			}
+		}
+		else
+			break;				/* signal sent successfully */
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
+	}
+
+	/*
+	 * If requested, wait for completion.  We detect completion according to
+	 * the algorithm given above.
+	 */
+	if (flags & CHECKPOINT_WAIT)
+	{
+		int			new_started,
+					new_failed;
+
+		/* Wait for a new checkpoint to start. */
+		for (;;)
+		{
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_started = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_started != old_started)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		/*
+		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
+		 */
+		for (;;)
+		{
+			int			new_done;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_done = bgs->ckpt_done;
+			new_failed = bgs->ckpt_failed;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_done - new_started >= 0)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		if (new_failed != old_failed)
+			ereport(ERROR,
+					(errmsg("checkpoint request failed"),
+					 errhint("Consult recent messages in the server log for details.")));
+	}
+}
+
+/*
+ * ForwardFsyncRequest
+ *		Forward a file-fsync request from a backend to the checkpointer
+ *
+ * Whenever a backend is compelled to write directly to a relation
+ * (which should be seldom, if the bgwriter is getting its job done),
+ * the backend calls this routine to pass over knowledge that the relation
+ * is dirty and must be fsync'd before next checkpoint.  We also use this
+ * opportunity to count such writes for statistical purposes.
+ *
+ * segno specifies which segment (not block!) of the relation needs to be
+ * fsync'd.  (Since the valid range is much less than BlockNumber, we can
+ * use high values for special flags; that's all internal to md.c, which
+ * see for details.)
+ *
+ * To avoid holding the lock for longer than necessary, we normally write
+ * to the requests[] queue without checking for duplicates.  The checkpointer
+ * will have to eliminate dups internally anyway.  However, if we discover
+ * that the queue is full, we make a pass over the entire queue to compact
+ * it.	This is somewhat expensive, but the alternative is for the backend
+ * to perform its own fsync, which is far more expensive in practice.  It
+ * is theoretically possible a backend fsync might still be necessary, if
+ * the queue is full and contains no duplicate entries.  In that case, we
+ * let the backend know by returning false.
+ */
+bool
+ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
+					BlockNumber segno)
+{
+	BgWriterRequest *request;
+
+	if (!IsUnderPostmaster)
+		return false;			/* probably shouldn't even get here */
+
+	if (am_checkpointer)
+		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Count all backend writes regardless of whether they fit in the queue */
+	BgWriterShmem->num_backend_writes++;
+
+	/*
+	 * If the checkpointer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.  But before
+	 * forcing that to happen, we can try to compact the checkpointer
+	 * request queue.
+	 */
+	if (BgWriterShmem->checkpointer_pid == 0 ||
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		 && !CompactCheckpointerRequestQueue()))
+	{
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync
+		 */
+		BgWriterShmem->num_backend_fsync++;
+		LWLockRelease(BgWriterCommLock);
+		return false;
+	}
+	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
+	request->rnode = rnode;
+	request->forknum = forknum;
+	request->segno = segno;
+	LWLockRelease(BgWriterCommLock);
+	return true;
+}
+
+/*
+ * CompactCheckpointerRequestQueue
+ *		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Although a full fsync request queue is not common, it can lead to severe
+ * performance problems when it does happen.  So far, this situation has
+ * only been observed to occur when the system is under heavy write load,
+ * and especially during the "sync" phase of a checkpoint.	Without this
+ * logic, each backend begins doing an fsync for every block written, which
+ * gets very expensive and can slow down the whole system.
+ *
+ * Trying to do this every time the queue is full could lose if there
+ * aren't any removable entries.  But that should be vanishingly rare in
+ * practice: there's one queue entry per shared buffer.
+ */
+static bool
+CompactCheckpointerRequestQueue(void)
+{
+	struct BgWriterSlotMapping
+	{
+		BgWriterRequest request;
+		int			slot;
+	};
+
+	int			n,
+				preserve_count;
+	int			num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactCheckpointerRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/*
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing less
+	 * copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
+	 * we do it this way.  It would be possible to be even smarter if we made
+	 * the code below understand the specific semantics of such requests (it
+	 * could blow away preceding entries that would end up being canceled
+	 * anyhow), but it's not clear that the extra complexity would buy us
+	 * anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool		found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;
+			++num_skipped;
+		}
+		slotmap->slot = n;
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (!num_skipped)
+	{
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* We found some duplicates; remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+	   (errmsg("compacted fsync request queue from %d entries to %d entries",
+			   BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
+ * AbsorbFsyncRequests
+ *		Retrieve queued fsync requests and pass them to local smgr.
+ *
+ * This is exported because it must be called during CreateCheckPoint;
+ * we have to be sure we have accepted all pending requests just before
+ * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
+ * non-checkpointer processes, do nothing if not checkpointer.
+ */
+void
+AbsorbFsyncRequests(void)
+{
+	BgWriterRequest *requests = NULL;
+	BgWriterRequest *request;
+	int			n;
+
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * We have to PANIC if we fail to absorb all the pending requests (eg,
+	 * because our hashtable runs out of memory).  This is because the system
+	 * cannot run safely if we are unable to fsync what we have been told to
+	 * fsync.  Fortunately, the hashtable is so small that the problem is
+	 * quite unlikely to arise in practice.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We try to avoid holding the lock for a long time by copying the request
+	 * array.
+	 */
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Transfer write count into pending pgstats message */
+	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
+	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
+
+	BgWriterShmem->num_backend_writes = 0;
+	BgWriterShmem->num_backend_fsync = 0;
+
+	n = BgWriterShmem->num_requests;
+	if (n > 0)
+	{
+		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
+		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
+	}
+	BgWriterShmem->num_requests = 0;
+
+	LWLockRelease(BgWriterCommLock);
+
+	for (request = requests; n > 0; request++, n--)
+		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+
+	if (requests)
+		pfree(requests);
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index df4a2aa..af68741 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -206,6 +206,7 @@ bool		restart_after_crash = true;
 /* PIDs of special child processes; 0 when not running */
 static pid_t StartupPID = 0,
 			BgWriterPID = 0,
+			CheckpointerPID = 0,
 			WalWriterPID = 0,
 			WalReceiverPID = 0,
 			AutoVacPID = 0,
@@ -277,7 +278,7 @@ typedef enum
 	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
 	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
-	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
+	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown ckpt */
 	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
@@ -462,6 +463,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()		StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
 
@@ -1029,8 +1031,8 @@ PostmasterMain(int argc, char *argv[])
 	 * CAUTION: when changing this list, check for side-effects on the signal
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
-	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, and
-	 * postmaster/syslogger.c.
+	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
+	 * postmaster/syslogger.c, and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1367,10 +1369,14 @@ ServerLoop(void)
 		 * state that prevents it, start one.  It doesn't matter if this
 		 * fails, we'll just try again later.
 		 */
-		if (BgWriterPID == 0 &&
-			(pmState == PM_RUN || pmState == PM_RECOVERY ||
-			 pmState == PM_HOT_STANDBY))
-			BgWriterPID = StartBackgroundWriter();
+		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
+			 pmState == PM_HOT_STANDBY)
+		{
+			if (BgWriterPID == 0)
+				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
+		}
 
 		/*
 		 * Likewise, if we have lost the walwriter process, try to start a new
@@ -2048,6 +2054,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
 			signal_child(BgWriterPID, SIGHUP);
+		if (CheckpointerPID != 0)
+			signal_child(CheckpointerPID, SIGHUP);
 		if (WalWriterPID != 0)
 			signal_child(WalWriterPID, SIGHUP);
 		if (WalReceiverPID != 0)
@@ -2162,7 +2170,7 @@ pmdie(SIGNAL_ARGS)
 				signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
-				/* only bgwriter is active in this state */
+				/* only bgwriter and checkpointer are active in this state */
 				pmState = PM_WAIT_BACKENDS;
 			}
 			else if (pmState == PM_RUN ||
@@ -2207,6 +2215,8 @@ pmdie(SIGNAL_ARGS)
 				signal_child(StartupPID, SIGQUIT);
 			if (BgWriterPID != 0)
 				signal_child(BgWriterPID, SIGQUIT);
+			if (CheckpointerPID != 0)
+				signal_child(CheckpointerPID, SIGQUIT);
 			if (WalWriterPID != 0)
 				signal_child(WalWriterPID, SIGQUIT);
 			if (WalReceiverPID != 0)
@@ -2337,12 +2347,14 @@ reaper(SIGNAL_ARGS)
 			}
 
 			/*
-			 * Crank up the background writer, if we didn't do that already
+			 * Crank up background tasks, if we didn't do that already
 			 * when we entered consistent recovery state.  It doesn't matter
 			 * if this fails, we'll just try again later.
 			 */
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2370,10 +2382,22 @@ reaper(SIGNAL_ARGS)
 		if (pid == BgWriterPID)
 		{
 			BgWriterPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background writer process"));
+			continue;
+		}
+
+		/*
+		 * Was it the checkpointer?
+		 */
+		if (pid == CheckpointerPID)
+		{
+			CheckpointerPID = 0;
 			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
-				 * OK, we saw normal exit of the bgwriter after it's been told
+				 * OK, we saw normal exit of the checkpointer after it's been told
 				 * to shut down.  We expect that it wrote a shutdown
 				 * checkpoint.	(If for some reason it didn't, recovery will
 				 * occur on next postmaster start.)
@@ -2410,11 +2434,11 @@ reaper(SIGNAL_ARGS)
 			else
 			{
 				/*
-				 * Any unexpected exit of the bgwriter (including FATAL exit)
+				 * Any unexpected exit of the checkpointer (including FATAL exit)
 				 * is treated as a crash.
 				 */
 				HandleChildCrash(pid, exitstatus,
-								 _("background writer process"));
+								 _("checkpointer process"));
 			}
 
 			continue;
@@ -2598,8 +2622,8 @@ CleanupBackend(int pid,
 }
 
 /*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, walwriter,
- * or autovacuum.
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter or autovacuum.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -2692,6 +2716,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the checkpointer too */
+	if (pid == CheckpointerPID)
+		CheckpointerPID = 0;
+	else if (CheckpointerPID != 0 && !FatalError)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) CheckpointerPID)));
+		signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/* Take care of the walwriter too */
 	if (pid == WalWriterPID)
 		WalWriterPID = 0;
@@ -2871,9 +2907,10 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
-		 * (including autovac workers) and no walwriter or autovac launcher.
-		 * If we are doing crash recovery then we expect the bgwriter to exit
-		 * too, otherwise not.	The archiver, stats, and syslogger processes
+		 * (including autovac workers) and no walwriter, autovac launcher
+		 * or bgwriter.  If we are doing crash recovery then we expect the
+		 * checkpointer to exit as well, otherwise not.
+		 * The archiver, stats, and syslogger processes
 		 * are disregarded since they are not connected to shared memory; we
 		 * also disregard dead_end children here. Walsenders are also
 		 * disregarded, they will be terminated later after writing the
@@ -2882,7 +2919,8 @@ PostmasterStateMachine(void)
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
 			StartupPID == 0 &&
 			WalReceiverPID == 0 &&
-			(BgWriterPID == 0 || !FatalError) &&
+			BgWriterPID == 0 &&
+			(CheckpointerPID == 0 || !FatalError) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
@@ -2904,22 +2942,22 @@ PostmasterStateMachine(void)
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
 				 * the regular children are gone, and it's time to tell the
-				 * bgwriter to do a shutdown checkpoint.
+				 * checkpointer to do a shutdown checkpoint.
 				 */
 				Assert(Shutdown > NoShutdown);
-				/* Start the bgwriter if not running */
-				if (BgWriterPID == 0)
-					BgWriterPID = StartBackgroundWriter();
+				/* Start the checkpointer if not running */
+				if (CheckpointerPID == 0)
+					CheckpointerPID = StartCheckpointer();
 				/* And tell it to shut down */
-				if (BgWriterPID != 0)
+				if (CheckpointerPID != 0)
 				{
-					signal_child(BgWriterPID, SIGUSR2);
+					signal_child(CheckpointerPID, SIGUSR2);
 					pmState = PM_SHUTDOWN;
 				}
 				else
 				{
 					/*
-					 * If we failed to fork a bgwriter, just shut down. Any
+					 * If we failed to fork a checkpointer, just shut down. Any
 					 * required cleanup will happen at next restart. We set
 					 * FatalError so that an "abnormal shutdown" message gets
 					 * logged when we exit.
@@ -2978,6 +3016,7 @@ PostmasterStateMachine(void)
 			Assert(StartupPID == 0);
 			Assert(WalReceiverPID == 0);
 			Assert(BgWriterPID == 0);
+			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
 			/* syslogger is not considered here */
@@ -4157,6 +4196,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		 */
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(CheckpointerPID == 0);
+		CheckpointerPID = StartCheckpointer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -4443,6 +4484,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 				   (errmsg("could not fork background writer process: %m")));
 				break;
+			case CheckpointerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork checkpointer process: %m")));
+				break;
 			case WalWriterProcess:
 				ereport(LOG,
 						(errmsg("could not fork WAL writer process: %m")));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8647edd..184e820 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1278,11 +1278,9 @@ BufferSync(int flags)
 					break;
 
 				/*
-				 * Perform normal bgwriter duties and sleep to throttle our
-				 * I/O rate.
+				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags,
-									 (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
 			}
 		}
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3015885..a761369 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -38,7 +38,7 @@
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
- * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
  * See comments there before making changes here.
  */
@@ -77,7 +77,7 @@
  *	Inactive segments are those that once contained data but are currently
  *	not needed because of an mdtruncate() operation.  The reason for leaving
  *	them present at size zero, rather than unlinking them, is that other
- *	backends and/or the bgwriter might be holding open file references to
+ *	backends and/or the checkpointer might be holding open file references to
  *	such segments.	If the relation expands again after mdtruncate(), such
  *	that a deactivated segment becomes active again, it is important that
  *	such file references still be valid --- else data might get written
@@ -111,7 +111,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
 
 
 /*
- * In some contexts (currently, standalone backends and the bgwriter process)
+ * In some contexts (currently, standalone backends and the checkpointer process)
  * we keep track of pending fsync operations: we need to remember all relation
  * segments that have been written since the last checkpoint, so that we can
  * fsync them down to disk before completing the next checkpoint.  This hash
@@ -123,7 +123,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
  * a hash table, because we don't expect there to be any duplicate requests.
  *
  * (Regular backends do not track pending operations locally, but forward
- * them to the bgwriter.)
+ * them to the checkpointer.)
  */
 typedef struct
 {
@@ -194,7 +194,7 @@ mdinit(void)
 	 * Create pending-operations hashtable if we need it.  Currently, we need
 	 * it if we are standalone (not under a postmaster) OR if we are a
 	 * bootstrap-mode subprocess of a postmaster (that is, a startup or
-	 * bgwriter process).
+	 * checkpointer process).
 	 */
 	if (!IsUnderPostmaster || IsBootstrapProcessingMode())
 	{
@@ -214,10 +214,10 @@ mdinit(void)
 }
 
 /*
- * In archive recovery, we rely on bgwriter to do fsyncs, but we will have
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
  * already created the pendingOpsTable during initialization of the startup
  * process.  Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to bgwriter.
+ * subsequent requests will be forwarded to checkpointer.
  */
 void
 SetForwardFsyncRequests(void)
@@ -765,9 +765,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: this assumption could only be wrong if another backend has
 	 * truncated the relation.	We rely on higher code levels to handle that
 	 * scenario by closing and re-opening the md fd, which is handled via
-	 * relcache flush.	(Since the bgwriter doesn't participate in relcache
+	 * relcache flush.	(Since the checkpointer doesn't participate in relcache
 	 * flush, it could have segment chain entries for inactive segments;
-	 * that's OK because the bgwriter never needs to compute relation size.)
+	 * that's OK because the checkpointer never needs to compute relation size.)
 	 */
 	while (v->mdfd_chain != NULL)
 	{
@@ -957,7 +957,7 @@ mdsync(void)
 		elog(ERROR, "cannot sync without a pendingOpsTable");
 
 	/*
-	 * If we are in the bgwriter, the sync had better include all fsync
+	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.	The tightest
 	 * race condition that could occur is that a buffer that must be written
 	 * and fsync'd for the checkpoint could have been dumped by a backend just
@@ -1033,7 +1033,7 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
+			 * If in checkpointer, we want to absorb pending requests every so
 			 * often to prevent overflow of the fsync request queue.  It is
 			 * unspecified whether newly-added entries will be visited by
 			 * hash_seq_search, but we don't care since we don't need to
@@ -1070,9 +1070,9 @@ mdsync(void)
 				 * say "but an unreferenced SMgrRelation is still a leak!" Not
 				 * really, because the only case in which a checkpoint is done
 				 * by a process that isn't about to shut down is in the
-				 * bgwriter, and it will periodically do smgrcloseall(). This
+				 * checkpointer, and it will periodically do smgrcloseall(). This
 				 * fact justifies our not closing the reln in the success path
-				 * either, which is a good thing since in non-bgwriter cases
+				 * either, which is a good thing since in non-checkpointer cases
 				 * we couldn't safely do that.)  Furthermore, in many cases
 				 * the relation will have been dirtied through this same smgr
 				 * relation, and so we can save a file open/close cycle.
@@ -1301,7 +1301,7 @@ register_unlink(RelFileNodeBackend rnode)
 	else
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the request
+		 * Notify the checkpointer about it.  If we fail to queue the request
 		 * message, we have to sleep and try again, because we can't simply
 		 * delete the file now.  Ugly, but hopefully won't happen often.
 		 *
@@ -1315,10 +1315,10 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from bgwriter side of fsync request
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
  *
  * We stuff most fsync requests into the local hash table for execution
- * during the bgwriter's next checkpoint.  UNLINK requests go into a
+ * during the checkpointer's next checkpoint.  UNLINK requests go into a
  * separate linked list, however, because they get processed separately.
  *
  * The range of possible segment numbers is way less than the range of
@@ -1460,20 +1460,20 @@ ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
 	else if (IsUnderPostmaster)
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the revoke
+		 * Notify the checkpointer about it.  If we fail to queue the revoke
 		 * message, we have to sleep and try again ... ugly, but hopefully
 		 * won't happen often.
 		 *
 		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
 		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the bgwriter
+		 * which would be bad, so I'm inclined to assume that the checkpointer
 		 * will always empty the queue soon.
 		 */
 		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
 			pg_usleep(10000L);	/* 10 msec seems a good number */
 
 		/*
-		 * Note we don't wait for the bgwriter to actually absorb the revoke
+		 * Note we don't wait for the checkpointer to actually absorb the revoke
 		 * message; see mdsync() for the implications.
 		 */
 	}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 4eaa243..cb43879 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -256,7 +256,7 @@ typedef struct RmgrData
 extern const RmgrData RmgrTable[];
 
 /*
- * Exported to support xlog switching from bgwriter
+ * Exported to support xlog switching from checkpointer
  */
 extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
diff --git a/src/include/bootstrap/bootstrap.h b/src/include/bootstrap/bootstrap.h
index cee9bd1..6153b7a 100644
--- a/src/include/bootstrap/bootstrap.h
+++ b/src/include/bootstrap/bootstrap.h
@@ -22,6 +22,7 @@ typedef enum
 	BootstrapProcess,
 	StartupProcess,
 	BgWriterProcess,
+	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..c05901e 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,6 +23,7 @@ extern int	CheckPointWarning;
 extern double CheckPointCompletionTarget;
 
 extern void BackgroundWriterMain(void);
+extern void CheckpointerMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 46ec625..6e798b1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -190,11 +190,11 @@ extern PROC_HDR *ProcGlobal;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer and WAL writer run during normal operation. Startup
- * process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 3 slots.
+ * Background writer, checkpointer and WAL writer run during normal operation.
+ * Startup process and WAL receiver also consume 2 slots, but WAL writer is
+ * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		3
+#define NUM_AUXILIARY_PROCS		4
 
 
 /* configurable options */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2a27e0b..d5afe01 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -19,7 +19,7 @@
 
 /*
  * Reasons for signalling a Postgres child process (a backend or an auxiliary
- * process, like bgwriter).  We can cope with concurrent signals for different
+ * process, like checkpointer).  We can cope with concurrent signals for different
  * reasons.  However, if the same reason is signaled multiple times in quick
  * succession, the process is likely to observe only one notification of it.
  * This is okay for the present uses.
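
For readers following the patch, the duplicate-removal rule used by CompactCheckpointerRequestQueue can be looked at in isolation. What follows is a minimal standalone sketch, not code from the patch: DemoRequest, demo_request_equal and demo_compact_queue are hypothetical names, the struct fields only stand in for rnode/forknum/segno, and a simple quadratic scan replaces the temporary hash table the patch builds. The rule itself is the one described in the patch comments: a request is dropped only if a later, identical request follows it, so the last occurrence survives and the order of the survivors is unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for BgWriterRequest; field names are illustrative. */
typedef struct
{
	unsigned	rel;			/* stands in for rnode */
	int			fork;			/* stands in for forknum */
	unsigned	segno;			/* segment number */
} DemoRequest;

/* Field-by-field comparison of two requests. */
static bool
demo_request_equal(const DemoRequest *a, const DemoRequest *b)
{
	return a->rel == b->rel && a->fork == b->fork && a->segno == b->segno;
}

/*
 * Drop every request that is followed by a later, identical request.
 * Returns the new queue length; survivors keep their relative order.
 * Assumption: an O(n^2) scan is acceptable for a sketch; the patch uses
 * a hash table to find duplicates in one pass.
 */
static int
demo_compact_queue(DemoRequest *queue, int n)
{
	bool	   *skip;
	int			i,
				j,
				keep = 0;

	assert(queue != NULL || n == 0);
	if (n <= 0)
		return 0;

	skip = calloc((size_t) n, sizeof(bool));
	if (skip == NULL)
		return n;				/* cannot compact; leave queue untouched */

	/* Mark entry i as skippable if an identical entry appears after it. */
	for (i = 0; i < n; i++)
	{
		for (j = i + 1; j < n; j++)
		{
			if (demo_request_equal(&queue[i], &queue[j]))
			{
				skip[i] = true;
				break;
			}
		}
	}

	/* Slide the surviving entries down to the front of the array. */
	for (i = 0; i < n; i++)
	{
		if (skip[i])
			continue;
		queue[keep++] = queue[i];
	}

	free(skip);
	return keep;
}
```

With a queue holding requests for relations 1, 2, 1, 3, 2 (same fork, and segment numbers matching within each duplicate pair), demo_compact_queue returns 3 and leaves 1, 3, 2 at the front of the array. Keeping the last occurrence rather than the first is what lets the real function stay correct when a FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request sits between two identical entries.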

#19

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Simon Riggs (#1)

1 attachment(s)

Re: Separating bgwriter and checkpointer

On Thu, Sep 15, 2011 at 11:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

> Current patch has a bug at shutdown I've not located yet, but seems
> likely is a simple error. That is mainly because for personal reasons
> I've not been able to work on the patch recently. I expect to be able
> to fix that later in the CF.

Full patch, with bug fixed. (v2)

I'm now free to take review comments and make changes.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

bgwriter_split.v2.patch (application/octet-stream)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4fe08df..f9b839c 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -315,6 +315,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case BgWriterProcess:
 				statmsg = "writer process";
 				break;
+			case CheckpointerProcess:
+				statmsg = "checkpointer process";
+				break;
 			case WalWriterProcess:
 				statmsg = "wal writer process";
 				break;
@@ -415,6 +418,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			BackgroundWriterMain();
 			proc_exit(1);		/* should never return */
 
+		case CheckpointerProcess:
+			/* don't set signals, checkpointer has its own agenda */
+			CheckpointerMain();
+			proc_exit(1);		/* should never return */
+
 		case WalWriterProcess:
 			/* don't set signals, walwriter has its own agenda */
 			InitXLOGAccess();
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 0767e97..e7414d2 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
-	syslogger.o walwriter.o
+	syslogger.o walwriter.o checkpointer.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 2d0b639..e0f3167 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -10,20 +10,13 @@
  * still empowered to issue writes if the bgwriter fails to maintain enough
  * clean shared buffers.
  *
- * The bgwriter is also charged with handling all checkpoints.	It will
- * automatically dispatch a checkpoint after a certain amount of time has
- * elapsed since the last one, and it can be signaled to perform requested
- * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
- * so many WAL segments is implemented by having backends signal the bgwriter
- * when they fill WAL segments; the bgwriter itself doesn't watch for the
- * condition.)
+ * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
  *
  * The bgwriter is started by the postmaster as soon as the startup subprocess
  * finishes, or as soon as recovery begins if we are doing archive recovery.
  * It remains alive until the postmaster commands it to terminate.
- * Normal termination is by SIGUSR2, which instructs the bgwriter to execute
- * a shutdown checkpoint and then exit(0).	(All backends must be stopped
- * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
+ * Normal termination is by SIGUSR2, which instructs the bgwriter to exit(0).
+ * Emergency termination is by SIGQUIT; like any
  * backend, the bgwriter will simply abort and exit on SIGQUIT.
  *
  * If the bgwriter exits unexpectedly, the postmaster treats that the same
@@ -54,7 +47,6 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
-#include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -67,96 +59,15 @@
 #include "utils/resowner.h"
 
 
-/*----------
- * Shared memory area for communication between bgwriter and backends
- *
- * The ckpt counters allow backends to watch for completion of a checkpoint
- * request they send.  Here's how it works:
- *	* At start of a checkpoint, bgwriter reads (and clears) the request flags
- *	  and increments ckpt_started, while holding ckpt_lck.
- *	* On completion of a checkpoint, bgwriter sets ckpt_done to
- *	  equal ckpt_started.
- *	* On failure of a checkpoint, bgwriter increments ckpt_failed
- *	  and sets ckpt_done to equal ckpt_started.
- *
- * The algorithm for backends is:
- *	1. Record current values of ckpt_failed and ckpt_started, and
- *	   set request flags, while holding ckpt_lck.
- *	2. Send signal to request checkpoint.
- *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
- *	   begun since you started this algorithm (although *not* that it was
- *	   specifically initiated by your signal), and that it is using your flags.
- *	4. Record new value of ckpt_started.
- *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
- *	   arithmetic here in case counters wrap around.)  Now you know a
- *	   checkpoint has started and completed, but not whether it was
- *	   successful.
- *	6. If ckpt_failed is different from the originally saved value,
- *	   assume request failed; otherwise it was definitely successful.
- *
- * ckpt_flags holds the OR of the checkpoint request flags sent by all
- * requesting backends since the last checkpoint start.  The flags are
- * chosen so that OR'ing is the correct way to combine multiple requests.
- *
- * num_backend_writes is used to count the number of buffer writes performed
- * by non-bgwriter processes.  This counter should be wide enough that it
- * can't overflow during a single bgwriter cycle.  num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the background writer failed to absorb their request.
- *
- * The requests array holds fsync requests sent by backends and not yet
- * absorbed by the bgwriter.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by BgWriterCommLock.
- *----------
- */
-typedef struct
-{
-	RelFileNodeBackend rnode;
-	ForkNumber	forknum;
-	BlockNumber segno;			/* see md.c for special values */
-	/* might add a real request-type field later; not needed yet */
-} BgWriterRequest;
-
-typedef struct
-{
-	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
-
-	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
-
-	int			ckpt_started;	/* advances when checkpoint starts */
-	int			ckpt_done;		/* advances when checkpoint done */
-	int			ckpt_failed;	/* advances when checkpoint fails */
-
-	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
-
-	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
-	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
-
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
-	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
-} BgWriterShmemStruct;
-
-static BgWriterShmemStruct *BgWriterShmem;
-
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
-
 /*
  * GUC parameters
  */
 int			BgWriterDelay = 200;
-int			CheckPointTimeout = 300;
-int			CheckPointWarning = 30;
-double		CheckPointCompletionTarget = 0.5;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
 static volatile sig_atomic_t shutdown_requested = false;
 
 /*
@@ -164,29 +75,14 @@ static volatile sig_atomic_t shutdown_requested = false;
  */
 static bool am_bg_writer = false;
 
-static bool ckpt_active = false;
-
-/* these values are valid when ckpt_active is true: */
-static pg_time_t ckpt_start_time;
-static XLogRecPtr ckpt_start_recptr;
-static double ckpt_cached_elapsed;
-
-static pg_time_t last_checkpoint_time;
-static pg_time_t last_xlog_switch_time;
-
 /* Prototypes for private functions */
 
-static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
-static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
 static void bg_quickdie(SIGNAL_ARGS);
 static void BgSigHupHandler(SIGNAL_ARGS);
-static void ReqCheckpointHandler(SIGNAL_ARGS);
 static void ReqShutdownHandler(SIGNAL_ARGS);
 
 
@@ -202,7 +98,6 @@ BackgroundWriterMain(void)
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext bgwriter_context;
 
-	BgWriterShmem->bgwriter_pid = MyProcPid;
 	am_bg_writer = true;
 
 	/*
@@ -228,8 +123,8 @@ BackgroundWriterMain(void)
 	 * process to participate in ProcSignal signalling.
 	 */
 	pqsignal(SIGHUP, BgSigHupHandler);	/* set flag to read config file */
-	pqsignal(SIGINT, ReqCheckpointHandler);		/* request checkpoint */
-	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
+	pqsignal(SIGINT, SIG_IGN);			/* as of 9.2 no longer requests checkpoint */
+	pqsignal(SIGTERM, SIG_IGN); 		/* ignore SIGTERM */
 	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
@@ -249,11 +144,6 @@ BackgroundWriterMain(void)
 	sigdelset(&BlockSig, SIGQUIT);
 
 	/*
-	 * Initialize so that first time-driven event happens at the correct time.
-	 */
-	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
-
-	/*
 	 * Create a resource owner to keep track of our resources (currently only
 	 * buffer pins).
 	 */
@@ -305,20 +195,6 @@ BackgroundWriterMain(void)
 		AtEOXact_Files();
 		AtEOXact_HashTables(false);
 
-		/* Warn any waiting backends that the checkpoint failed. */
-		if (ckpt_active)
-		{
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_failed++;
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			ckpt_active = false;
-		}
-
 		/*
 		 * Now return to normal top-level context and clear ErrorContext for
 		 * next time.
@@ -361,19 +237,11 @@ BackgroundWriterMain(void)
 	if (RecoveryInProgress())
 		ThisTimeLineID = GetRecoveryTargetTLI();
 
-	/* Do this once before starting the loop, then just at SIGHUP time. */
-	SyncRepUpdateSyncStandbysDefined();
-
 	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		do_checkpoint = false;
-		int			flags = 0;
-		pg_time_t	now;
-		int			elapsed_secs;
-
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
@@ -381,23 +249,11 @@ BackgroundWriterMain(void)
 		if (!PostmasterIsAlive())
 			exit(1);
 
-		/*
-		 * Process any requests or signals received recently.
-		 */
-		AbsorbFsyncRequests();
-
 		if (got_SIGHUP)
 		{
 			got_SIGHUP = false;
 			ProcessConfigFile(PGC_SIGHUP);
 			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-		if (checkpoint_requested)
-		{
-			checkpoint_requested = false;
-			do_checkpoint = true;
-			BgWriterStats.m_requested_checkpoints++;
 		}
 		if (shutdown_requested)
 		{
@@ -406,142 +262,14 @@ BackgroundWriterMain(void)
 			 * control back to the sigsetjmp block above
 			 */
 			ExitOnAnyError = true;
-			/* Close down the database */
-			ShutdownXLOG(0, 0);
 			/* Normal exit from the bgwriter is here */
 			proc_exit(0);		/* done */
 		}
 
 		/*
-		 * Force a checkpoint if too much time has elapsed since the last one.
-		 * Note that we count a timed checkpoint in stats only when this
-		 * occurs without an external request, but we set the CAUSE_TIME flag
-		 * bit even if there is also an external request.
+		 * Do one cycle of dirty-buffer writing.
 		 */
-		now = (pg_time_t) time(NULL);
-		elapsed_secs = now - last_checkpoint_time;
-		if (elapsed_secs >= CheckPointTimeout)
-		{
-			if (!do_checkpoint)
-				BgWriterStats.m_timed_checkpoints++;
-			do_checkpoint = true;
-			flags |= CHECKPOINT_CAUSE_TIME;
-		}
-
-		/*
-		 * Do a checkpoint if requested, otherwise do one cycle of
-		 * dirty-buffer writing.
-		 */
-		if (do_checkpoint)
-		{
-			bool		ckpt_performed = false;
-			bool		do_restartpoint;
-
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			/*
-			 * Check if we should perform a checkpoint or a restartpoint. As a
-			 * side-effect, RecoveryInProgress() initializes TimeLineID if
-			 * it's not set yet.
-			 */
-			do_restartpoint = RecoveryInProgress();
-
-			/*
-			 * Atomically fetch the request flags to figure out what kind of a
-			 * checkpoint we should perform, and increase the started-counter
-			 * to acknowledge that we've started a new checkpoint.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			flags |= bgs->ckpt_flags;
-			bgs->ckpt_flags = 0;
-			bgs->ckpt_started++;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
-			/*
-			 * We will warn if (a) too soon since last checkpoint (whatever
-			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
-			 * since the last checkpoint start.  Note in particular that this
-			 * implementation will not generate warnings caused by
-			 * CheckPointTimeout < CheckPointWarning.
-			 */
-			if (!do_restartpoint &&
-				(flags & CHECKPOINT_CAUSE_XLOG) &&
-				elapsed_secs < CheckPointWarning)
-				ereport(LOG,
-						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
-				"checkpoints are occurring too frequently (%d seconds apart)",
-									   elapsed_secs,
-									   elapsed_secs),
-						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
-
-			/*
-			 * Initialize bgwriter-private variables used during checkpoint.
-			 */
-			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
-			ckpt_start_time = now;
-			ckpt_cached_elapsed = 0;
-
-			/*
-			 * Do the checkpoint.
-			 */
-			if (!do_restartpoint)
-			{
-				CreateCheckPoint(flags);
-				ckpt_performed = true;
-			}
-			else
-				ckpt_performed = CreateRestartPoint(flags);
-
-			/*
-			 * After any checkpoint, close all smgr files.	This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
-			 */
-			smgrcloseall();
-
-			/*
-			 * Indicate checkpoint completion to any waiting backends.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (ckpt_performed)
-			{
-				/*
-				 * Note we record the checkpoint start time not end time as
-				 * last_checkpoint_time.  This is so that time-driven
-				 * checkpoints happen at a predictable spacing.
-				 */
-				last_checkpoint_time = now;
-			}
-			else
-			{
-				/*
-				 * We were not able to perform the restartpoint (checkpoints
-				 * throw an ERROR in case of error).  Most likely because we
-				 * have not received any new checkpoint WAL records since the
-				 * last restartpoint. Try again in 15 s.
-				 */
-				last_checkpoint_time = now - CheckPointTimeout + 15;
-			}
-
-			ckpt_active = false;
-		}
-		else
-			BgBufferSync();
-
-		/* Check for archive_timeout and switch xlog files if necessary. */
-		CheckArchiveTimeout();
+		BgBufferSync();
 
 		/* Nap for the configured time. */
 		BgWriterNap();
@@ -549,61 +277,6 @@ BackgroundWriterMain(void)
 }
 
 /*
- * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
- *
- * This will switch to a new WAL file and force an archive file write
- * if any activity is recorded in the current WAL file, including just
- * a single checkpoint record.
- */
-static void
-CheckArchiveTimeout(void)
-{
-	pg_time_t	now;
-	pg_time_t	last_time;
-
-	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
-		return;
-
-	now = (pg_time_t) time(NULL);
-
-	/* First we do a quick check using possibly-stale local state. */
-	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
-		return;
-
-	/*
-	 * Update local state ... note that last_xlog_switch_time is the last time
-	 * a switch was performed *or requested*.
-	 */
-	last_time = GetLastSegSwitchTime();
-
-	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
-
-	/* Now we can do the real check */
-	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
-	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
-		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
-		 */
-		if ((switchpoint.xrecoff % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
-
-		/*
-		 * Update state in any case, so we don't retry constantly when the
-		 * system is idle.
-		 */
-		last_xlog_switch_time = now;
-	}
-}
-
-/*
  * BgWriterNap -- Nap for the configured time or until a signal is received.
  */
 static void
@@ -624,185 +297,24 @@ BgWriterNap(void)
 	 * respond reasonably promptly when someone signals us, break down the
 	 * sleep into 1-second increments, and check for interrupts after each
 	 * nap.
-	 *
-	 * We absorb pending requests after each short sleep.
 	 */
-	if (bgwriter_lru_maxpages > 0 || ckpt_active)
+	if (bgwriter_lru_maxpages > 0)
 		udelay = BgWriterDelay * 1000L;
-	else if (XLogArchiveTimeout > 0)
-		udelay = 1000000L;		/* One second */
 	else
 		udelay = 10000000L;		/* Ten seconds */
 
 	while (udelay > 999999L)
 	{
-		if (got_SIGHUP || shutdown_requested ||
-		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
+		if (got_SIGHUP || shutdown_requested)
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
 		udelay -= 1000000L;
 	}
 
-	if (!(got_SIGHUP || shutdown_requested ||
-	  (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)))
+	if (!(got_SIGHUP || shutdown_requested))
 		pg_usleep(udelay);
 }
 
-/*
- * Returns true if an immediate checkpoint request is pending.	(Note that
- * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
- * there is one pending behind it.)
- */
-static bool
-ImmediateCheckpointRequested(void)
-{
-	if (checkpoint_requested)
-	{
-		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-		/*
-		 * We don't need to acquire the ckpt_lck in this case because we're
-		 * only looking at a single flag bit.
-		 */
-		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
-			return true;
-	}
-	return false;
-}
-
-/*
- * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
- *
- * This function is called after each page write performed by BufferSync().
- * It is responsible for keeping the bgwriter's normal activities in
- * progress during a long checkpoint, and for throttling BufferSync()'s
- * write rate to hit checkpoint_completion_target.
- *
- * The checkpoint request flags should be passed in; currently the only one
- * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
- *
- * 'progress' is an estimate of how much of the work has been done, as a
- * fraction between 0.0 meaning none, and 1.0 meaning all done.
- */
-void
-CheckpointWriteDelay(int flags, double progress)
-{
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
-	/* Do nothing if checkpoint is being executed by non-bgwriter process */
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * Perform the usual bgwriter duties and take a nap, unless we're behind
-	 * schedule, in which case we just try to catch up as quickly as possible.
-	 */
-	if (!(flags & CHECKPOINT_IMMEDIATE) &&
-		!shutdown_requested &&
-		!ImmediateCheckpointRequested() &&
-		IsCheckpointOnSchedule(progress))
-	{
-		if (got_SIGHUP)
-		{
-			got_SIGHUP = false;
-			ProcessConfigFile(PGC_SIGHUP);
-			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-
-		BgBufferSync();
-		CheckArchiveTimeout();
-		BgWriterNap();
-	}
-	else if (--absorb_counter <= 0)
-	{
-		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
-		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-	}
-}
-
-/*
- * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- *		 in time?
- *
- * Compares the current progress against the time/segments elapsed since last
- * checkpoint, and returns true if the progress we've made this far is greater
- * than the elapsed time/segments.
- */
-static bool
-IsCheckpointOnSchedule(double progress)
-{
-	XLogRecPtr	recptr;
-	struct timeval now;
-	double		elapsed_xlogs,
-				elapsed_time;
-
-	Assert(ckpt_active);
-
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
-
-	/*
-	 * Check against the cached value first. Only do the more expensive
-	 * calculations once we reach the target previously calculated. Since
-	 * neither time or WAL insert pointer moves backwards, a freshly
-	 * calculated value can only be greater than or equal to the cached value.
-	 */
-	if (progress < ckpt_cached_elapsed)
-		return false;
-
-	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
-	 *
-	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
-	 */
-	if (!RecoveryInProgress())
-	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
-	}
-
-	/*
-	 * Check progress against time elapsed and checkpoint_timeout.
-	 */
-	gettimeofday(&now, NULL);
-	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
-					now.tv_usec / 1000000.0) / CheckPointTimeout;
-
-	if (progress < elapsed_time)
-	{
-		ckpt_cached_elapsed = elapsed_time;
-		return false;
-	}
-
-	/* It looks like we're on schedule. */
-	return true;
-}
-
-
 /* --------------------------------
  *		signal handler routines
  * --------------------------------
@@ -847,441 +359,9 @@ BgSigHupHandler(SIGNAL_ARGS)
 	got_SIGHUP = true;
 }
 
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	checkpoint_requested = true;
-}
-
 /* SIGUSR2: set flag to run a shutdown checkpoint and exit */
 static void
 ReqShutdownHandler(SIGNAL_ARGS)
 {
 	shutdown_requested = true;
 }
-
-
-/* --------------------------------
- *		communication with backends
- * --------------------------------
- */
-
-/*
- * BgWriterShmemSize
- *		Compute space needed for bgwriter-related shared memory
- */
-Size
-BgWriterShmemSize(void)
-{
-	Size		size;
-
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
-	size = offsetof(BgWriterShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
-
-	return size;
-}
-
-/*
- * BgWriterShmemInit
- *		Allocate and initialize bgwriter-related shared memory
- */
-void
-BgWriterShmemInit(void)
-{
-	bool		found;
-
-	BgWriterShmem = (BgWriterShmemStruct *)
-		ShmemInitStruct("Background Writer Data",
-						BgWriterShmemSize(),
-						&found);
-
-	if (!found)
-	{
-		/* First time through, so initialize */
-		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
-		SpinLockInit(&BgWriterShmem->ckpt_lck);
-		BgWriterShmem->max_requests = NBuffers;
-	}
-}
-
-/*
- * RequestCheckpoint
- *		Called in backend processes to request a checkpoint
- *
- * flags is a bitwise OR of the following:
- *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
- *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
- *		ignoring checkpoint_completion_target parameter.
- *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
- *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
- *		just signal bgwriter to do it, and return).
- *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
- *		(This affects logging, and in particular enables CheckPointWarning.)
- */
-void
-RequestCheckpoint(int flags)
-{
-	/* use volatile pointer to prevent code rearrangement */
-	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-	int			ntries;
-	int			old_failed,
-				old_started;
-
-	/*
-	 * If in a standalone backend, just do it ourselves.
-	 */
-	if (!IsPostmasterEnvironment)
-	{
-		/*
-		 * There's no point in doing slow checkpoints in a standalone backend,
-		 * because there's no other backends the checkpoint could disrupt.
-		 */
-		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
-
-		/*
-		 * After any checkpoint, close all smgr files.	This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
-
-		return;
-	}
-
-	/*
-	 * Atomically set the request flags, and take a snapshot of the counters.
-	 * When we see ckpt_started > old_started, we know the flags we set here
-	 * have been seen by bgwriter.
-	 *
-	 * Note that we OR the flags with any existing flags, to avoid overriding
-	 * a "stronger" request by another backend.  The flag senses must be
-	 * chosen to make this work!
-	 */
-	SpinLockAcquire(&bgs->ckpt_lck);
-
-	old_failed = bgs->ckpt_failed;
-	old_started = bgs->ckpt_started;
-	bgs->ckpt_flags |= flags;
-
-	SpinLockRelease(&bgs->ckpt_lck);
-
-	/*
-	 * Send signal to request checkpoint.  It's possible that the bgwriter
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  Also, if not told to wait for the checkpoint to
-	 * occur, we consider failure to send the signal to be nonfatal and merely
-	 * LOG it.
-	 */
-	for (ntries = 0;; ntries++)
-	{
-		if (BgWriterShmem->bgwriter_pid == 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-				"could not request checkpoint because bgwriter not running");
-				break;
-			}
-		}
-		else if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
-				break;
-			}
-		}
-		else
-			break;				/* signal sent successfully */
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-	}
-
-	/*
-	 * If requested, wait for completion.  We detect completion according to
-	 * the algorithm given above.
-	 */
-	if (flags & CHECKPOINT_WAIT)
-	{
-		int			new_started,
-					new_failed;
-
-		/* Wait for a new checkpoint to start. */
-		for (;;)
-		{
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_started = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_started != old_started)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		/*
-		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
-		 */
-		for (;;)
-		{
-			int			new_done;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_done = bgs->ckpt_done;
-			new_failed = bgs->ckpt_failed;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_done - new_started >= 0)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		if (new_failed != old_failed)
-			ereport(ERROR,
-					(errmsg("checkpoint request failed"),
-					 errhint("Consult recent messages in the server log for details.")));
-	}
-}
-
-/*
- * ForwardFsyncRequest
- *		Forward a file-fsync request from a backend to the bgwriter
- *
- * Whenever a backend is compelled to write directly to a relation
- * (which should be seldom, if the bgwriter is getting its job done),
- * the backend calls this routine to pass over knowledge that the relation
- * is dirty and must be fsync'd before next checkpoint.  We also use this
- * opportunity to count such writes for statistical purposes.
- *
- * segno specifies which segment (not block!) of the relation needs to be
- * fsync'd.  (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
- *
- * To avoid holding the lock for longer than necessary, we normally write
- * to the requests[] queue without checking for duplicates.  The bgwriter
- * will have to eliminate dups internally anyway.  However, if we discover
- * that the queue is full, we make a pass over the entire queue to compact
- * it.	This is somewhat expensive, but the alternative is for the backend
- * to perform its own fsync, which is far more expensive in practice.  It
- * is theoretically possible a backend fsync might still be necessary, if
- * the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
- */
-bool
-ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
-					BlockNumber segno)
-{
-	BgWriterRequest *request;
-
-	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
-
-	if (am_bg_writer)
-		elog(ERROR, "ForwardFsyncRequest must not be called in bgwriter");
-
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Count all backend writes regardless of if they fit in the queue */
-	BgWriterShmem->num_backend_writes++;
-
-	/*
-	 * If the background writer isn't running or the request queue is full,
-	 * the backend will have to perform its own fsync request.	But before
-	 * forcing that to happen, we can try to compact the background writer
-	 * request queue.
-	 */
-	if (BgWriterShmem->bgwriter_pid == 0 ||
-		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
-		 && !CompactBgwriterRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		BgWriterShmem->num_backend_fsync++;
-		LWLockRelease(BgWriterCommLock);
-		return false;
-	}
-	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-	LWLockRelease(BgWriterCommLock);
-	return true;
-}
-
-/*
- * CompactBgwriterRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.	Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
- *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
- */
-static bool
-CompactBgwriterRequestQueue()
-{
-	struct BgWriterSlotMapping
-	{
-		BgWriterRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold BgWriterCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(BgWriterCommLock));
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(BgWriterRequest);
-	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
-	ctl.hash = tag_hash;
-	htab = hash_create("CompactBgwriterRequestQueue",
-					   BgWriterShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_FUNCTION);
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		BgWriterRequest *request;
-		struct BgWriterSlotMapping *slotmap;
-		bool		found;
-
-		request = &BgWriterShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			skip_slot[slotmap->slot] = true;
-			++num_skipped;
-		}
-		slotmap->slot = n;
-	}
-
-	/* Done with the hash table. */
-	hash_destroy(htab);
-
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
-	{
-		pfree(skip_slot);
-		return false;
-	}
-
-	/* We found some duplicates; remove them. */
-	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		if (skip_slot[n])
-			continue;
-		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
-	}
-	ereport(DEBUG1,
-	   (errmsg("compacted fsync request queue from %d entries to %d entries",
-			   BgWriterShmem->num_requests, preserve_count)));
-	BgWriterShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
-}
-
-/*
- * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
- *
- * This is exported because it must be called during CreateCheckPoint;
- * we have to be sure we have accepted all pending requests just before
- * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
- * non-bgwriter processes, do nothing if not bgwriter.
- */
-void
-AbsorbFsyncRequests(void)
-{
-	BgWriterRequest *requests = NULL;
-	BgWriterRequest *request;
-	int			n;
-
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * We have to PANIC if we fail to absorb all the pending requests (eg,
-	 * because our hashtable runs out of memory).  This is because the system
-	 * cannot run safely if we are unable to fsync what we have been told to
-	 * fsync.  Fortunately, the hashtable is so small that the problem is
-	 * quite unlikely to arise in practice.
-	 */
-	START_CRIT_SECTION();
-
-	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array.
-	 */
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Transfer write count into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
-
-	BgWriterShmem->num_backend_writes = 0;
-	BgWriterShmem->num_backend_fsync = 0;
-
-	n = BgWriterShmem->num_requests;
-	if (n > 0)
-	{
-		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
-		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
-	}
-	BgWriterShmem->num_requests = 0;
-
-	LWLockRelease(BgWriterCommLock);
-
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
-
-	if (requests)
-		pfree(requests);
-
-	END_CRIT_SECTION();
-}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
new file mode 100644
index 0000000..32eb191
--- /dev/null
+++ b/src/backend/postmaster/checkpointer.c
@@ -0,0 +1,1229 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.c
+ *
+ * The checkpointer is new as of Postgres 9.2.  It handles all checkpoints.
+ * Checkpoints are automatically dispatched after a certain amount of time has
+ * elapsed since the last one, and it can be signaled to perform requested
+ * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
+ * so many WAL segments is implemented by having backends signal when they
+ * fill WAL segments; the checkpointer itself doesn't watch for the
+ * condition.)
+ *
+ * The checkpointer is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGUSR2, which instructs the checkpointer to execute
+ * a shutdown checkpoint and then exit(0).	(All backends must be stopped
+ * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
+ * backend, the checkpointer will simply abort and exit on SIGQUIT.
+ *
+ * If the checkpointer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.  (Even if
+ * shared memory isn't corrupted, we have lost information about which
+ * files need to be fsync'd for the next checkpoint, and so a system
+ * restart needs to be forced.)
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/checkpointer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+/*----------
+ * Shared memory area for communication between checkpointer and backends
+ *
+ * The ckpt counters allow backends to watch for completion of a checkpoint
+ * request they send.  Here's how it works:
+ *	* At start of a checkpoint, checkpointer reads (and clears) the request flags
+ *	  and increments ckpt_started, while holding ckpt_lck.
+ *	* On completion of a checkpoint, checkpointer sets ckpt_done to
+ *	  equal ckpt_started.
+ *	* On failure of a checkpoint, checkpointer increments ckpt_failed
+ *	  and sets ckpt_done to equal ckpt_started.
+ *
+ * The algorithm for backends is:
+ *	1. Record current values of ckpt_failed and ckpt_started, and
+ *	   set request flags, while holding ckpt_lck.
+ *	2. Send signal to request checkpoint.
+ *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
+ *	   begun since you started this algorithm (although *not* that it was
+ *	   specifically initiated by your signal), and that it is using your flags.
+ *	4. Record new value of ckpt_started.
+ *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
+ *	   arithmetic here in case counters wrap around.)  Now you know a
+ *	   checkpoint has started and completed, but not whether it was
+ *	   successful.
+ *	6. If ckpt_failed is different from the originally saved value,
+ *	   assume request failed; otherwise it was definitely successful.
+ *
+ * ckpt_flags holds the OR of the checkpoint request flags sent by all
+ * requesting backends since the last checkpoint start.  The flags are
+ * chosen so that OR'ing is the correct way to combine multiple requests.
+ *
+ * num_backend_writes is used to count the number of buffer writes performed
+ * by non-bgwriter processes.  This counter should be wide enough that it
+ * can't overflow during a single bgwriter cycle.  num_backend_fsync
+ * counts the subset of those writes that also had to do their own fsync,
+ * because the checkpointer failed to absorb their request.
+ *
+ * The requests array holds fsync requests sent by backends and not yet
+ * absorbed by the checkpointer.
+ *
+ * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * the requests fields are protected by BgWriterCommLock.
+ *----------
+ */
+typedef struct
+{
+	RelFileNodeBackend rnode;
+	ForkNumber	forknum;
+	BlockNumber segno;			/* see md.c for special values */
+	/* might add a real request-type field later; not needed yet */
+} BgWriterRequest;
+
+typedef struct
+{
+	pid_t		checkpointer_pid;	/* PID of checkpointer (0 if not started) */
+
+	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
+
+	int			ckpt_started;	/* advances when checkpoint starts */
+	int			ckpt_done;		/* advances when checkpoint done */
+	int			ckpt_failed;	/* advances when checkpoint fails */
+
+	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
+
+	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
+	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
+
+	int			num_requests;	/* current # of requests */
+	int			max_requests;	/* allocated array size */
+	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
+} BgWriterShmemStruct;
+
+static BgWriterShmemStruct *BgWriterShmem;
+
+/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
+#define WRITES_PER_ABSORB		1000
+
+/*
+ * GUC parameters
+ */
+int			CheckPointTimeout = 300;
+int			CheckPointWarning = 30;
+double		CheckPointCompletionTarget = 0.5;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t checkpoint_requested = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/*
+ * Private state
+ */
+static bool am_checkpointer = false;
+
+static bool ckpt_active = false;
+
+/* these values are valid when ckpt_active is true: */
+static pg_time_t ckpt_start_time;
+static XLogRecPtr ckpt_start_recptr;
+static double ckpt_cached_elapsed;
+
+static pg_time_t last_checkpoint_time;
+static pg_time_t last_xlog_switch_time;
+
+/* Prototypes for private functions */
+
+static void CheckArchiveTimeout(void);
+static bool IsCheckpointOnSchedule(double progress);
+static bool ImmediateCheckpointRequested(void);
+static bool CompactCheckpointerRequestQueue(void);
+
+/* Signal handlers */
+
+static void chkpt_quickdie(SIGNAL_ARGS);
+static void ChkptSigHupHandler(SIGNAL_ARGS);
+static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for checkpointer process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the basic
+ * execution environment, but not enabled signals yet.
+ */
+void
+CheckpointerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext checkpointer_context;
+
+	BgWriterShmem->checkpointer_pid = MyProcPid;
+	am_checkpointer = true;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.	(checkpointer probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us
+	 *
+	 * Note: we deliberately ignore SIGTERM, because during a standard Unix
+	 * system shutdown cycle, init will SIGTERM all processes at once.	We
+	 * want to wait for the backends to exit, whereupon the postmaster will
+	 * tell us it's okay to shut down (via SIGUSR2).
+	 *
+	 * SIGUSR1 is presently unused; keep it spare in case someday we want this
+	 * process to participate in ProcSignal signalling.
+	 */
+	pqsignal(SIGHUP, ChkptSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, ReqCheckpointHandler);	/* request checkpoint */
+	pqsignal(SIGTERM, SIG_IGN);				/* ignore SIGTERM */
+	pqsignal(SIGQUIT, chkpt_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
+	pqsignal(SIGUSR2, ReqShutdownHandler);		/* request shutdown */
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+	/*
+	 * Initialize so that first time-driven event happens at the correct time.
+	 */
+	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Create a resource owner to keep track of our resources (currently only
+	 * buffer pins).
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Checkpointer");
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  Formerly this code just ran in
+	 * TopMemoryContext, but resetting that would be a really bad idea.
+	 */
+	checkpointer_context = AllocSetContextCreate(TopMemoryContext,
+											 "Checkpointer",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(checkpointer_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().	We don't have very many resources to worry
+		 * about in checkpointer, but we do have LWLocks, buffers, and temp files.
+		 */
+		LWLockReleaseAll();
+		AbortBufferIO();
+		UnlockBuffers();
+		/* buffer pins are released here: */
+		ResourceOwnerRelease(CurrentResourceOwner,
+							 RESOURCE_RELEASE_BEFORE_LOCKS,
+							 false, true);
+		/* we needn't bother with the other ResourceOwnerRelease phases */
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+		AtEOXact_HashTables(false);
+
+		/* Warn any waiting backends that the checkpoint failed. */
+		if (ckpt_active)
+		{
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_failed++;
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			ckpt_active = false;
+		}
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(checkpointer_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(checkpointer_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  A write error is likely
+		 * to be repeated, and we don't want to be filling the error logs as
+		 * fast as we can.
+		 */
+		pg_usleep(1000000L);
+
+		/*
+		 * Close all open files after any error.  This is helpful on Windows,
+		 * where holding deleted files open causes various strange errors.
+		 * It's not clear we need it elsewhere, but shouldn't hurt.
+		 */
+		smgrcloseall();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	/*
+	 * Use the recovery target timeline ID during recovery
+	 */
+	if (RecoveryInProgress())
+		ThisTimeLineID = GetRecoveryTargetTLI();
+
+	/* Do this once before starting the loop, then just at SIGHUP time. */
+	SyncRepUpdateSyncStandbysDefined();
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		bool		do_checkpoint = false;
+		int			flags = 0;
+		pg_time_t	now;
+		int			elapsed_secs;
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (!PostmasterIsAlive())
+			exit(1);
+
+		/*
+		 * Process any requests or signals received recently.
+		 */
+		AbsorbFsyncRequests();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+		if (checkpoint_requested)
+		{
+			checkpoint_requested = false;
+			do_checkpoint = true;
+			BgWriterStats.m_requested_checkpoints++;
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Close down the database */
+			ShutdownXLOG(0, 0);
+			/* Normal exit from the checkpointer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Force a checkpoint if too much time has elapsed since the last one.
+		 * Note that we count a timed checkpoint in stats only when this
+		 * occurs without an external request, but we set the CAUSE_TIME flag
+		 * bit even if there is also an external request.
+		 */
+		now = (pg_time_t) time(NULL);
+		elapsed_secs = now - last_checkpoint_time;
+		if (elapsed_secs >= CheckPointTimeout)
+		{
+			if (!do_checkpoint)
+				BgWriterStats.m_timed_checkpoints++;
+			do_checkpoint = true;
+			flags |= CHECKPOINT_CAUSE_TIME;
+		}
+
+		/*
+		 * Do a checkpoint if requested.
+		 */
+		if (do_checkpoint)
+		{
+			bool		ckpt_performed = false;
+			bool		do_restartpoint;
+
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			/*
+			 * Check if we should perform a checkpoint or a restartpoint. As a
+			 * side-effect, RecoveryInProgress() initializes TimeLineID if
+			 * it's not set yet.
+			 */
+			do_restartpoint = RecoveryInProgress();
+
+			/*
+			 * Atomically fetch the request flags to figure out what kind of a
+			 * checkpoint we should perform, and increase the started-counter
+			 * to acknowledge that we've started a new checkpoint.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			flags |= bgs->ckpt_flags;
+			bgs->ckpt_flags = 0;
+			bgs->ckpt_started++;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			/*
+			 * The end-of-recovery checkpoint is a real checkpoint that's
+			 * performed while we're still in recovery.
+			 */
+			if (flags & CHECKPOINT_END_OF_RECOVERY)
+				do_restartpoint = false;
+
+			/*
+			 * We will warn if (a) too soon since last checkpoint (whatever
+			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
+			 * since the last checkpoint start.  Note in particular that this
+			 * implementation will not generate warnings caused by
+			 * CheckPointTimeout < CheckPointWarning.
+			 */
+			if (!do_restartpoint &&
+				(flags & CHECKPOINT_CAUSE_XLOG) &&
+				elapsed_secs < CheckPointWarning)
+				ereport(LOG,
+						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
+				"checkpoints are occurring too frequently (%d seconds apart)",
+									   elapsed_secs,
+									   elapsed_secs),
+						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+
+			/*
+			 * Initialize checkpointer-private variables used during checkpoint.
+			 */
+			ckpt_active = true;
+			if (!do_restartpoint)
+				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_time = now;
+			ckpt_cached_elapsed = 0;
+
+			/*
+			 * Do the checkpoint.
+			 */
+			if (!do_restartpoint)
+			{
+				CreateCheckPoint(flags);
+				ckpt_performed = true;
+			}
+			else
+				ckpt_performed = CreateRestartPoint(flags);
+
+			/*
+			 * After any checkpoint, close all smgr files.	This is so we
+			 * won't hang onto smgr references to deleted files indefinitely.
+			 */
+			smgrcloseall();
+
+			/*
+			 * Indicate checkpoint completion to any waiting backends.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (ckpt_performed)
+			{
+				/*
+				 * Note we record the checkpoint start time not end time as
+				 * last_checkpoint_time.  This is so that time-driven
+				 * checkpoints happen at a predictable spacing.
+				 */
+				last_checkpoint_time = now;
+			}
+			else
+			{
+				/*
+				 * We were not able to perform the restartpoint (checkpoints
+				 * throw an ERROR in case of error).  Most likely because we
+				 * have not received any new checkpoint WAL records since the
+				 * last restartpoint. Try again in 15 s.
+				 */
+				last_checkpoint_time = now - CheckPointTimeout + 15;
+			}
+
+			ckpt_active = false;
+		}
+
+		/* Check for archive_timeout and switch xlog files if necessary. */
+		CheckArchiveTimeout();
+	}
+}
+
+/*
+ * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
+ *
+ * This will switch to a new WAL file and force an archive file write
+ * if any activity is recorded in the current WAL file, including just
+ * a single checkpoint record.
+ */
+static void
+CheckArchiveTimeout(void)
+{
+	pg_time_t	now;
+	pg_time_t	last_time;
+
+	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
+		return;
+
+	now = (pg_time_t) time(NULL);
+
+	/* First we do a quick check using possibly-stale local state. */
+	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
+		return;
+
+	/*
+	 * Update local state ... note that last_xlog_switch_time is the last time
+	 * a switch was performed *or requested*.
+	 */
+	last_time = GetLastSegSwitchTime();
+
+	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
+
+	/* Now we can do the real check */
+	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
+	{
+		XLogRecPtr	switchpoint;
+
+		/* OK, it's time to switch */
+		switchpoint = RequestXLogSwitch();
+
+		/*
+		 * If the returned pointer points exactly to a segment boundary,
+		 * assume nothing happened.
+		 */
+		if ((switchpoint.xrecoff % XLogSegSize) != 0)
+			ereport(DEBUG1,
+				(errmsg("transaction log switch forced (archive_timeout=%d)",
+						XLogArchiveTimeout)));
+
+		/*
+		 * Update state in any case, so we don't retry constantly when the
+		 * system is idle.
+		 */
+		last_xlog_switch_time = now;
+	}
+}
+
+/*
+ * Returns true if an immediate checkpoint request is pending.	(Note that
+ * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
+ * there is one pending behind it.)
+ */
+static bool
+ImmediateCheckpointRequested(void)
+{
+	if (checkpoint_requested)
+	{
+		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+		/*
+		 * We don't need to acquire the ckpt_lck in this case because we're
+		 * only looking at a single flag bit.
+		 */
+		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * CheckpointWriteDelay -- control rate of checkpoint
+ *
+ * This function is called after each page write performed by BufferSync().
+ * It is responsible for throttling BufferSync()'s write rate to hit
+ * checkpoint_completion_target.
+ *
+ * The checkpoint request flags should be passed in; currently the only one
+ * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
+ *
+ * 'progress' is an estimate of how much of the work has been done, as a
+ * fraction between 0.0 meaning none, and 1.0 meaning all done.
+ */
+void
+CheckpointWriteDelay(int flags, double progress)
+{
+	static int	absorb_counter = WRITES_PER_ABSORB;
+
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * Perform the usual duties and take a nap, unless we're behind
+	 * schedule, in which case we just try to catch up as quickly as possible.
+	 */
+	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!shutdown_requested &&
+		!ImmediateCheckpointRequested() &&
+		IsCheckpointOnSchedule(progress))
+	{
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+
+		CheckArchiveTimeout();
+
+		/*
+		 * Checkpoint sleep used to be connected to bgwriter_delay at 200ms.
+		 * That resulted in more frequent wakeups if not much work to do.
+		 * Checkpointer and bgwriter are no longer related so take the Big Sleep.
+		 */
+		pg_usleep(500000L);
+	}
+	else if (--absorb_counter <= 0)
+	{
+		/*
+		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+		 * operations even when we don't sleep, to prevent overflow of the
+		 * fsync request queue.
+		 */
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+	}
+}
+
+/*
+ * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+ *		 in time?
+ *
+ * Compares the current progress against the time/segments elapsed since last
+ * checkpoint, and returns true if the progress we've made this far is greater
+ * than the elapsed time/segments.
+ */
+static bool
+IsCheckpointOnSchedule(double progress)
+{
+	XLogRecPtr	recptr;
+	struct timeval now;
+	double		elapsed_xlogs,
+				elapsed_time;
+
+	Assert(ckpt_active);
+
+	/* Scale progress according to checkpoint_completion_target. */
+	progress *= CheckPointCompletionTarget;
+
+	/*
+	 * Check against the cached value first. Only do the more expensive
+	 * calculations once we reach the target previously calculated. Since
+	 * neither time nor WAL insert pointer moves backwards, a freshly
+	 * calculated value can only be greater than or equal to the cached value.
+	 */
+	if (progress < ckpt_cached_elapsed)
+		return false;
+
+	/*
+	 * Check progress against WAL segments written and checkpoint_segments.
+	 *
+	 * We compare the current WAL insert location against the location
+	 * computed before calling CreateCheckPoint. The code in XLogInsert that
+	 * actually triggers a checkpoint when checkpoint_segments is exceeded
+	 * compares against RedoRecptr, so this is not completely accurate.
+	 * However, it's good enough for our purposes, we're only calculating an
+	 * estimate anyway.
+	 */
+	if (!RecoveryInProgress())
+	{
+		recptr = GetInsertRecPtr();
+		elapsed_xlogs =
+			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+			CheckPointSegments;
+
+		if (progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+			return false;
+		}
+	}
+
+	/*
+	 * Check progress against time elapsed and checkpoint_timeout.
+	 */
+	gettimeofday(&now, NULL);
+	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
+					now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+	if (progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+		return false;
+	}
+
+	/* It looks like we're on schedule. */
+	return true;
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * chkpt_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+chkpt_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).	This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ChkptSigHupHandler(SIGNAL_ARGS)
+{
+	got_SIGHUP = true;
+}
+
+/* SIGINT: set flag to run a normal checkpoint right away */
+static void
+ReqCheckpointHandler(SIGNAL_ARGS)
+{
+	checkpoint_requested = true;
+}
+
+/* SIGUSR2: set flag to run a shutdown checkpoint and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	shutdown_requested = true;
+}
+
+
+/* --------------------------------
+ *		communication with backends
+ * --------------------------------
+ */
+
+/*
+ * BgWriterShmemSize
+ *		Compute space needed for bgwriter-related shared memory
+ */
+Size
+BgWriterShmemSize(void)
+{
+	Size		size;
+
+	/*
+	 * Currently, the size of the requests[] array is arbitrarily set equal to
+	 * NBuffers.  This may prove too large or small ...
+	 */
+	size = offsetof(BgWriterShmemStruct, requests);
+	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
+
+	return size;
+}
+
+/*
+ * BgWriterShmemInit
+ *		Allocate and initialize bgwriter-related shared memory
+ */
+void
+BgWriterShmemInit(void)
+{
+	bool		found;
+
+	BgWriterShmem = (BgWriterShmemStruct *)
+		ShmemInitStruct("Background Writer Data",
+						BgWriterShmemSize(),
+						&found);
+
+	if (!found)
+	{
+		/* First time through, so initialize */
+		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
+		SpinLockInit(&BgWriterShmem->ckpt_lck);
+		BgWriterShmem->max_requests = NBuffers;
+	}
+}
+
+/*
+ * RequestCheckpoint
+ *		Called in backend processes to request a checkpoint
+ *
+ * flags is a bitwise OR of the following:
+ *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
+ *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
+ *		ignoring checkpoint_completion_target parameter.
+ *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
+ *		CHECKPOINT_END_OF_RECOVERY).
+ *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
+ *		just signal checkpointer to do it, and return).
+ *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
+ *		(This affects logging, and in particular enables CheckPointWarning.)
+ */
+void
+RequestCheckpoint(int flags)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+	int			ntries;
+	int			old_failed,
+				old_started;
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		/*
+		 * There's no point in doing slow checkpoints in a standalone backend,
+		 * because there's no other backends the checkpoint could disrupt.
+		 */
+		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
+
+		/*
+		 * After any checkpoint, close all smgr files.	This is so we won't
+		 * hang onto smgr references to deleted files indefinitely.
+		 */
+		smgrcloseall();
+
+		return;
+	}
+
+	/*
+	 * Atomically set the request flags, and take a snapshot of the counters.
+	 * When we see ckpt_started > old_started, we know the flags we set here
+	 * have been seen by checkpointer.
+	 *
+	 * Note that we OR the flags with any existing flags, to avoid overriding
+	 * a "stronger" request by another backend.  The flag senses must be
+	 * chosen to make this work!
+	 */
+	SpinLockAcquire(&bgs->ckpt_lck);
+
+	old_failed = bgs->ckpt_failed;
+	old_started = bgs->ckpt_started;
+	bgs->ckpt_flags |= flags;
+
+	SpinLockRelease(&bgs->ckpt_lck);
+
+	/*
+	 * Send signal to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, or is in process of restarting, so we
+	 * will retry a few times if needed.  Also, if not told to wait for the
+	 * checkpoint to occur, we consider failure to send the signal to be
+	 * nonfatal and merely LOG it.
+	 */
+	for (ntries = 0;; ntries++)
+	{
+		if (BgWriterShmem->checkpointer_pid == 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+				"could not request checkpoint because checkpointer not running");
+				break;
+			}
+		}
+		else if (kill(BgWriterShmem->checkpointer_pid, SIGINT) != 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+					 "could not signal for checkpoint: %m");
+				break;
+			}
+		}
+		else
+			break;				/* signal sent successfully */
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
+	}
+
+	/*
+	 * If requested, wait for completion.  We detect completion according to
+	 * the algorithm given above.
+	 */
+	if (flags & CHECKPOINT_WAIT)
+	{
+		int			new_started,
+					new_failed;
+
+		/* Wait for a new checkpoint to start. */
+		for (;;)
+		{
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_started = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_started != old_started)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		/*
+		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
+		 */
+		for (;;)
+		{
+			int			new_done;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_done = bgs->ckpt_done;
+			new_failed = bgs->ckpt_failed;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_done - new_started >= 0)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		if (new_failed != old_failed)
+			ereport(ERROR,
+					(errmsg("checkpoint request failed"),
+					 errhint("Consult recent messages in the server log for details.")));
+	}
+}
+
+/*
+ * ForwardFsyncRequest
+ *		Forward a file-fsync request from a backend to the checkpointer
+ *
+ * Whenever a backend is compelled to write directly to a relation
+ * (which should be seldom, if the bgwriter is getting its job done),
+ * the backend calls this routine to pass over knowledge that the relation
+ * is dirty and must be fsync'd before next checkpoint.  We also use this
+ * opportunity to count such writes for statistical purposes.
+ *
+ * segno specifies which segment (not block!) of the relation needs to be
+ * fsync'd.  (Since the valid range is much less than BlockNumber, we can
+ * use high values for special flags; that's all internal to md.c, which
+ * see for details.)
+ *
+ * To avoid holding the lock for longer than necessary, we normally write
+ * to the requests[] queue without checking for duplicates.  The checkpointer
+ * will have to eliminate dups internally anyway.  However, if we discover
+ * that the queue is full, we make a pass over the entire queue to compact
+ * it.	This is somewhat expensive, but the alternative is for the backend
+ * to perform its own fsync, which is far more expensive in practice.  It
+ * is theoretically possible a backend fsync might still be necessary, if
+ * the queue is full and contains no duplicate entries.  In that case, we
+ * let the backend know by returning false.
+ */
+bool
+ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
+					BlockNumber segno)
+{
+	BgWriterRequest *request;
+
+	if (!IsUnderPostmaster)
+		return false;			/* probably shouldn't even get here */
+
+	if (am_checkpointer)
+		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Count all backend writes regardless of whether they fit in the queue */
+	BgWriterShmem->num_backend_writes++;
+
+	/*
+	 * If the checkpointer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.	But before
+	 * forcing that to happen, we can try to compact the checkpointer
+	 * request queue.
+	 */
+	if (BgWriterShmem->checkpointer_pid == 0 ||
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		 && !CompactCheckpointerRequestQueue()))
+	{
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync
+		 */
+		BgWriterShmem->num_backend_fsync++;
+		LWLockRelease(BgWriterCommLock);
+		return false;
+	}
+	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
+	request->rnode = rnode;
+	request->forknum = forknum;
+	request->segno = segno;
+	LWLockRelease(BgWriterCommLock);
+	return true;
+}
+
+/*
+ * CompactCheckpointerRequestQueue
+ *		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Although a full fsync request queue is not common, it can lead to severe
+ * performance problems when it does happen.  So far, this situation has
+ * only been observed to occur when the system is under heavy write load,
+ * and especially during the "sync" phase of a checkpoint.	Without this
+ * logic, each backend begins doing an fsync for every block written, which
+ * gets very expensive and can slow down the whole system.
+ *
+ * Trying to do this every time the queue is full could lose if there
+ * aren't any removable entries.  But that should be vanishingly rare in
+ * practice: there's one queue entry per shared buffer.
+ */
+static bool
+CompactCheckpointerRequestQueue(void)
+{
+	struct BgWriterSlotMapping
+	{
+		BgWriterRequest request;
+		int			slot;
+	};
+
+	int			n,
+				preserve_count;
+	int			num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactCheckpointerRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/*
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing less
+	 * copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
+	 * we do it this way.  It would be possible to be even smarter if we made
+	 * the code below understand the specific semantics of such requests (it
+	 * could blow away preceding entries that would end up being canceled
+	 * anyhow), but it's not clear that the extra complexity would buy us
+	 * anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool		found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;
+			++num_skipped;
+		}
+		slotmap->slot = n;
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (!num_skipped)
+	{
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* We found some duplicates; remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+	   (errmsg("compacted fsync request queue from %d entries to %d entries",
+			   BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
+ * AbsorbFsyncRequests
+ *		Retrieve queued fsync requests and pass them to local smgr.
+ *
+ * This is exported because it must be called during CreateCheckPoint;
+ * we have to be sure we have accepted all pending requests just before
+ * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
+ * non-checkpointer processes, do nothing if not checkpointer.
+ */
+void
+AbsorbFsyncRequests(void)
+{
+	BgWriterRequest *requests = NULL;
+	BgWriterRequest *request;
+	int			n;
+
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * We have to PANIC if we fail to absorb all the pending requests (eg,
+	 * because our hashtable runs out of memory).  This is because the system
+	 * cannot run safely if we are unable to fsync what we have been told to
+	 * fsync.  Fortunately, the hashtable is so small that the problem is
+	 * quite unlikely to arise in practice.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We try to avoid holding the lock for a long time by copying the request
+	 * array.
+	 */
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Transfer write count into pending pgstats message */
+	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
+	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
+
+	BgWriterShmem->num_backend_writes = 0;
+	BgWriterShmem->num_backend_fsync = 0;
+
+	n = BgWriterShmem->num_requests;
+	if (n > 0)
+	{
+		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
+		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
+	}
+	BgWriterShmem->num_requests = 0;
+
+	LWLockRelease(BgWriterCommLock);
+
+	for (request = requests; n > 0; request++, n--)
+		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+
+	if (requests)
+		pfree(requests);
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a84d97..c8599c2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -206,6 +206,7 @@ bool		restart_after_crash = true;
 /* PIDs of special child processes; 0 when not running */
 static pid_t StartupPID = 0,
 			BgWriterPID = 0,
+			CheckpointerPID = 0,
 			WalWriterPID = 0,
 			WalReceiverPID = 0,
 			AutoVacPID = 0,
@@ -277,7 +278,7 @@ typedef enum
 	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
 	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
-	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
+	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown ckpt */
 	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
@@ -463,6 +464,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()		StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
 
@@ -1015,8 +1017,8 @@ PostmasterMain(int argc, char *argv[])
 	 * CAUTION: when changing this list, check for side-effects on the signal
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
-	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, and
-	 * postmaster/syslogger.c.
+	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
+	 * postmaster/syslogger.c and postmaster/checkpointer.c
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1353,10 +1355,14 @@ ServerLoop(void)
 		 * state that prevents it, start one.  It doesn't matter if this
 		 * fails, we'll just try again later.
 		 */
-		if (BgWriterPID == 0 &&
-			(pmState == PM_RUN || pmState == PM_RECOVERY ||
-			 pmState == PM_HOT_STANDBY))
-			BgWriterPID = StartBackgroundWriter();
+		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
+			 pmState == PM_HOT_STANDBY)
+		{
+			if (BgWriterPID == 0)
+				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
+		}
 
 		/*
 		 * Likewise, if we have lost the walwriter process, try to start a new
@@ -2034,6 +2040,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
 			signal_child(BgWriterPID, SIGHUP);
+		if (CheckpointerPID != 0)
+			signal_child(CheckpointerPID, SIGHUP);
 		if (WalWriterPID != 0)
 			signal_child(WalWriterPID, SIGHUP);
 		if (WalReceiverPID != 0)
@@ -2148,7 +2156,7 @@ pmdie(SIGNAL_ARGS)
 				signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
-				/* only bgwriter is active in this state */
+				/* only checkpointer is active in this state */
 				pmState = PM_WAIT_BACKENDS;
 			}
 			else if (pmState == PM_RUN ||
@@ -2193,6 +2201,8 @@ pmdie(SIGNAL_ARGS)
 				signal_child(StartupPID, SIGQUIT);
 			if (BgWriterPID != 0)
 				signal_child(BgWriterPID, SIGQUIT);
+			if (CheckpointerPID != 0)
+				signal_child(CheckpointerPID, SIGQUIT);
 			if (WalWriterPID != 0)
 				signal_child(WalWriterPID, SIGQUIT);
 			if (WalReceiverPID != 0)
@@ -2323,12 +2333,14 @@ reaper(SIGNAL_ARGS)
 			}
 
 			/*
-			 * Crank up the background writer, if we didn't do that already
+			 * Crank up background tasks, if we didn't do that already
 			 * when we entered consistent recovery state.  It doesn't matter
 			 * if this fails, we'll just try again later.
 			 */
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2356,10 +2368,22 @@ reaper(SIGNAL_ARGS)
 		if (pid == BgWriterPID)
 		{
 			BgWriterPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background writer process"));
+			continue;
+		}
+
+		/*
+		 * Was it the checkpointer?
+		 */
+		if (pid == CheckpointerPID)
+		{
+			CheckpointerPID = 0;
 			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
-				 * OK, we saw normal exit of the bgwriter after it's been told
+				 * OK, we saw normal exit of the checkpointer after it's been told
 				 * to shut down.  We expect that it wrote a shutdown
 				 * checkpoint.	(If for some reason it didn't, recovery will
 				 * occur on next postmaster start.)
@@ -2396,11 +2420,11 @@ reaper(SIGNAL_ARGS)
 			else
 			{
 				/*
-				 * Any unexpected exit of the bgwriter (including FATAL exit)
+				 * Any unexpected exit of the checkpointer (including FATAL exit)
 				 * is treated as a crash.
 				 */
 				HandleChildCrash(pid, exitstatus,
-								 _("background writer process"));
+								 _("checkpointer process"));
 			}
 
 			continue;
@@ -2584,8 +2608,8 @@ CleanupBackend(int pid,
 }
 
 /*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, walwriter,
- * or autovacuum.
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter or autovacuum.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -2678,6 +2702,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the checkpointer too */
+	if (pid == CheckpointerPID)
+		CheckpointerPID = 0;
+	else if (CheckpointerPID != 0 && !FatalError)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) CheckpointerPID)));
+		signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/* Take care of the walwriter too */
 	if (pid == WalWriterPID)
 		WalWriterPID = 0;
@@ -2857,9 +2893,10 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
-		 * (including autovac workers) and no walwriter or autovac launcher.
-		 * If we are doing crash recovery then we expect the bgwriter to exit
-		 * too, otherwise not.	The archiver, stats, and syslogger processes
+		 * (including autovac workers) and no walwriter, autovac launcher
+		 * or bgwriter.  If we are doing crash recovery then we expect the
+		 * checkpointer to exit as well, otherwise not.
+		 * The archiver, stats, and syslogger processes
 		 * are disregarded since they are not connected to shared memory; we
 		 * also disregard dead_end children here. Walsenders are also
 		 * disregarded, they will be terminated later after writing the
@@ -2868,7 +2905,8 @@ PostmasterStateMachine(void)
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
 			StartupPID == 0 &&
 			WalReceiverPID == 0 &&
-			(BgWriterPID == 0 || !FatalError) &&
+			BgWriterPID == 0 &&
+			(CheckpointerPID == 0 || !FatalError) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
@@ -2890,22 +2928,22 @@ PostmasterStateMachine(void)
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
 				 * the regular children are gone, and it's time to tell the
-				 * bgwriter to do a shutdown checkpoint.
+				 * checkpointer to do a shutdown checkpoint.
 				 */
 				Assert(Shutdown > NoShutdown);
-				/* Start the bgwriter if not running */
-				if (BgWriterPID == 0)
-					BgWriterPID = StartBackgroundWriter();
+				/* Start the checkpointer if not running */
+				if (CheckpointerPID == 0)
+					CheckpointerPID = StartCheckpointer();
 				/* And tell it to shut down */
-				if (BgWriterPID != 0)
+				if (CheckpointerPID != 0)
 				{
-					signal_child(BgWriterPID, SIGUSR2);
+					signal_child(CheckpointerPID, SIGUSR2);
 					pmState = PM_SHUTDOWN;
 				}
 				else
 				{
 					/*
-					 * If we failed to fork a bgwriter, just shut down. Any
+					 * If we failed to fork a checkpointer, just shut down. Any
 					 * required cleanup will happen at next restart. We set
 					 * FatalError so that an "abnormal shutdown" message gets
 					 * logged when we exit.
@@ -2964,6 +3002,7 @@ PostmasterStateMachine(void)
 			Assert(StartupPID == 0);
 			Assert(WalReceiverPID == 0);
 			Assert(BgWriterPID == 0);
+			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
 			/* syslogger is not considered here */
@@ -4143,6 +4182,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		 */
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(CheckpointerPID == 0);
+		CheckpointerPID = StartCheckpointer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -4429,6 +4470,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 				   (errmsg("could not fork background writer process: %m")));
 				break;
+			case CheckpointerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork checkpointer process: %m")));
+				break;
 			case WalWriterProcess:
 				ereport(LOG,
 						(errmsg("could not fork WAL writer process: %m")));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8647edd..184e820 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1278,11 +1278,9 @@ BufferSync(int flags)
 					break;
 
 				/*
-				 * Perform normal bgwriter duties and sleep to throttle our
-				 * I/O rate.
+				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags,
-									 (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
 			}
 		}
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3015885..a761369 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -38,7 +38,7 @@
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
- * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
  * See comments there before making changes here.
  */
@@ -77,7 +77,7 @@
  *	Inactive segments are those that once contained data but are currently
  *	not needed because of an mdtruncate() operation.  The reason for leaving
  *	them present at size zero, rather than unlinking them, is that other
- *	backends and/or the bgwriter might be holding open file references to
+ *	backends and/or the checkpointer might be holding open file references to
  *	such segments.	If the relation expands again after mdtruncate(), such
  *	that a deactivated segment becomes active again, it is important that
  *	such file references still be valid --- else data might get written
@@ -111,7 +111,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
 
 
 /*
- * In some contexts (currently, standalone backends and the bgwriter process)
+ * In some contexts (currently, standalone backends and the checkpointer process)
  * we keep track of pending fsync operations: we need to remember all relation
  * segments that have been written since the last checkpoint, so that we can
  * fsync them down to disk before completing the next checkpoint.  This hash
@@ -123,7 +123,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
  * a hash table, because we don't expect there to be any duplicate requests.
  *
  * (Regular backends do not track pending operations locally, but forward
- * them to the bgwriter.)
+ * them to the checkpointer.)
  */
 typedef struct
 {
@@ -194,7 +194,7 @@ mdinit(void)
 	 * Create pending-operations hashtable if we need it.  Currently, we need
 	 * it if we are standalone (not under a postmaster) OR if we are a
 	 * bootstrap-mode subprocess of a postmaster (that is, a startup or
-	 * bgwriter process).
+	 * checkpointer process).
 	 */
 	if (!IsUnderPostmaster || IsBootstrapProcessingMode())
 	{
@@ -214,10 +214,10 @@ mdinit(void)
 }
 
 /*
- * In archive recovery, we rely on bgwriter to do fsyncs, but we will have
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
  * already created the pendingOpsTable during initialization of the startup
  * process.  Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to bgwriter.
+ * subsequent requests will be forwarded to checkpointer.
  */
 void
 SetForwardFsyncRequests(void)
@@ -765,9 +765,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: this assumption could only be wrong if another backend has
 	 * truncated the relation.	We rely on higher code levels to handle that
 	 * scenario by closing and re-opening the md fd, which is handled via
-	 * relcache flush.	(Since the bgwriter doesn't participate in relcache
+	 * relcache flush.	(Since the checkpointer doesn't participate in relcache
 	 * flush, it could have segment chain entries for inactive segments;
-	 * that's OK because the bgwriter never needs to compute relation size.)
+	 * that's OK because the checkpointer never needs to compute relation size.)
 	 */
 	while (v->mdfd_chain != NULL)
 	{
@@ -957,7 +957,7 @@ mdsync(void)
 		elog(ERROR, "cannot sync without a pendingOpsTable");
 
 	/*
-	 * If we are in the bgwriter, the sync had better include all fsync
+	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.	The tightest
 	 * race condition that could occur is that a buffer that must be written
 	 * and fsync'd for the checkpoint could have been dumped by a backend just
@@ -1033,7 +1033,7 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
+			 * If in checkpointer, we want to absorb pending requests every so
 			 * often to prevent overflow of the fsync request queue.  It is
 			 * unspecified whether newly-added entries will be visited by
 			 * hash_seq_search, but we don't care since we don't need to
@@ -1070,9 +1070,9 @@ mdsync(void)
 				 * say "but an unreferenced SMgrRelation is still a leak!" Not
 				 * really, because the only case in which a checkpoint is done
 				 * by a process that isn't about to shut down is in the
-				 * bgwriter, and it will periodically do smgrcloseall(). This
+				 * checkpointer, and it will periodically do smgrcloseall(). This
 				 * fact justifies our not closing the reln in the success path
-				 * either, which is a good thing since in non-bgwriter cases
+				 * either, which is a good thing since in non-checkpointer cases
 				 * we couldn't safely do that.)  Furthermore, in many cases
 				 * the relation will have been dirtied through this same smgr
 				 * relation, and so we can save a file open/close cycle.
@@ -1301,7 +1301,7 @@ register_unlink(RelFileNodeBackend rnode)
 	else
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the request
+		 * Notify the checkpointer about it.  If we fail to queue the request
 		 * message, we have to sleep and try again, because we can't simply
 		 * delete the file now.  Ugly, but hopefully won't happen often.
 		 *
@@ -1315,10 +1315,10 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from bgwriter side of fsync request
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
  *
  * We stuff most fsync requests into the local hash table for execution
- * during the bgwriter's next checkpoint.  UNLINK requests go into a
+ * during the checkpointer's next checkpoint.  UNLINK requests go into a
  * separate linked list, however, because they get processed separately.
  *
  * The range of possible segment numbers is way less than the range of
@@ -1460,20 +1460,20 @@ ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
 	else if (IsUnderPostmaster)
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the revoke
+		 * Notify the checkpointer about it.  If we fail to queue the revoke
 		 * message, we have to sleep and try again ... ugly, but hopefully
 		 * won't happen often.
 		 *
 		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
 		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the bgwriter
+		 * which would be bad, so I'm inclined to assume that the checkpointer
 		 * will always empty the queue soon.
 		 */
 		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
 			pg_usleep(10000L);	/* 10 msec seems a good number */
 
 		/*
-		 * Note we don't wait for the bgwriter to actually absorb the revoke
+		 * Note we don't wait for the checkpointer to actually absorb the revoke
 		 * message; see mdsync() for the implications.
 		 */
 	}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 4eaa243..cb43879 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -256,7 +256,7 @@ typedef struct RmgrData
 extern const RmgrData RmgrTable[];
 
 /*
- * Exported to support xlog switching from bgwriter
+ * Exported to support xlog switching from checkpointer
  */
 extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
diff --git a/src/include/bootstrap/bootstrap.h b/src/include/bootstrap/bootstrap.h
index cee9bd1..6153b7a 100644
--- a/src/include/bootstrap/bootstrap.h
+++ b/src/include/bootstrap/bootstrap.h
@@ -22,6 +22,7 @@ typedef enum
 	BootstrapProcess,
 	StartupProcess,
 	BgWriterProcess,
+	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..c05901e 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,6 +23,7 @@ extern int	CheckPointWarning;
 extern double CheckPointCompletionTarget;
 
 extern void BackgroundWriterMain(void);
+extern void CheckpointerMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 46ec625..6e798b1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -190,11 +190,11 @@ extern PROC_HDR *ProcGlobal;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer and WAL writer run during normal operation. Startup
- * process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 3 slots.
+ * Background writer, checkpointer and WAL writer run during normal operation.
+ * Startup process and WAL receiver also consume 2 slots, but WAL writer is
+ * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		3
+#define NUM_AUXILIARY_PROCS		4
 
 
 /* configurable options */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2a27e0b..d5afe01 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -19,7 +19,7 @@
 
 /*
  * Reasons for signalling a Postgres child process (a backend or an auxiliary
- * process, like bgwriter).  We can cope with concurrent signals for different
+ * process, like checkpointer).  We can cope with concurrent signals for different
  * reasons.  However, if the same reason is signaled multiple times in quick
  * succession, the process is likely to observe only one notification of it.
  * This is okay for the present uses.
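
The queue compaction done by CompactCheckpointerRequestQueue in the patch above drops any fsync request that is followed later in the queue by an identical one, and keeps the survivors in their original order. A minimal standalone sketch of that rule follows. It is not the patch's code: the Request struct and both function names are hypothetical, and a simple quadratic scan stands in for the temporary hash table the patch uses.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical, simplified request.  The real BgWriterRequest carries
 * rnode, forknum and segno; plain ints stand in for them here.
 */
typedef struct
{
	int			rel;
	int			fork;
	int			segno;
} Request;

/* Two requests are identical when every field matches. */
static bool
same_request(const Request *a, const Request *b)
{
	return a->rel == b->rel && a->fork == b->fork && a->segno == b->segno;
}

/*
 * Drop every request that is followed later by an identical request,
 * compacting the array in place.  The last copy of each request survives,
 * and survivors keep their relative order.  Returns the new entry count.
 *
 * Assumption: an O(n^2) scan replaces the hash table used in the patch.
 */
static size_t
compact_requests(Request *reqs, size_t n)
{
	size_t		keep = 0;

	for (size_t i = 0; i < n; i++)
	{
		bool		dup_later = false;

		for (size_t j = i + 1; j < n; j++)
		{
			if (same_request(&reqs[i], &reqs[j]))
			{
				dup_later = true;
				break;
			}
		}

		/* keep <= i, so entries after i are still unmodified */
		if (!dup_later)
			reqs[keep++] = reqs[i];
	}
	return keep;
}
```

Keeping the last copy rather than the first matters for the reason given in the patch comment: an intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request would otherwise cancel the surviving entry.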

#20

Dickson S. Guedes

listas@guedesoft.net

over 14 years ago

In reply to: Simon Riggs (#19)

Re: Separating bgwriter and checkpointer

2011/10/2 Simon Riggs <simon@2ndquadrant.com>:

On Thu, Sep 15, 2011 at 11:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Current patch has a bug at shutdown I've not located yet, but seems
likely is a simple error. That is mainly because for personal reasons
I've not been able to work on the patch recently. I expect to be able
to fix that later in the CF.

Full patch, with bug fixed. (v2)

I'm now free to take review comments and make changes.

Hi Simon,

I'm trying your patch. It applied cleanly to master and compiled
OK, but since I started postgres I'm seeing 99% CPU usage:

guedes@betelgeuse:/srv/postgres/bgwriter_split$ ps -ef | grep postgres
guedes 14878 1 0 19:37 pts/0 00:00:00 /srv/postgres/bgwriter_split/bin/postgres -D data
guedes 14880 14878 0 19:37 ? 00:00:00 postgres: writer process
guedes 14881 14878 99 19:37 ? 00:03:07 postgres: checkpointer process
guedes 14882 14878 0 19:37 ? 00:00:00 postgres: wal writer process
guedes 14883 14878 0 19:37 ? 00:00:00 postgres: autovacuum launcher process
guedes 14884 14878 0 19:37 ? 00:00:00 postgres: stats collector process

Best regards.
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

#21

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Dickson S. Guedes (#20)

1 attachment(s)

Re: Separating bgwriter and checkpointer

On Sun, Oct 2, 2011 at 11:45 PM, Dickson S. Guedes <listas@guedesoft.net> wrote:

I'm trying your patch. It applied cleanly to master and compiled
OK, but since I started postgres I'm seeing 99% CPU usage:

Oh, thanks. I see what happened. I was toying with the idea of going
straight to a WaitLatch implementation for the loop but decided to
leave it out for a later patch, and then skipped the sleep as well.
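
The failure mode described here, a polling loop left without its sleep, can be illustrated with a small sketch. This is not code from the patch: the names nap_ms and poll_loop are hypothetical, the flag merely stands in for one set by a signal handler, and POSIX nanosleep() is assumed to be available.

```c
#define _POSIX_C_SOURCE 199309L

#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a flag set by a signal handler. */
static volatile bool shutdown_requested = false;

/* Sleep for the given number of milliseconds using POSIX nanosleep(). */
static void
nap_ms(long ms)
{
	struct timespec ts;

	ts.tv_sec = ms / 1000;
	ts.tv_nsec = (ms % 1000) * 1000000L;
	nanosleep(&ts, NULL);
}

/*
 * Run at most max_cycles iterations of a polling loop, stopping early if
 * shutdown has been requested.  Returns the number of cycles executed.
 */
static int
poll_loop(int max_cycles, long delay_ms)
{
	int			cycles = 0;

	while (!shutdown_requested && cycles < max_cycles)
	{
		/* ... periodic work would go here ... */
		cycles++;

		/*
		 * Without this nap the loop never blocks, and the process spins
		 * on one CPU core until it is told to stop.
		 */
		nap_ms(delay_ms);
	}
	return cycles;
}
```

A latch-based wait, which the message says is deferred to a later patch, would replace the fixed nap with a wait that blocks until the process is woken or a timeout expires.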

New version attached.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

bgwriter_split.v3.patch (application/octet-stream)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4fe08df..f9b839c 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -315,6 +315,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case BgWriterProcess:
 				statmsg = "writer process";
 				break;
+			case CheckpointerProcess:
+				statmsg = "checkpointer process";
+				break;
 			case WalWriterProcess:
 				statmsg = "wal writer process";
 				break;
@@ -415,6 +418,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			BackgroundWriterMain();
 			proc_exit(1);		/* should never return */
 
+		case CheckpointerProcess:
+			/* don't set signals, checkpointer has its own agenda */
+			CheckpointerMain();
+			proc_exit(1);		/* should never return */
+
 		case WalWriterProcess:
 			/* don't set signals, walwriter has its own agenda */
 			InitXLOGAccess();
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 0767e97..e7414d2 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
-	syslogger.o walwriter.o
+	syslogger.o walwriter.o checkpointer.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 2d0b639..e0f3167 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -10,20 +10,13 @@
  * still empowered to issue writes if the bgwriter fails to maintain enough
  * clean shared buffers.
  *
- * The bgwriter is also charged with handling all checkpoints.	It will
- * automatically dispatch a checkpoint after a certain amount of time has
- * elapsed since the last one, and it can be signaled to perform requested
- * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
- * so many WAL segments is implemented by having backends signal the bgwriter
- * when they fill WAL segments; the bgwriter itself doesn't watch for the
- * condition.)
+ * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
  *
  * The bgwriter is started by the postmaster as soon as the startup subprocess
  * finishes, or as soon as recovery begins if we are doing archive recovery.
  * It remains alive until the postmaster commands it to terminate.
- * Normal termination is by SIGUSR2, which instructs the bgwriter to execute
- * a shutdown checkpoint and then exit(0).	(All backends must be stopped
- * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
+ * Normal termination is by SIGUSR2, which instructs the bgwriter to exit(0).
+ * Emergency termination is by SIGQUIT; like any
  * backend, the bgwriter will simply abort and exit on SIGQUIT.
  *
  * If the bgwriter exits unexpectedly, the postmaster treats that the same
@@ -54,7 +47,6 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
-#include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -67,96 +59,15 @@
 #include "utils/resowner.h"
 
 
-/*----------
- * Shared memory area for communication between bgwriter and backends
- *
- * The ckpt counters allow backends to watch for completion of a checkpoint
- * request they send.  Here's how it works:
- *	* At start of a checkpoint, bgwriter reads (and clears) the request flags
- *	  and increments ckpt_started, while holding ckpt_lck.
- *	* On completion of a checkpoint, bgwriter sets ckpt_done to
- *	  equal ckpt_started.
- *	* On failure of a checkpoint, bgwriter increments ckpt_failed
- *	  and sets ckpt_done to equal ckpt_started.
- *
- * The algorithm for backends is:
- *	1. Record current values of ckpt_failed and ckpt_started, and
- *	   set request flags, while holding ckpt_lck.
- *	2. Send signal to request checkpoint.
- *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
- *	   begun since you started this algorithm (although *not* that it was
- *	   specifically initiated by your signal), and that it is using your flags.
- *	4. Record new value of ckpt_started.
- *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
- *	   arithmetic here in case counters wrap around.)  Now you know a
- *	   checkpoint has started and completed, but not whether it was
- *	   successful.
- *	6. If ckpt_failed is different from the originally saved value,
- *	   assume request failed; otherwise it was definitely successful.
- *
- * ckpt_flags holds the OR of the checkpoint request flags sent by all
- * requesting backends since the last checkpoint start.  The flags are
- * chosen so that OR'ing is the correct way to combine multiple requests.
- *
- * num_backend_writes is used to count the number of buffer writes performed
- * by non-bgwriter processes.  This counter should be wide enough that it
- * can't overflow during a single bgwriter cycle.  num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the background writer failed to absorb their request.
- *
- * The requests array holds fsync requests sent by backends and not yet
- * absorbed by the bgwriter.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by BgWriterCommLock.
- *----------
- */
-typedef struct
-{
-	RelFileNodeBackend rnode;
-	ForkNumber	forknum;
-	BlockNumber segno;			/* see md.c for special values */
-	/* might add a real request-type field later; not needed yet */
-} BgWriterRequest;
-
-typedef struct
-{
-	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
-
-	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
-
-	int			ckpt_started;	/* advances when checkpoint starts */
-	int			ckpt_done;		/* advances when checkpoint done */
-	int			ckpt_failed;	/* advances when checkpoint fails */
-
-	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
-
-	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
-	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
-
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
-	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
-} BgWriterShmemStruct;
-
-static BgWriterShmemStruct *BgWriterShmem;
-
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
-
 /*
  * GUC parameters
  */
 int			BgWriterDelay = 200;
-int			CheckPointTimeout = 300;
-int			CheckPointWarning = 30;
-double		CheckPointCompletionTarget = 0.5;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
 static volatile sig_atomic_t shutdown_requested = false;
 
 /*
@@ -164,29 +75,14 @@ static volatile sig_atomic_t shutdown_requested = false;
  */
 static bool am_bg_writer = false;
 
-static bool ckpt_active = false;
-
-/* these values are valid when ckpt_active is true: */
-static pg_time_t ckpt_start_time;
-static XLogRecPtr ckpt_start_recptr;
-static double ckpt_cached_elapsed;
-
-static pg_time_t last_checkpoint_time;
-static pg_time_t last_xlog_switch_time;
-
 /* Prototypes for private functions */
 
-static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
-static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
 static void bg_quickdie(SIGNAL_ARGS);
 static void BgSigHupHandler(SIGNAL_ARGS);
-static void ReqCheckpointHandler(SIGNAL_ARGS);
 static void ReqShutdownHandler(SIGNAL_ARGS);
 
 
@@ -202,7 +98,6 @@ BackgroundWriterMain(void)
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext bgwriter_context;
 
-	BgWriterShmem->bgwriter_pid = MyProcPid;
 	am_bg_writer = true;
 
 	/*
@@ -228,8 +123,8 @@ BackgroundWriterMain(void)
 	 * process to participate in ProcSignal signalling.
 	 */
 	pqsignal(SIGHUP, BgSigHupHandler);	/* set flag to read config file */
-	pqsignal(SIGINT, ReqCheckpointHandler);		/* request checkpoint */
-	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
+	pqsignal(SIGINT, SIG_IGN);			/* as of 9.2 no longer requests checkpoint */
+	pqsignal(SIGTERM, SIG_IGN); 		/* ignore SIGTERM */
 	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
@@ -249,11 +144,6 @@ BackgroundWriterMain(void)
 	sigdelset(&BlockSig, SIGQUIT);
 
 	/*
-	 * Initialize so that first time-driven event happens at the correct time.
-	 */
-	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
-
-	/*
 	 * Create a resource owner to keep track of our resources (currently only
 	 * buffer pins).
 	 */
@@ -305,20 +195,6 @@ BackgroundWriterMain(void)
 		AtEOXact_Files();
 		AtEOXact_HashTables(false);
 
-		/* Warn any waiting backends that the checkpoint failed. */
-		if (ckpt_active)
-		{
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_failed++;
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			ckpt_active = false;
-		}
-
 		/*
 		 * Now return to normal top-level context and clear ErrorContext for
 		 * next time.
@@ -361,19 +237,11 @@ BackgroundWriterMain(void)
 	if (RecoveryInProgress())
 		ThisTimeLineID = GetRecoveryTargetTLI();
 
-	/* Do this once before starting the loop, then just at SIGHUP time. */
-	SyncRepUpdateSyncStandbysDefined();
-
 	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		do_checkpoint = false;
-		int			flags = 0;
-		pg_time_t	now;
-		int			elapsed_secs;
-
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
@@ -381,23 +249,11 @@ BackgroundWriterMain(void)
 		if (!PostmasterIsAlive())
 			exit(1);
 
-		/*
-		 * Process any requests or signals received recently.
-		 */
-		AbsorbFsyncRequests();
-
 		if (got_SIGHUP)
 		{
 			got_SIGHUP = false;
 			ProcessConfigFile(PGC_SIGHUP);
 			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-		if (checkpoint_requested)
-		{
-			checkpoint_requested = false;
-			do_checkpoint = true;
-			BgWriterStats.m_requested_checkpoints++;
 		}
 		if (shutdown_requested)
 		{
@@ -406,142 +262,14 @@ BackgroundWriterMain(void)
 			 * control back to the sigsetjmp block above
 			 */
 			ExitOnAnyError = true;
-			/* Close down the database */
-			ShutdownXLOG(0, 0);
 			/* Normal exit from the bgwriter is here */
 			proc_exit(0);		/* done */
 		}
 
 		/*
-		 * Force a checkpoint if too much time has elapsed since the last one.
-		 * Note that we count a timed checkpoint in stats only when this
-		 * occurs without an external request, but we set the CAUSE_TIME flag
-		 * bit even if there is also an external request.
+		 * Do one cycle of dirty-buffer writing.
 		 */
-		now = (pg_time_t) time(NULL);
-		elapsed_secs = now - last_checkpoint_time;
-		if (elapsed_secs >= CheckPointTimeout)
-		{
-			if (!do_checkpoint)
-				BgWriterStats.m_timed_checkpoints++;
-			do_checkpoint = true;
-			flags |= CHECKPOINT_CAUSE_TIME;
-		}
-
-		/*
-		 * Do a checkpoint if requested, otherwise do one cycle of
-		 * dirty-buffer writing.
-		 */
-		if (do_checkpoint)
-		{
-			bool		ckpt_performed = false;
-			bool		do_restartpoint;
-
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			/*
-			 * Check if we should perform a checkpoint or a restartpoint. As a
-			 * side-effect, RecoveryInProgress() initializes TimeLineID if
-			 * it's not set yet.
-			 */
-			do_restartpoint = RecoveryInProgress();
-
-			/*
-			 * Atomically fetch the request flags to figure out what kind of a
-			 * checkpoint we should perform, and increase the started-counter
-			 * to acknowledge that we've started a new checkpoint.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			flags |= bgs->ckpt_flags;
-			bgs->ckpt_flags = 0;
-			bgs->ckpt_started++;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
-			/*
-			 * We will warn if (a) too soon since last checkpoint (whatever
-			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
-			 * since the last checkpoint start.  Note in particular that this
-			 * implementation will not generate warnings caused by
-			 * CheckPointTimeout < CheckPointWarning.
-			 */
-			if (!do_restartpoint &&
-				(flags & CHECKPOINT_CAUSE_XLOG) &&
-				elapsed_secs < CheckPointWarning)
-				ereport(LOG,
-						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
-				"checkpoints are occurring too frequently (%d seconds apart)",
-									   elapsed_secs,
-									   elapsed_secs),
-						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
-
-			/*
-			 * Initialize bgwriter-private variables used during checkpoint.
-			 */
-			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
-			ckpt_start_time = now;
-			ckpt_cached_elapsed = 0;
-
-			/*
-			 * Do the checkpoint.
-			 */
-			if (!do_restartpoint)
-			{
-				CreateCheckPoint(flags);
-				ckpt_performed = true;
-			}
-			else
-				ckpt_performed = CreateRestartPoint(flags);
-
-			/*
-			 * After any checkpoint, close all smgr files.	This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
-			 */
-			smgrcloseall();
-
-			/*
-			 * Indicate checkpoint completion to any waiting backends.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (ckpt_performed)
-			{
-				/*
-				 * Note we record the checkpoint start time not end time as
-				 * last_checkpoint_time.  This is so that time-driven
-				 * checkpoints happen at a predictable spacing.
-				 */
-				last_checkpoint_time = now;
-			}
-			else
-			{
-				/*
-				 * We were not able to perform the restartpoint (checkpoints
-				 * throw an ERROR in case of error).  Most likely because we
-				 * have not received any new checkpoint WAL records since the
-				 * last restartpoint. Try again in 15 s.
-				 */
-				last_checkpoint_time = now - CheckPointTimeout + 15;
-			}
-
-			ckpt_active = false;
-		}
-		else
-			BgBufferSync();
-
-		/* Check for archive_timeout and switch xlog files if necessary. */
-		CheckArchiveTimeout();
+		BgBufferSync();
 
 		/* Nap for the configured time. */
 		BgWriterNap();
@@ -549,61 +277,6 @@ BackgroundWriterMain(void)
 }
 
 /*
- * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
- *
- * This will switch to a new WAL file and force an archive file write
- * if any activity is recorded in the current WAL file, including just
- * a single checkpoint record.
- */
-static void
-CheckArchiveTimeout(void)
-{
-	pg_time_t	now;
-	pg_time_t	last_time;
-
-	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
-		return;
-
-	now = (pg_time_t) time(NULL);
-
-	/* First we do a quick check using possibly-stale local state. */
-	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
-		return;
-
-	/*
-	 * Update local state ... note that last_xlog_switch_time is the last time
-	 * a switch was performed *or requested*.
-	 */
-	last_time = GetLastSegSwitchTime();
-
-	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
-
-	/* Now we can do the real check */
-	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
-	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
-		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
-		 */
-		if ((switchpoint.xrecoff % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
-
-		/*
-		 * Update state in any case, so we don't retry constantly when the
-		 * system is idle.
-		 */
-		last_xlog_switch_time = now;
-	}
-}
-
-/*
  * BgWriterNap -- Nap for the configured time or until a signal is received.
  */
 static void
@@ -624,185 +297,24 @@ BgWriterNap(void)
 	 * respond reasonably promptly when someone signals us, break down the
 	 * sleep into 1-second increments, and check for interrupts after each
 	 * nap.
-	 *
-	 * We absorb pending requests after each short sleep.
 	 */
-	if (bgwriter_lru_maxpages > 0 || ckpt_active)
+	if (bgwriter_lru_maxpages > 0)
 		udelay = BgWriterDelay * 1000L;
-	else if (XLogArchiveTimeout > 0)
-		udelay = 1000000L;		/* One second */
 	else
 		udelay = 10000000L;		/* Ten seconds */
 
 	while (udelay > 999999L)
 	{
-		if (got_SIGHUP || shutdown_requested ||
-		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
+		if (got_SIGHUP || shutdown_requested)
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
 		udelay -= 1000000L;
 	}
 
-	if (!(got_SIGHUP || shutdown_requested ||
-	  (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)))
+	if (!(got_SIGHUP || shutdown_requested))
 		pg_usleep(udelay);
 }
 
-/*
- * Returns true if an immediate checkpoint request is pending.	(Note that
- * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
- * there is one pending behind it.)
- */
-static bool
-ImmediateCheckpointRequested(void)
-{
-	if (checkpoint_requested)
-	{
-		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-		/*
-		 * We don't need to acquire the ckpt_lck in this case because we're
-		 * only looking at a single flag bit.
-		 */
-		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
-			return true;
-	}
-	return false;
-}
-
-/*
- * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
- *
- * This function is called after each page write performed by BufferSync().
- * It is responsible for keeping the bgwriter's normal activities in
- * progress during a long checkpoint, and for throttling BufferSync()'s
- * write rate to hit checkpoint_completion_target.
- *
- * The checkpoint request flags should be passed in; currently the only one
- * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
- *
- * 'progress' is an estimate of how much of the work has been done, as a
- * fraction between 0.0 meaning none, and 1.0 meaning all done.
- */
-void
-CheckpointWriteDelay(int flags, double progress)
-{
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
-	/* Do nothing if checkpoint is being executed by non-bgwriter process */
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * Perform the usual bgwriter duties and take a nap, unless we're behind
-	 * schedule, in which case we just try to catch up as quickly as possible.
-	 */
-	if (!(flags & CHECKPOINT_IMMEDIATE) &&
-		!shutdown_requested &&
-		!ImmediateCheckpointRequested() &&
-		IsCheckpointOnSchedule(progress))
-	{
-		if (got_SIGHUP)
-		{
-			got_SIGHUP = false;
-			ProcessConfigFile(PGC_SIGHUP);
-			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-
-		BgBufferSync();
-		CheckArchiveTimeout();
-		BgWriterNap();
-	}
-	else if (--absorb_counter <= 0)
-	{
-		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
-		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-	}
-}
-
-/*
- * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- *		 in time?
- *
- * Compares the current progress against the time/segments elapsed since last
- * checkpoint, and returns true if the progress we've made this far is greater
- * than the elapsed time/segments.
- */
-static bool
-IsCheckpointOnSchedule(double progress)
-{
-	XLogRecPtr	recptr;
-	struct timeval now;
-	double		elapsed_xlogs,
-				elapsed_time;
-
-	Assert(ckpt_active);
-
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
-
-	/*
-	 * Check against the cached value first. Only do the more expensive
-	 * calculations once we reach the target previously calculated. Since
-	 * neither time or WAL insert pointer moves backwards, a freshly
-	 * calculated value can only be greater than or equal to the cached value.
-	 */
-	if (progress < ckpt_cached_elapsed)
-		return false;
-
-	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
-	 *
-	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
-	 */
-	if (!RecoveryInProgress())
-	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
-	}
-
-	/*
-	 * Check progress against time elapsed and checkpoint_timeout.
-	 */
-	gettimeofday(&now, NULL);
-	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
-					now.tv_usec / 1000000.0) / CheckPointTimeout;
-
-	if (progress < elapsed_time)
-	{
-		ckpt_cached_elapsed = elapsed_time;
-		return false;
-	}
-
-	/* It looks like we're on schedule. */
-	return true;
-}
-
-
 /* --------------------------------
  *		signal handler routines
  * --------------------------------
@@ -847,441 +359,9 @@ BgSigHupHandler(SIGNAL_ARGS)
 	got_SIGHUP = true;
 }
 
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	checkpoint_requested = true;
-}
-
 /* SIGUSR2: set flag to run a shutdown checkpoint and exit */
 static void
 ReqShutdownHandler(SIGNAL_ARGS)
 {
 	shutdown_requested = true;
 }
-
-
-/* --------------------------------
- *		communication with backends
- * --------------------------------
- */
-
-/*
- * BgWriterShmemSize
- *		Compute space needed for bgwriter-related shared memory
- */
-Size
-BgWriterShmemSize(void)
-{
-	Size		size;
-
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
-	size = offsetof(BgWriterShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
-
-	return size;
-}
-
-/*
- * BgWriterShmemInit
- *		Allocate and initialize bgwriter-related shared memory
- */
-void
-BgWriterShmemInit(void)
-{
-	bool		found;
-
-	BgWriterShmem = (BgWriterShmemStruct *)
-		ShmemInitStruct("Background Writer Data",
-						BgWriterShmemSize(),
-						&found);
-
-	if (!found)
-	{
-		/* First time through, so initialize */
-		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
-		SpinLockInit(&BgWriterShmem->ckpt_lck);
-		BgWriterShmem->max_requests = NBuffers;
-	}
-}
-
-/*
- * RequestCheckpoint
- *		Called in backend processes to request a checkpoint
- *
- * flags is a bitwise OR of the following:
- *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
- *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
- *		ignoring checkpoint_completion_target parameter.
- *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
- *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
- *		just signal bgwriter to do it, and return).
- *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
- *		(This affects logging, and in particular enables CheckPointWarning.)
- */
-void
-RequestCheckpoint(int flags)
-{
-	/* use volatile pointer to prevent code rearrangement */
-	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-	int			ntries;
-	int			old_failed,
-				old_started;
-
-	/*
-	 * If in a standalone backend, just do it ourselves.
-	 */
-	if (!IsPostmasterEnvironment)
-	{
-		/*
-		 * There's no point in doing slow checkpoints in a standalone backend,
-		 * because there's no other backends the checkpoint could disrupt.
-		 */
-		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
-
-		/*
-		 * After any checkpoint, close all smgr files.	This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
-
-		return;
-	}
-
-	/*
-	 * Atomically set the request flags, and take a snapshot of the counters.
-	 * When we see ckpt_started > old_started, we know the flags we set here
-	 * have been seen by bgwriter.
-	 *
-	 * Note that we OR the flags with any existing flags, to avoid overriding
-	 * a "stronger" request by another backend.  The flag senses must be
-	 * chosen to make this work!
-	 */
-	SpinLockAcquire(&bgs->ckpt_lck);
-
-	old_failed = bgs->ckpt_failed;
-	old_started = bgs->ckpt_started;
-	bgs->ckpt_flags |= flags;
-
-	SpinLockRelease(&bgs->ckpt_lck);
-
-	/*
-	 * Send signal to request checkpoint.  It's possible that the bgwriter
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  Also, if not told to wait for the checkpoint to
-	 * occur, we consider failure to send the signal to be nonfatal and merely
-	 * LOG it.
-	 */
-	for (ntries = 0;; ntries++)
-	{
-		if (BgWriterShmem->bgwriter_pid == 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-				"could not request checkpoint because bgwriter not running");
-				break;
-			}
-		}
-		else if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
-				break;
-			}
-		}
-		else
-			break;				/* signal sent successfully */
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-	}
-
-	/*
-	 * If requested, wait for completion.  We detect completion according to
-	 * the algorithm given above.
-	 */
-	if (flags & CHECKPOINT_WAIT)
-	{
-		int			new_started,
-					new_failed;
-
-		/* Wait for a new checkpoint to start. */
-		for (;;)
-		{
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_started = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_started != old_started)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		/*
-		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
-		 */
-		for (;;)
-		{
-			int			new_done;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_done = bgs->ckpt_done;
-			new_failed = bgs->ckpt_failed;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_done - new_started >= 0)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		if (new_failed != old_failed)
-			ereport(ERROR,
-					(errmsg("checkpoint request failed"),
-					 errhint("Consult recent messages in the server log for details.")));
-	}
-}
-
-/*
- * ForwardFsyncRequest
- *		Forward a file-fsync request from a backend to the bgwriter
- *
- * Whenever a backend is compelled to write directly to a relation
- * (which should be seldom, if the bgwriter is getting its job done),
- * the backend calls this routine to pass over knowledge that the relation
- * is dirty and must be fsync'd before next checkpoint.  We also use this
- * opportunity to count such writes for statistical purposes.
- *
- * segno specifies which segment (not block!) of the relation needs to be
- * fsync'd.  (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
- *
- * To avoid holding the lock for longer than necessary, we normally write
- * to the requests[] queue without checking for duplicates.  The bgwriter
- * will have to eliminate dups internally anyway.  However, if we discover
- * that the queue is full, we make a pass over the entire queue to compact
- * it.	This is somewhat expensive, but the alternative is for the backend
- * to perform its own fsync, which is far more expensive in practice.  It
- * is theoretically possible a backend fsync might still be necessary, if
- * the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
- */
-bool
-ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
-					BlockNumber segno)
-{
-	BgWriterRequest *request;
-
-	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
-
-	if (am_bg_writer)
-		elog(ERROR, "ForwardFsyncRequest must not be called in bgwriter");
-
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Count all backend writes regardless of if they fit in the queue */
-	BgWriterShmem->num_backend_writes++;
-
-	/*
-	 * If the background writer isn't running or the request queue is full,
-	 * the backend will have to perform its own fsync request.	But before
-	 * forcing that to happen, we can try to compact the background writer
-	 * request queue.
-	 */
-	if (BgWriterShmem->bgwriter_pid == 0 ||
-		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
-		 && !CompactBgwriterRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		BgWriterShmem->num_backend_fsync++;
-		LWLockRelease(BgWriterCommLock);
-		return false;
-	}
-	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-	LWLockRelease(BgWriterCommLock);
-	return true;
-}
-
-/*
- * CompactBgwriterRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.	Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
- *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
- */
-static bool
-CompactBgwriterRequestQueue()
-{
-	struct BgWriterSlotMapping
-	{
-		BgWriterRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold BgWriterCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(BgWriterCommLock));
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(BgWriterRequest);
-	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
-	ctl.hash = tag_hash;
-	htab = hash_create("CompactBgwriterRequestQueue",
-					   BgWriterShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_FUNCTION);
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		BgWriterRequest *request;
-		struct BgWriterSlotMapping *slotmap;
-		bool		found;
-
-		request = &BgWriterShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			skip_slot[slotmap->slot] = true;
-			++num_skipped;
-		}
-		slotmap->slot = n;
-	}
-
-	/* Done with the hash table. */
-	hash_destroy(htab);
-
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
-	{
-		pfree(skip_slot);
-		return false;
-	}
-
-	/* We found some duplicates; remove them. */
-	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		if (skip_slot[n])
-			continue;
-		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
-	}
-	ereport(DEBUG1,
-	   (errmsg("compacted fsync request queue from %d entries to %d entries",
-			   BgWriterShmem->num_requests, preserve_count)));
-	BgWriterShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
-}
-
-/*
- * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
- *
- * This is exported because it must be called during CreateCheckPoint;
- * we have to be sure we have accepted all pending requests just before
- * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
- * non-bgwriter processes, do nothing if not bgwriter.
- */
-void
-AbsorbFsyncRequests(void)
-{
-	BgWriterRequest *requests = NULL;
-	BgWriterRequest *request;
-	int			n;
-
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * We have to PANIC if we fail to absorb all the pending requests (eg,
-	 * because our hashtable runs out of memory).  This is because the system
-	 * cannot run safely if we are unable to fsync what we have been told to
-	 * fsync.  Fortunately, the hashtable is so small that the problem is
-	 * quite unlikely to arise in practice.
-	 */
-	START_CRIT_SECTION();
-
-	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array.
-	 */
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Transfer write count into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
-
-	BgWriterShmem->num_backend_writes = 0;
-	BgWriterShmem->num_backend_fsync = 0;
-
-	n = BgWriterShmem->num_requests;
-	if (n > 0)
-	{
-		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
-		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
-	}
-	BgWriterShmem->num_requests = 0;
-
-	LWLockRelease(BgWriterCommLock);
-
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
-
-	if (requests)
-		pfree(requests);
-
-	END_CRIT_SECTION();
-}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
new file mode 100644
index 0000000..f9e2e4f
--- /dev/null
+++ b/src/backend/postmaster/checkpointer.c
@@ -0,0 +1,1236 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.c
+ *
+ * The checkpointer is new as of Postgres 9.2.  It handles all checkpoints.
+ * Checkpoints are automatically dispatched after a certain amount of time has
+ * elapsed since the last one, and it can be signaled to perform requested
+ * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
+ * so many WAL segments is implemented by having backends signal when they
+ * fill WAL segments; the checkpointer itself doesn't watch for the
+ * condition.)
+ *
+ * The checkpointer is started by the postmaster as soon as the startup subprocess
+ * finishes, or as soon as recovery begins if we are doing archive recovery.
+ * It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGUSR2, which instructs the checkpointer to execute
+ * a shutdown checkpoint and then exit(0).  (All backends must be stopped
+ * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
+ * backend, the checkpointer will simply abort and exit on SIGQUIT.
+ *
+ * If the checkpointer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.  (Even if
+ * shared memory isn't corrupted, we have lost information about which
+ * files need to be fsync'd for the next checkpoint, and so a system
+ * restart needs to be forced.)
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/checkpointer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+/*----------
+ * Shared memory area for communication between checkpointer and backends
+ *
+ * The ckpt counters allow backends to watch for completion of a checkpoint
+ * request they send.  Here's how it works:
+ *	* At start of a checkpoint, checkpointer reads (and clears) the request flags
+ *	  and increments ckpt_started, while holding ckpt_lck.
+ *	* On completion of a checkpoint, checkpointer sets ckpt_done to
+ *	  equal ckpt_started.
+ *	* On failure of a checkpoint, checkpointer increments ckpt_failed
+ *	  and sets ckpt_done to equal ckpt_started.
+ *
+ * The algorithm for backends is:
+ *	1. Record current values of ckpt_failed and ckpt_started, and
+ *	   set request flags, while holding ckpt_lck.
+ *	2. Send signal to request checkpoint.
+ *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
+ *	   begun since you started this algorithm (although *not* that it was
+ *	   specifically initiated by your signal), and that it is using your flags.
+ *	4. Record new value of ckpt_started.
+ *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
+ *	   arithmetic here in case counters wrap around.)  Now you know a
+ *	   checkpoint has started and completed, but not whether it was
+ *	   successful.
+ *	6. If ckpt_failed is different from the originally saved value,
+ *	   assume request failed; otherwise it was definitely successful.
+ *
+ * ckpt_flags holds the OR of the checkpoint request flags sent by all
+ * requesting backends since the last checkpoint start.  The flags are
+ * chosen so that OR'ing is the correct way to combine multiple requests.
+ *
+ * num_backend_writes is used to count the number of buffer writes performed
+ * by non-bgwriter processes.  This counter should be wide enough that it
+ * can't overflow during a single checkpointer cycle.  num_backend_fsync
+ * counts the subset of those writes that also had to do their own fsync,
+ * because the checkpointer failed to absorb their request.
+ *
+ * The requests array holds fsync requests sent by backends and not yet
+ * absorbed by the checkpointer.
+ *
+ * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * the requests fields are protected by BgWriterCommLock.
+ *----------
+ */
+typedef struct
+{
+	RelFileNodeBackend rnode;
+	ForkNumber	forknum;
+	BlockNumber segno;			/* see md.c for special values */
+	/* might add a real request-type field later; not needed yet */
+} BgWriterRequest;
+
+typedef struct
+{
+	pid_t		checkpointer_pid;	/* PID of checkpointer (0 if not started) */
+
+	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
+
+	int			ckpt_started;	/* advances when checkpoint starts */
+	int			ckpt_done;		/* advances when checkpoint done */
+	int			ckpt_failed;	/* advances when checkpoint fails */
+
+	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
+
+	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
+	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
+
+	int			num_requests;	/* current # of requests */
+	int			max_requests;	/* allocated array size */
+	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
+} BgWriterShmemStruct;
+
+static BgWriterShmemStruct *BgWriterShmem;
+
+/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
+#define WRITES_PER_ABSORB		1000
+
+/*
+ * GUC parameters
+ */
+int			CheckPointTimeout = 300;
+int			CheckPointWarning = 30;
+double		CheckPointCompletionTarget = 0.5;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t checkpoint_requested = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/*
+ * Private state
+ */
+static bool am_checkpointer = false;
+
+static bool ckpt_active = false;
+
+/* these values are valid when ckpt_active is true: */
+static pg_time_t ckpt_start_time;
+static XLogRecPtr ckpt_start_recptr;
+static double ckpt_cached_elapsed;
+
+static pg_time_t last_checkpoint_time;
+static pg_time_t last_xlog_switch_time;
+
+/* Prototypes for private functions */
+
+static void CheckArchiveTimeout(void);
+static bool IsCheckpointOnSchedule(double progress);
+static bool ImmediateCheckpointRequested(void);
+static bool CompactCheckpointerRequestQueue(void);
+
+/* Signal handlers */
+
+static void chkpt_quickdie(SIGNAL_ARGS);
+static void ChkptSigHupHandler(SIGNAL_ARGS);
+static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for checkpointer process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the
+ * basic execution environment, but not enabled signals yet.
+ */
+void
+CheckpointerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext checkpointer_context;
+
+	BgWriterShmem->checkpointer_pid = MyProcPid;
+	am_checkpointer = true;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.	(checkpointer probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us
+	 *
+	 * Note: we deliberately ignore SIGTERM, because during a standard Unix
+	 * system shutdown cycle, init will SIGTERM all processes at once.	We
+	 * want to wait for the backends to exit, whereupon the postmaster will
+	 * tell us it's okay to shut down (via SIGUSR2).
+	 *
+	 * SIGUSR1 is presently unused; keep it spare in case someday we want this
+	 * process to participate in ProcSignal signalling.
+	 */
+	pqsignal(SIGHUP, ChkptSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, ReqCheckpointHandler);	/* request checkpoint */
+	pqsignal(SIGTERM, SIG_IGN);				/* ignore SIGTERM */
+	pqsignal(SIGQUIT, chkpt_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
+	pqsignal(SIGUSR2, ReqShutdownHandler);		/* request shutdown */
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+	/*
+	 * Initialize so that first time-driven event happens at the correct time.
+	 */
+	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Create a resource owner to keep track of our resources (currently only
+	 * buffer pins).
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Checkpointer");
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  Formerly this code just ran in
+	 * TopMemoryContext, but resetting that would be a really bad idea.
+	 */
+	checkpointer_context = AllocSetContextCreate(TopMemoryContext,
+											 "Checkpointer",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(checkpointer_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().	We don't have very many resources to worry
+		 * about in checkpointer, but we do have LWLocks, buffers, and temp files.
+		 */
+		LWLockReleaseAll();
+		AbortBufferIO();
+		UnlockBuffers();
+		/* buffer pins are released here: */
+		ResourceOwnerRelease(CurrentResourceOwner,
+							 RESOURCE_RELEASE_BEFORE_LOCKS,
+							 false, true);
+		/* we needn't bother with the other ResourceOwnerRelease phases */
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+		AtEOXact_HashTables(false);
+
+		/* Warn any waiting backends that the checkpoint failed. */
+		if (ckpt_active)
+		{
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_failed++;
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			ckpt_active = false;
+		}
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(checkpointer_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(checkpointer_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  A write error is likely
+		 * to be repeated, and we don't want to be filling the error logs as
+		 * fast as we can.
+		 */
+		pg_usleep(1000000L);
+
+		/*
+		 * Close all open files after any error.  This is helpful on Windows,
+		 * where holding deleted files open causes various strange errors.
+		 * It's not clear we need it elsewhere, but shouldn't hurt.
+		 */
+		smgrcloseall();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	/*
+	 * Use the recovery target timeline ID during recovery
+	 */
+	if (RecoveryInProgress())
+		ThisTimeLineID = GetRecoveryTargetTLI();
+
+	/* Do this once before starting the loop, then just at SIGHUP time. */
+	SyncRepUpdateSyncStandbysDefined();
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		bool		do_checkpoint = false;
+		int			flags = 0;
+		pg_time_t	now;
+		int			elapsed_secs;
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (!PostmasterIsAlive())
+			exit(1);
+
+		/*
+		 * Process any requests or signals received recently.
+		 */
+		AbsorbFsyncRequests();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+		if (checkpoint_requested)
+		{
+			checkpoint_requested = false;
+			do_checkpoint = true;
+			BgWriterStats.m_requested_checkpoints++;
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Close down the database */
+			ShutdownXLOG(0, 0);
+			/* Normal exit from the checkpointer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Force a checkpoint if too much time has elapsed since the last one.
+		 * Note that we count a timed checkpoint in stats only when this
+		 * occurs without an external request, but we set the CAUSE_TIME flag
+		 * bit even if there is also an external request.
+		 */
+		now = (pg_time_t) time(NULL);
+		elapsed_secs = now - last_checkpoint_time;
+		if (elapsed_secs >= CheckPointTimeout)
+		{
+			if (!do_checkpoint)
+				BgWriterStats.m_timed_checkpoints++;
+			do_checkpoint = true;
+			flags |= CHECKPOINT_CAUSE_TIME;
+		}
+
+		/*
+		 * Do a checkpoint if requested.
+		 */
+		if (do_checkpoint)
+		{
+			bool		ckpt_performed = false;
+			bool		do_restartpoint;
+
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			/*
+			 * Check if we should perform a checkpoint or a restartpoint. As a
+			 * side-effect, RecoveryInProgress() initializes TimeLineID if
+			 * it's not set yet.
+			 */
+			do_restartpoint = RecoveryInProgress();
+
+			/*
+			 * Atomically fetch the request flags to figure out what kind of a
+			 * checkpoint we should perform, and increase the started-counter
+			 * to acknowledge that we've started a new checkpoint.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			flags |= bgs->ckpt_flags;
+			bgs->ckpt_flags = 0;
+			bgs->ckpt_started++;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			/*
+			 * The end-of-recovery checkpoint is a real checkpoint that's
+			 * performed while we're still in recovery.
+			 */
+			if (flags & CHECKPOINT_END_OF_RECOVERY)
+				do_restartpoint = false;
+
+			/*
+			 * We will warn if (a) too soon since last checkpoint (whatever
+			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
+			 * since the last checkpoint start.  Note in particular that this
+			 * implementation will not generate warnings caused by
+			 * CheckPointTimeout < CheckPointWarning.
+			 */
+			if (!do_restartpoint &&
+				(flags & CHECKPOINT_CAUSE_XLOG) &&
+				elapsed_secs < CheckPointWarning)
+				ereport(LOG,
+						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
+				"checkpoints are occurring too frequently (%d seconds apart)",
+									   elapsed_secs,
+									   elapsed_secs),
+						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+
+			/*
+			 * Initialize checkpointer-private variables used during checkpoint.
+			 */
+			ckpt_active = true;
+			if (!do_restartpoint)
+				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_time = now;
+			ckpt_cached_elapsed = 0;
+
+			/*
+			 * Do the checkpoint.
+			 */
+			if (!do_restartpoint)
+			{
+				CreateCheckPoint(flags);
+				ckpt_performed = true;
+			}
+			else
+				ckpt_performed = CreateRestartPoint(flags);
+
+			/*
+			 * After any checkpoint, close all smgr files.	This is so we
+			 * won't hang onto smgr references to deleted files indefinitely.
+			 */
+			smgrcloseall();
+
+			/*
+			 * Indicate checkpoint completion to any waiting backends.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (ckpt_performed)
+			{
+				/*
+				 * Note we record the checkpoint start time not end time as
+				 * last_checkpoint_time.  This is so that time-driven
+				 * checkpoints happen at a predictable spacing.
+				 */
+				last_checkpoint_time = now;
+			}
+			else
+			{
+				/*
+				 * We were not able to perform the restartpoint (checkpoints
+				 * throw an ERROR in case of error).  Most likely because we
+				 * have not received any new checkpoint WAL records since the
+				 * last restartpoint. Try again in 15 s.
+				 */
+				last_checkpoint_time = now - CheckPointTimeout + 15;
+			}
+
+			ckpt_active = false;
+		}
+
+		/*
+		 * Nap for a while and then loop again. Later patches will replace
+		 * this with a latch loop. Keep it simple now for clarity.
+		 * Relatively long sleep because the bgwriter does cleanup now.
+		 */
+		pg_usleep(500000L);
+
+		/* Check for archive_timeout and switch xlog files if necessary. */
+		CheckArchiveTimeout();
+	}
+}
+
+/*
+ * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
+ *
+ * This will switch to a new WAL file and force an archive file write
+ * if any activity is recorded in the current WAL file, including just
+ * a single checkpoint record.
+ */
+static void
+CheckArchiveTimeout(void)
+{
+	pg_time_t	now;
+	pg_time_t	last_time;
+
+	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
+		return;
+
+	now = (pg_time_t) time(NULL);
+
+	/* First we do a quick check using possibly-stale local state. */
+	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
+		return;
+
+	/*
+	 * Update local state ... note that last_xlog_switch_time is the last time
+	 * a switch was performed *or requested*.
+	 */
+	last_time = GetLastSegSwitchTime();
+
+	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
+
+	/* Now we can do the real check */
+	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
+	{
+		XLogRecPtr	switchpoint;
+
+		/* OK, it's time to switch */
+		switchpoint = RequestXLogSwitch();
+
+		/*
+		 * If the returned pointer points exactly to a segment boundary,
+		 * assume nothing happened.
+		 */
+		if ((switchpoint.xrecoff % XLogSegSize) != 0)
+			ereport(DEBUG1,
+				(errmsg("transaction log switch forced (archive_timeout=%d)",
+						XLogArchiveTimeout)));
+
+		/*
+		 * Update state in any case, so we don't retry constantly when the
+		 * system is idle.
+		 */
+		last_xlog_switch_time = now;
+	}
+}
+
+/*
+ * Returns true if an immediate checkpoint request is pending.	(Note that
+ * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
+ * there is one pending behind it.)
+ */
+static bool
+ImmediateCheckpointRequested(void)
+{
+	if (checkpoint_requested)
+	{
+		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+		/*
+		 * We don't need to acquire the ckpt_lck in this case because we're
+		 * only looking at a single flag bit.
+		 */
+		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * CheckpointWriteDelay -- control rate of checkpoint
+ *
+ * This function is called after each page write performed by BufferSync().
+ * It is responsible for throttling BufferSync()'s write rate to hit
+ * checkpoint_completion_target.
+ *
+ * The checkpoint request flags should be passed in; currently the only one
+ * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
+ *
+ * 'progress' is an estimate of how much of the work has been done, as a
+ * fraction between 0.0 meaning none, and 1.0 meaning all done.
+ */
+void
+CheckpointWriteDelay(int flags, double progress)
+{
+	static int	absorb_counter = WRITES_PER_ABSORB;
+
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * Perform the usual duties and take a nap, unless we're behind
+	 * schedule, in which case we just try to catch up as quickly as possible.
+	 */
+	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!shutdown_requested &&
+		!ImmediateCheckpointRequested() &&
+		IsCheckpointOnSchedule(progress))
+	{
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+
+		CheckArchiveTimeout();
+
+		/*
+		 * Checkpoint sleep used to be connected to bgwriter_delay at 200ms.
+		 * That resulted in more frequent wakeups if not much work to do.
+		 * Checkpointer and bgwriter are no longer related so take the Big Sleep.
+		 */
+		pg_usleep(500000L);
+	}
+	else if (--absorb_counter <= 0)
+	{
+		/*
+		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+		 * operations even when we don't sleep, to prevent overflow of the
+		 * fsync request queue.
+		 */
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+	}
+}
+
+/*
+ * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+ *		 in time?
+ *
+ * Compares the current progress against the time/segments elapsed since last
+ * checkpoint, and returns true if the progress we've made so far is greater
+ * than the elapsed time/segments.
+ */
+static bool
+IsCheckpointOnSchedule(double progress)
+{
+	XLogRecPtr	recptr;
+	struct timeval now;
+	double		elapsed_xlogs,
+				elapsed_time;
+
+	Assert(ckpt_active);
+
+	/* Scale progress according to checkpoint_completion_target. */
+	progress *= CheckPointCompletionTarget;
+
+	/*
+	 * Check against the cached value first. Only do the more expensive
+	 * calculations once we reach the target previously calculated. Since
+	 * neither time nor the WAL insert pointer moves backwards, a freshly
+	 * calculated value can only be greater than or equal to the cached value.
+	 */
+	if (progress < ckpt_cached_elapsed)
+		return false;
+
+	/*
+	 * Check progress against WAL segments written and checkpoint_segments.
+	 *
+	 * We compare the current WAL insert location against the location
+	 * computed before calling CreateCheckPoint. The code in XLogInsert that
+	 * actually triggers a checkpoint when checkpoint_segments is exceeded
+	 * compares against RedoRecPtr, so this is not completely accurate.
+	 * However, it's good enough for our purposes; we're only calculating an
+	 * estimate anyway.
+	 */
+	if (!RecoveryInProgress())
+	{
+		recptr = GetInsertRecPtr();
+		elapsed_xlogs =
+			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+			CheckPointSegments;
+
+		if (progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+			return false;
+		}
+	}
+
+	/*
+	 * Check progress against time elapsed and checkpoint_timeout.
+	 */
+	gettimeofday(&now, NULL);
+	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
+					now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+	if (progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+		return false;
+	}
+
+	/* It looks like we're on schedule. */
+	return true;
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * chkpt_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+chkpt_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).	This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ChkptSigHupHandler(SIGNAL_ARGS)
+{
+	got_SIGHUP = true;
+}
+
+/* SIGINT: set flag to run a normal checkpoint right away */
+static void
+ReqCheckpointHandler(SIGNAL_ARGS)
+{
+	checkpoint_requested = true;
+}
+
+/* SIGUSR2: set flag to run a shutdown checkpoint and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	shutdown_requested = true;
+}
+
+
+/* --------------------------------
+ *		communication with backends
+ * --------------------------------
+ */
+
+/*
+ * BgWriterShmemSize
+ *		Compute space needed for bgwriter-related shared memory
+ */
+Size
+BgWriterShmemSize(void)
+{
+	Size		size;
+
+	/*
+	 * Currently, the size of the requests[] array is arbitrarily set equal to
+	 * NBuffers.  This may prove too large or small ...
+	 */
+	size = offsetof(BgWriterShmemStruct, requests);
+	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
+
+	return size;
+}
+
+/*
+ * BgWriterShmemInit
+ *		Allocate and initialize bgwriter-related shared memory
+ */
+void
+BgWriterShmemInit(void)
+{
+	bool		found;
+
+	BgWriterShmem = (BgWriterShmemStruct *)
+		ShmemInitStruct("Background Writer Data",
+						BgWriterShmemSize(),
+						&found);
+
+	if (!found)
+	{
+		/* First time through, so initialize */
+		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
+		SpinLockInit(&BgWriterShmem->ckpt_lck);
+		BgWriterShmem->max_requests = NBuffers;
+	}
+}
+
+/*
+ * RequestCheckpoint
+ *		Called in backend processes to request a checkpoint
+ *
+ * flags is a bitwise OR of the following:
+ *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
+ *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
+ *		ignoring checkpoint_completion_target parameter.
+ *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
+ *		CHECKPOINT_END_OF_RECOVERY).
+ *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
+ *		just signal checkpointer to do it, and return).
+ *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
+ *		(This affects logging, and in particular enables CheckPointWarning.)
+ */
+void
+RequestCheckpoint(int flags)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+	int			ntries;
+	int			old_failed,
+				old_started;
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		/*
+		 * There's no point in doing slow checkpoints in a standalone backend,
+		 * because there are no other backends the checkpoint could disrupt.
+		 */
+		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
+
+		/*
+		 * After any checkpoint, close all smgr files.	This is so we won't
+		 * hang onto smgr references to deleted files indefinitely.
+		 */
+		smgrcloseall();
+
+		return;
+	}
+
+	/*
+	 * Atomically set the request flags, and take a snapshot of the counters.
+	 * When we see ckpt_started > old_started, we know the flags we set here
+	 * have been seen by checkpointer.
+	 *
+	 * Note that we OR the flags with any existing flags, to avoid overriding
+	 * a "stronger" request by another backend.  The flag senses must be
+	 * chosen to make this work!
+	 */
+	SpinLockAcquire(&bgs->ckpt_lck);
+
+	old_failed = bgs->ckpt_failed;
+	old_started = bgs->ckpt_started;
+	bgs->ckpt_flags |= flags;
+
+	SpinLockRelease(&bgs->ckpt_lck);
+
+	/*
+	 * Send signal to request checkpoint.  It's possible that the checkpointer
+	 * hasn't started yet, or is in process of restarting, so we will retry a
+	 * few times if needed.  Also, if not told to wait for the checkpoint to
+	 * occur, we consider failure to send the signal to be nonfatal and merely
+	 * LOG it.
+	 */
+	for (ntries = 0;; ntries++)
+	{
+		if (BgWriterShmem->checkpointer_pid == 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+				"could not request checkpoint because checkpointer not running");
+				break;
+			}
+		}
+		else if (kill(BgWriterShmem->checkpointer_pid, SIGINT) != 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+					 "could not signal for checkpoint: %m");
+				break;
+			}
+		}
+		else
+			break;				/* signal sent successfully */
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
+	}
+
+	/*
+	 * If requested, wait for completion.  We detect completion according to
+	 * the algorithm given above.
+	 */
+	if (flags & CHECKPOINT_WAIT)
+	{
+		int			new_started,
+					new_failed;
+
+		/* Wait for a new checkpoint to start. */
+		for (;;)
+		{
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_started = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_started != old_started)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		/*
+		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
+		 */
+		for (;;)
+		{
+			int			new_done;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_done = bgs->ckpt_done;
+			new_failed = bgs->ckpt_failed;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_done - new_started >= 0)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		if (new_failed != old_failed)
+			ereport(ERROR,
+					(errmsg("checkpoint request failed"),
+					 errhint("Consult recent messages in the server log for details.")));
+	}
+}
+
+/*
+ * ForwardFsyncRequest
+ *		Forward a file-fsync request from a backend to the checkpointer
+ *
+ * Whenever a backend is compelled to write directly to a relation
+ * (which should be seldom, if the bgwriter is getting its job done),
+ * the backend calls this routine to pass over knowledge that the relation
+ * is dirty and must be fsync'd before next checkpoint.  We also use this
+ * opportunity to count such writes for statistical purposes.
+ *
+ * segno specifies which segment (not block!) of the relation needs to be
+ * fsync'd.  (Since the valid range is much less than BlockNumber, we can
+ * use high values for special flags; that's all internal to md.c, which
+ * see for details.)
+ *
+ * To avoid holding the lock for longer than necessary, we normally write
+ * to the requests[] queue without checking for duplicates.  The checkpointer
+ * will have to eliminate dups internally anyway.  However, if we discover
+ * that the queue is full, we make a pass over the entire queue to compact
+ * it.	This is somewhat expensive, but the alternative is for the backend
+ * to perform its own fsync, which is far more expensive in practice.  It
+ * is theoretically possible a backend fsync might still be necessary, if
+ * the queue is full and contains no duplicate entries.  In that case, we
+ * let the backend know by returning false.
+ */
+bool
+ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
+					BlockNumber segno)
+{
+	BgWriterRequest *request;
+
+	if (!IsUnderPostmaster)
+		return false;			/* probably shouldn't even get here */
+
+	if (am_checkpointer)
+		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Count all backend writes regardless of whether they fit in the queue */
+	BgWriterShmem->num_backend_writes++;
+
+	/*
+	 * If the checkpointer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.	But before
+	 * forcing that to happen, we can try to compact the checkpointer
+	 * request queue.
+	 */
+	if (BgWriterShmem->checkpointer_pid == 0 ||
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		 && !CompactCheckpointerRequestQueue()))
+	{
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync
+		 */
+		BgWriterShmem->num_backend_fsync++;
+		LWLockRelease(BgWriterCommLock);
+		return false;
+	}
+	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
+	request->rnode = rnode;
+	request->forknum = forknum;
+	request->segno = segno;
+	LWLockRelease(BgWriterCommLock);
+	return true;
+}
+
+/*
+ * CompactCheckpointerRequestQueue
+ *		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Although a full fsync request queue is not common, it can lead to severe
+ * performance problems when it does happen.  So far, this situation has
+ * only been observed to occur when the system is under heavy write load,
+ * and especially during the "sync" phase of a checkpoint.	Without this
+ * logic, each backend begins doing an fsync for every block written, which
+ * gets very expensive and can slow down the whole system.
+ *
+ * Trying to do this every time the queue is full could lose if there
+ * aren't any removable entries.  But that should be vanishingly rare in
+ * practice: there's one queue entry per shared buffer.
+ */
+static bool
+CompactCheckpointerRequestQueue(void)
+{
+	struct BgWriterSlotMapping
+	{
+		BgWriterRequest request;
+		int			slot;
+	};
+
+	int			n,
+				preserve_count;
+	int			num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactCheckpointerRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/*
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing less
+	 * copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
+	 * we do it this way.  It would be possible to be even smarter if we made
+	 * the code below understand the specific semantics of such requests (it
+	 * could blow away preceding entries that would end up being canceled
+	 * anyhow), but it's not clear that the extra complexity would buy us
+	 * anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool		found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;
+			++num_skipped;
+		}
+		slotmap->slot = n;
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (!num_skipped)
+	{
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* We found some duplicates; remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+	   (errmsg("compacted fsync request queue from %d entries to %d entries",
+			   BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
+ * AbsorbFsyncRequests
+ *		Retrieve queued fsync requests and pass them to local smgr.
+ *
+ * This is exported because it must be called during CreateCheckPoint;
+ * we have to be sure we have accepted all pending requests just before
+ * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
+ * non-checkpointer processes, do nothing if not checkpointer.
+ */
+void
+AbsorbFsyncRequests(void)
+{
+	BgWriterRequest *requests = NULL;
+	BgWriterRequest *request;
+	int			n;
+
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * We have to PANIC if we fail to absorb all the pending requests (eg,
+	 * because our hashtable runs out of memory).  This is because the system
+	 * cannot run safely if we are unable to fsync what we have been told to
+	 * fsync.  Fortunately, the hashtable is so small that the problem is
+	 * quite unlikely to arise in practice.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We try to avoid holding the lock for a long time by copying the request
+	 * array.
+	 */
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Transfer write count into pending pgstats message */
+	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
+	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
+
+	BgWriterShmem->num_backend_writes = 0;
+	BgWriterShmem->num_backend_fsync = 0;
+
+	n = BgWriterShmem->num_requests;
+	if (n > 0)
+	{
+		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
+		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
+	}
+	BgWriterShmem->num_requests = 0;
+
+	LWLockRelease(BgWriterCommLock);
+
+	for (request = requests; n > 0; request++, n--)
+		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+
+	if (requests)
+		pfree(requests);
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a84d97..c8599c2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -206,6 +206,7 @@ bool		restart_after_crash = true;
 /* PIDs of special child processes; 0 when not running */
 static pid_t StartupPID = 0,
 			BgWriterPID = 0,
+			CheckpointerPID = 0,
 			WalWriterPID = 0,
 			WalReceiverPID = 0,
 			AutoVacPID = 0,
@@ -277,7 +278,7 @@ typedef enum
 	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
 	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
-	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
+	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown ckpt */
 	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
@@ -463,6 +464,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()		StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
 
@@ -1015,8 +1017,8 @@ PostmasterMain(int argc, char *argv[])
 	 * CAUTION: when changing this list, check for side-effects on the signal
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
-	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, and
-	 * postmaster/syslogger.c.
+	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
+	 * postmaster/syslogger.c and postmaster/checkpointer.c
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1353,10 +1355,14 @@ ServerLoop(void)
 		 * state that prevents it, start one.  It doesn't matter if this
 		 * fails, we'll just try again later.
 		 */
-		if (BgWriterPID == 0 &&
-			(pmState == PM_RUN || pmState == PM_RECOVERY ||
-			 pmState == PM_HOT_STANDBY))
-			BgWriterPID = StartBackgroundWriter();
+		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
+			 pmState == PM_HOT_STANDBY)
+		{
+			if (BgWriterPID == 0)
+				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
+		}
 
 		/*
 		 * Likewise, if we have lost the walwriter process, try to start a new
@@ -2034,6 +2040,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
 			signal_child(BgWriterPID, SIGHUP);
+		if (CheckpointerPID != 0)
+			signal_child(CheckpointerPID, SIGHUP);
 		if (WalWriterPID != 0)
 			signal_child(WalWriterPID, SIGHUP);
 		if (WalReceiverPID != 0)
@@ -2148,7 +2156,7 @@ pmdie(SIGNAL_ARGS)
 				signal_child(WalReceiverPID, SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
-				/* only bgwriter is active in this state */
+				/* only checkpointer is active in this state */
 				pmState = PM_WAIT_BACKENDS;
 			}
 			else if (pmState == PM_RUN ||
@@ -2193,6 +2201,8 @@ pmdie(SIGNAL_ARGS)
 				signal_child(StartupPID, SIGQUIT);
 			if (BgWriterPID != 0)
 				signal_child(BgWriterPID, SIGQUIT);
+			if (CheckpointerPID != 0)
+				signal_child(CheckpointerPID, SIGQUIT);
 			if (WalWriterPID != 0)
 				signal_child(WalWriterPID, SIGQUIT);
 			if (WalReceiverPID != 0)
@@ -2323,12 +2333,14 @@ reaper(SIGNAL_ARGS)
 			}
 
 			/*
-			 * Crank up the background writer, if we didn't do that already
+			 * Crank up background tasks, if we didn't do that already
 			 * when we entered consistent recovery state.  It doesn't matter
 			 * if this fails, we'll just try again later.
 			 */
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2356,10 +2368,22 @@ reaper(SIGNAL_ARGS)
 		if (pid == BgWriterPID)
 		{
 			BgWriterPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background writer process"));
+			continue;
+		}
+
+		/*
+		 * Was it the checkpointer?
+		 */
+		if (pid == CheckpointerPID)
+		{
+			CheckpointerPID = 0;
 			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
-				 * OK, we saw normal exit of the bgwriter after it's been told
+				 * OK, we saw normal exit of the checkpointer after it's been told
 				 * to shut down.  We expect that it wrote a shutdown
 				 * checkpoint.	(If for some reason it didn't, recovery will
 				 * occur on next postmaster start.)
@@ -2396,11 +2420,11 @@ reaper(SIGNAL_ARGS)
 			else
 			{
 				/*
-				 * Any unexpected exit of the bgwriter (including FATAL exit)
+				 * Any unexpected exit of the checkpointer (including FATAL exit)
 				 * is treated as a crash.
 				 */
 				HandleChildCrash(pid, exitstatus,
-								 _("background writer process"));
+								 _("checkpointer process"));
 			}
 
 			continue;
@@ -2584,8 +2608,8 @@ CleanupBackend(int pid,
 }
 
 /*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, walwriter,
- * or autovacuum.
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter or autovacuum.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -2678,6 +2702,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the checkpointer too */
+	if (pid == CheckpointerPID)
+		CheckpointerPID = 0;
+	else if (CheckpointerPID != 0 && !FatalError)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) CheckpointerPID)));
+		signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/* Take care of the walwriter too */
 	if (pid == WalWriterPID)
 		WalWriterPID = 0;
@@ -2857,9 +2893,10 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
-		 * (including autovac workers) and no walwriter or autovac launcher.
-		 * If we are doing crash recovery then we expect the bgwriter to exit
-		 * too, otherwise not.	The archiver, stats, and syslogger processes
+		 * (including autovac workers) and no walwriter, autovac launcher
+		 * or bgwriter.  If we are doing crash recovery then we expect the
+		 * checkpointer to exit as well, otherwise not.
+		 * The archiver, stats, and syslogger processes
 		 * are disregarded since they are not connected to shared memory; we
 		 * also disregard dead_end children here. Walsenders are also
 		 * disregarded, they will be terminated later after writing the
@@ -2868,7 +2905,8 @@ PostmasterStateMachine(void)
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
 			StartupPID == 0 &&
 			WalReceiverPID == 0 &&
-			(BgWriterPID == 0 || !FatalError) &&
+			BgWriterPID == 0 &&
+			(CheckpointerPID == 0 || !FatalError) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
@@ -2890,22 +2928,22 @@ PostmasterStateMachine(void)
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
 				 * the regular children are gone, and it's time to tell the
-				 * bgwriter to do a shutdown checkpoint.
+				 * checkpointer to do a shutdown checkpoint.
 				 */
 				Assert(Shutdown > NoShutdown);
-				/* Start the bgwriter if not running */
-				if (BgWriterPID == 0)
-					BgWriterPID = StartBackgroundWriter();
+				/* Start the checkpointer if not running */
+				if (CheckpointerPID == 0)
+					CheckpointerPID = StartCheckpointer();
 				/* And tell it to shut down */
-				if (BgWriterPID != 0)
+				if (CheckpointerPID != 0)
 				{
-					signal_child(BgWriterPID, SIGUSR2);
+					signal_child(CheckpointerPID, SIGUSR2);
 					pmState = PM_SHUTDOWN;
 				}
 				else
 				{
 					/*
-					 * If we failed to fork a bgwriter, just shut down. Any
+					 * If we failed to fork a checkpointer, just shut down. Any
 					 * required cleanup will happen at next restart. We set
 					 * FatalError so that an "abnormal shutdown" message gets
 					 * logged when we exit.
@@ -2964,6 +3002,7 @@ PostmasterStateMachine(void)
 			Assert(StartupPID == 0);
 			Assert(WalReceiverPID == 0);
 			Assert(BgWriterPID == 0);
+			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
 			/* syslogger is not considered here */
@@ -4143,6 +4182,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		 */
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(CheckpointerPID == 0);
+		CheckpointerPID = StartCheckpointer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -4429,6 +4470,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 				   (errmsg("could not fork background writer process: %m")));
 				break;
+			case CheckpointerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork checkpointer process: %m")));
+				break;
 			case WalWriterProcess:
 				ereport(LOG,
 						(errmsg("could not fork WAL writer process: %m")));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8647edd..184e820 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1278,11 +1278,9 @@ BufferSync(int flags)
 					break;
 
 				/*
-				 * Perform normal bgwriter duties and sleep to throttle our
-				 * I/O rate.
+				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags,
-									 (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
 			}
 		}
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3015885..a761369 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -38,7 +38,7 @@
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
- * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
  * See comments there before making changes here.
  */
@@ -77,7 +77,7 @@
  *	Inactive segments are those that once contained data but are currently
  *	not needed because of an mdtruncate() operation.  The reason for leaving
  *	them present at size zero, rather than unlinking them, is that other
- *	backends and/or the bgwriter might be holding open file references to
+ *	backends and/or the checkpointer might be holding open file references to
  *	such segments.	If the relation expands again after mdtruncate(), such
  *	that a deactivated segment becomes active again, it is important that
  *	such file references still be valid --- else data might get written
@@ -111,7 +111,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
 
 
 /*
- * In some contexts (currently, standalone backends and the bgwriter process)
+ * In some contexts (currently, standalone backends and the checkpointer process)
  * we keep track of pending fsync operations: we need to remember all relation
  * segments that have been written since the last checkpoint, so that we can
  * fsync them down to disk before completing the next checkpoint.  This hash
@@ -123,7 +123,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
  * a hash table, because we don't expect there to be any duplicate requests.
  *
  * (Regular backends do not track pending operations locally, but forward
- * them to the bgwriter.)
+ * them to the checkpointer.)
  */
 typedef struct
 {
@@ -194,7 +194,7 @@ mdinit(void)
 	 * Create pending-operations hashtable if we need it.  Currently, we need
 	 * it if we are standalone (not under a postmaster) OR if we are a
 	 * bootstrap-mode subprocess of a postmaster (that is, a startup or
-	 * bgwriter process).
+	 * checkpointer process).
 	 */
 	if (!IsUnderPostmaster || IsBootstrapProcessingMode())
 	{
@@ -214,10 +214,10 @@ mdinit(void)
 }
 
 /*
- * In archive recovery, we rely on bgwriter to do fsyncs, but we will have
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
  * already created the pendingOpsTable during initialization of the startup
  * process.  Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to bgwriter.
+ * subsequent requests will be forwarded to checkpointer.
  */
 void
 SetForwardFsyncRequests(void)
@@ -765,9 +765,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: this assumption could only be wrong if another backend has
 	 * truncated the relation.	We rely on higher code levels to handle that
 	 * scenario by closing and re-opening the md fd, which is handled via
-	 * relcache flush.	(Since the bgwriter doesn't participate in relcache
+	 * relcache flush.	(Since the checkpointer doesn't participate in relcache
 	 * flush, it could have segment chain entries for inactive segments;
-	 * that's OK because the bgwriter never needs to compute relation size.)
+	 * that's OK because the checkpointer never needs to compute relation size.)
 	 */
 	while (v->mdfd_chain != NULL)
 	{
@@ -957,7 +957,7 @@ mdsync(void)
 		elog(ERROR, "cannot sync without a pendingOpsTable");
 
 	/*
-	 * If we are in the bgwriter, the sync had better include all fsync
+	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.	The tightest
 	 * race condition that could occur is that a buffer that must be written
 	 * and fsync'd for the checkpoint could have been dumped by a backend just
@@ -1033,7 +1033,7 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
+			 * If in checkpointer, we want to absorb pending requests every so
 			 * often to prevent overflow of the fsync request queue.  It is
 			 * unspecified whether newly-added entries will be visited by
 			 * hash_seq_search, but we don't care since we don't need to
@@ -1070,9 +1070,9 @@ mdsync(void)
 				 * say "but an unreferenced SMgrRelation is still a leak!" Not
 				 * really, because the only case in which a checkpoint is done
 				 * by a process that isn't about to shut down is in the
-				 * bgwriter, and it will periodically do smgrcloseall(). This
+				 * checkpointer, and it will periodically do smgrcloseall(). This
 				 * fact justifies our not closing the reln in the success path
-				 * either, which is a good thing since in non-bgwriter cases
+				 * either, which is a good thing since in non-checkpointer cases
 				 * we couldn't safely do that.)  Furthermore, in many cases
 				 * the relation will have been dirtied through this same smgr
 				 * relation, and so we can save a file open/close cycle.
@@ -1301,7 +1301,7 @@ register_unlink(RelFileNodeBackend rnode)
 	else
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the request
+		 * Notify the checkpointer about it.  If we fail to queue the request
 		 * message, we have to sleep and try again, because we can't simply
 		 * delete the file now.  Ugly, but hopefully won't happen often.
 		 *
@@ -1315,10 +1315,10 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from bgwriter side of fsync request
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
  *
  * We stuff most fsync requests into the local hash table for execution
- * during the bgwriter's next checkpoint.  UNLINK requests go into a
+ * during the checkpointer's next checkpoint.  UNLINK requests go into a
  * separate linked list, however, because they get processed separately.
  *
  * The range of possible segment numbers is way less than the range of
@@ -1460,20 +1460,20 @@ ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
 	else if (IsUnderPostmaster)
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the revoke
+		 * Notify the checkpointer about it.  If we fail to queue the revoke
 		 * message, we have to sleep and try again ... ugly, but hopefully
 		 * won't happen often.
 		 *
 		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
 		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the bgwriter
+		 * which would be bad, so I'm inclined to assume that the checkpointer
 		 * will always empty the queue soon.
 		 */
 		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
 			pg_usleep(10000L);	/* 10 msec seems a good number */
 
 		/*
-		 * Note we don't wait for the bgwriter to actually absorb the revoke
+		 * Note we don't wait for the checkpointer to actually absorb the revoke
 		 * message; see mdsync() for the implications.
 		 */
 	}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 4eaa243..cb43879 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -256,7 +256,7 @@ typedef struct RmgrData
 extern const RmgrData RmgrTable[];
 
 /*
- * Exported to support xlog switching from bgwriter
+ * Exported to support xlog switching from checkpointer
  */
 extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
diff --git a/src/include/bootstrap/bootstrap.h b/src/include/bootstrap/bootstrap.h
index cee9bd1..6153b7a 100644
--- a/src/include/bootstrap/bootstrap.h
+++ b/src/include/bootstrap/bootstrap.h
@@ -22,6 +22,7 @@ typedef enum
 	BootstrapProcess,
 	StartupProcess,
 	BgWriterProcess,
+	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..c05901e 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,6 +23,7 @@ extern int	CheckPointWarning;
 extern double CheckPointCompletionTarget;
 
 extern void BackgroundWriterMain(void);
+extern void CheckpointerMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 46ec625..6e798b1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -190,11 +190,11 @@ extern PROC_HDR *ProcGlobal;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer and WAL writer run during normal operation. Startup
- * process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 3 slots.
+ * Background writer, checkpointer and WAL writer run during normal operation.
+ * Startup process and WAL receiver also consume 2 slots, but WAL writer is
+ * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		3
+#define NUM_AUXILIARY_PROCS		4
 
 
 /* configurable options */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2a27e0b..d5afe01 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -19,7 +19,7 @@
 
 /*
  * Reasons for signalling a Postgres child process (a backend or an auxiliary
- * process, like bgwriter).  We can cope with concurrent signals for different
+ * process, like checkpointer).  We can cope with concurrent signals for different
  * reasons.  However, if the same reason is signaled multiple times in quick
  * succession, the process is likely to observe only one notification of it.
  * This is okay for the present uses.
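
The comment in CompactCheckpointerRequestQueue in the patch above says a queued request can be dropped when a later, identical request follows it. A minimal standalone sketch of that rule is shown below. It is not the patch's code: DemoRequest, demo_request_equal and demo_compact_queue are hypothetical names, the struct is a simplified stand-in for BgWriterRequest, and a quadratic scan replaces the temporary hash table purely for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for BgWriterRequest. */
typedef struct
{
	int			rel;
	int			fork;
	int			segno;
} DemoRequest;

static bool
demo_request_equal(const DemoRequest *a, const DemoRequest *b)
{
	return a->rel == b->rel && a->fork == b->fork && a->segno == b->segno;
}

/*
 * Drop every request that is followed by a later, identical request.
 * The last occurrence of each request survives and the relative order of
 * the survivors is unchanged.  Returns the new queue length.
 *
 * This is O(n^2) to keep the sketch short; the patch uses a hash table.
 */
static size_t
demo_compact_queue(DemoRequest *queue, size_t n)
{
	size_t		keep = 0;

	assert(queue != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
	{
		bool		superseded = false;

		for (size_t j = i + 1; j < n; j++)
		{
			if (demo_request_equal(&queue[i], &queue[j]))
			{
				superseded = true;
				break;
			}
		}

		/* keep <= i always holds, so this never clobbers unread entries */
		if (!superseded)
			queue[keep++] = queue[i];
	}
	return keep;
}
```

Keeping the last occurrence rather than the first is what the patch comment argues for: a FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request sitting between two identical requests still has a later request after it, so the queue keeps its meaning.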

#22

Dickson S. Guedes

listas@guedesoft.net

over 14 years ago

In reply to: Simon Riggs (#21)

Re: Separating bgwriter and checkpointer

2011/10/3 Simon Riggs <simon@2ndquadrant.com>:

On Sun, Oct 2, 2011 at 11:45 PM, Dickson S. Guedes <listas@guedesoft.net> wrote:

I'm trying your patch; it applied cleanly to master and compiled
OK. But since I started postgres I'm seeing 99% CPU usage:

Oh, thanks. I see what happened. I was toying with the idea of going
straight to a WaitLatch implementation for the loop but decided to
leave it out for a later patch, and then skipped the sleep as well.

New version attached.
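
As an illustration of the symptom described here (a polling loop that lost its sleep), the following is a rough sketch and not the patch's actual loop. demo_nap, demo_polling_loop and demo_slept_usec are hypothetical names, and POSIX nanosleep is assumed to be available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical counter: total sleep time requested so far, in usec. */
static long demo_slept_usec = 0;

/* Sleep for usec microseconds; an interrupted sleep is not retried. */
static void
demo_nap(long usec)
{
	struct timespec ts;

	assert(usec >= 0);
	ts.tv_sec = usec / 1000000L;
	ts.tv_nsec = (usec % 1000000L) * 1000L;
	(void) nanosleep(&ts, NULL);
	demo_slept_usec += usec;
}

/*
 * Run max_cycles rounds of a polling loop and return the rounds executed.
 * With throttle = false the loop never yields the CPU between rounds,
 * which is the busy-spin case.
 */
static int
demo_polling_loop(int max_cycles, long delay_usec, bool throttle)
{
	int			cycles = 0;

	while (cycles < max_cycles)	/* a real loop would test a shutdown flag */
	{
		/* ... one round of background work would go here ... */
		cycles++;

		if (throttle)
			demo_nap(delay_usec);
	}
	return cycles;
}
```

In an unbounded loop the unthrottled variant keeps one core fully busy, which would match the reported CPU usage. A latch-based wait, as mentioned above, would replace the fixed sleep with blocking until there is work or a timeout expires.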

Working now, but even though all tests pass for make check, the
regression database's postmaster doesn't shut down properly.

$ make check
...
...
============== creating temporary installation ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 57432 with PID 20094
============== creating database "regression" ==============
...
============== shutting down postmaster ==============
pg_ctl: server does not shut down
pg_regress: could not stop postmaster: exit code was 256

$ uname -a
Linux betelgeuse 2.6.38-11-generic-pae #50-Ubuntu SMP Mon Sep 12
22:21:04 UTC 2011 i686 i686 i386 GNU/Linux

$ grep "$ ./configure" config.log
$ ./configure --enable-debug --enable-cassert
--prefix=/srv/postgres/bgwriter_split

Best regards,
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

#23

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Dickson S. Guedes (#22)

Re: Separating bgwriter and checkpointer

On Tue, Oct 4, 2011 at 2:51 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:

2011/10/3 Simon Riggs <simon@2ndquadrant.com>:

On Sun, Oct 2, 2011 at 11:45 PM, Dickson S. Guedes <listas@guedesoft.net> wrote:

I'm trying your patch; it applied cleanly to master and compiled
OK. But since I started postgres I'm seeing 99% CPU usage:

Oh, thanks. I see what happened. I was toying with the idea of going
straight to a WaitLatch implementation for the loop but decided to
leave it out for a later patch, and then skipped the sleep as well.

New version attached.

Working now, but even though all tests pass for make check, the
regression database's postmaster doesn't shut down properly.

$ make check
...
...
============== creating temporary installation ==============
============== initializing database system ==============
============== starting postmaster ==============
running on port 57432 with PID 20094
============== creating database "regression" ==============
...
============== shutting down postmaster ==============
pg_ctl: server does not shut down
pg_regress: could not stop postmaster: exit code was 256

$ uname -a
Linux betelgeuse 2.6.38-11-generic-pae #50-Ubuntu SMP Mon Sep 12
22:21:04 UTC 2011 i686 i686 i386 GNU/Linux

$ grep "$ ./configure" config.log
$ ./configure --enable-debug --enable-cassert
--prefix=/srv/postgres/bgwriter_split

Yes, I see this also. At the same time, pg_ctl start and stop seem to
work fine in every mode, which is what I tested. Which seems a little
weird.

I seem to be having problems with HEAD as well.

Investigating further.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#24

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Simon Riggs (#23)

1 attachment(s)

Re: Separating bgwriter and checkpointer

On Tue, Oct 4, 2011 at 10:05 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

============== shutting down postmaster ==============
pg_ctl: server does not shut down
pg_regress: could not stop postmaster: exit code was 256

Yes, I see this also. At the same time, pg_ctl start and stop seem to
work fine in every mode, which is what I tested. Which seems a little
weird.

I seem to be having problems with HEAD as well.

Investigating further.

Doh.

The problem is the *same* one I fixed in v2, yet now I see I managed
to somehow exclude that fix from the earlier patch. Slap. Anyway,
fixed again now.

The problem observed in HEAD was due to this bug causing later make
checks to fail on port 57432, so it looked like a problem in HEAD at
first. There is no actual bug there at all.

Thanks for your patience.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

bgwriter_split.v4.patch (application/octet-stream)

diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4fe08df..f9b839c 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -315,6 +315,9 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			case BgWriterProcess:
 				statmsg = "writer process";
 				break;
+			case CheckpointerProcess:
+				statmsg = "checkpointer process";
+				break;
 			case WalWriterProcess:
 				statmsg = "wal writer process";
 				break;
@@ -415,6 +418,11 @@ AuxiliaryProcessMain(int argc, char *argv[])
 			BackgroundWriterMain();
 			proc_exit(1);		/* should never return */
 
+		case CheckpointerProcess:
+			/* don't set signals, checkpointer has its own agenda */
+			CheckpointerMain();
+			proc_exit(1);		/* should never return */
+
 		case WalWriterProcess:
 			/* don't set signals, walwriter has its own agenda */
 			InitXLOGAccess();
diff --git a/src/backend/postmaster/Makefile b/src/backend/postmaster/Makefile
index 0767e97..e7414d2 100644
--- a/src/backend/postmaster/Makefile
+++ b/src/backend/postmaster/Makefile
@@ -13,6 +13,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 OBJS = autovacuum.o bgwriter.o fork_process.o pgarch.o pgstat.o postmaster.o \
-	syslogger.o walwriter.o
+	syslogger.o walwriter.o checkpointer.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 2d0b639..2841cdf 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -10,20 +10,13 @@
  * still empowered to issue writes if the bgwriter fails to maintain enough
  * clean shared buffers.
  *
- * The bgwriter is also charged with handling all checkpoints.	It will
- * automatically dispatch a checkpoint after a certain amount of time has
- * elapsed since the last one, and it can be signaled to perform requested
- * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
- * so many WAL segments is implemented by having backends signal the bgwriter
- * when they fill WAL segments; the bgwriter itself doesn't watch for the
- * condition.)
+ * As of Postgres 9.2 the bgwriter no longer handles checkpoints.
  *
  * The bgwriter is started by the postmaster as soon as the startup subprocess
  * finishes, or as soon as recovery begins if we are doing archive recovery.
  * It remains alive until the postmaster commands it to terminate.
- * Normal termination is by SIGUSR2, which instructs the bgwriter to execute
- * a shutdown checkpoint and then exit(0).	(All backends must be stopped
- * before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT; like any
+ * Normal termination is by SIGUSR2, which instructs the bgwriter to exit(0).
+ * Emergency termination is by SIGQUIT; like any
  * backend, the bgwriter will simply abort and exit on SIGQUIT.
  *
  * If the bgwriter exits unexpectedly, the postmaster treats that the same
@@ -54,7 +47,6 @@
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "postmaster/bgwriter.h"
-#include "replication/syncrep.h"
 #include "storage/bufmgr.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -67,96 +59,15 @@
 #include "utils/resowner.h"
 
 
-/*----------
- * Shared memory area for communication between bgwriter and backends
- *
- * The ckpt counters allow backends to watch for completion of a checkpoint
- * request they send.  Here's how it works:
- *	* At start of a checkpoint, bgwriter reads (and clears) the request flags
- *	  and increments ckpt_started, while holding ckpt_lck.
- *	* On completion of a checkpoint, bgwriter sets ckpt_done to
- *	  equal ckpt_started.
- *	* On failure of a checkpoint, bgwriter increments ckpt_failed
- *	  and sets ckpt_done to equal ckpt_started.
- *
- * The algorithm for backends is:
- *	1. Record current values of ckpt_failed and ckpt_started, and
- *	   set request flags, while holding ckpt_lck.
- *	2. Send signal to request checkpoint.
- *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
- *	   begun since you started this algorithm (although *not* that it was
- *	   specifically initiated by your signal), and that it is using your flags.
- *	4. Record new value of ckpt_started.
- *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
- *	   arithmetic here in case counters wrap around.)  Now you know a
- *	   checkpoint has started and completed, but not whether it was
- *	   successful.
- *	6. If ckpt_failed is different from the originally saved value,
- *	   assume request failed; otherwise it was definitely successful.
- *
- * ckpt_flags holds the OR of the checkpoint request flags sent by all
- * requesting backends since the last checkpoint start.  The flags are
- * chosen so that OR'ing is the correct way to combine multiple requests.
- *
- * num_backend_writes is used to count the number of buffer writes performed
- * by non-bgwriter processes.  This counter should be wide enough that it
- * can't overflow during a single bgwriter cycle.  num_backend_fsync
- * counts the subset of those writes that also had to do their own fsync,
- * because the background writer failed to absorb their request.
- *
- * The requests array holds fsync requests sent by backends and not yet
- * absorbed by the bgwriter.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by BgWriterCommLock.
- *----------
- */
-typedef struct
-{
-	RelFileNodeBackend rnode;
-	ForkNumber	forknum;
-	BlockNumber segno;			/* see md.c for special values */
-	/* might add a real request-type field later; not needed yet */
-} BgWriterRequest;
-
-typedef struct
-{
-	pid_t		bgwriter_pid;	/* PID of bgwriter (0 if not started) */
-
-	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
-
-	int			ckpt_started;	/* advances when checkpoint starts */
-	int			ckpt_done;		/* advances when checkpoint done */
-	int			ckpt_failed;	/* advances when checkpoint fails */
-
-	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
-
-	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
-	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
-
-	int			num_requests;	/* current # of requests */
-	int			max_requests;	/* allocated array size */
-	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
-} BgWriterShmemStruct;
-
-static BgWriterShmemStruct *BgWriterShmem;
-
-/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
-#define WRITES_PER_ABSORB		1000
-
 /*
  * GUC parameters
  */
 int			BgWriterDelay = 200;
-int			CheckPointTimeout = 300;
-int			CheckPointWarning = 30;
-double		CheckPointCompletionTarget = 0.5;
 
 /*
  * Flags set by interrupt handlers for later service in the main loop.
  */
 static volatile sig_atomic_t got_SIGHUP = false;
-static volatile sig_atomic_t checkpoint_requested = false;
 static volatile sig_atomic_t shutdown_requested = false;
 
 /*
@@ -164,29 +75,14 @@ static volatile sig_atomic_t shutdown_requested = false;
  */
 static bool am_bg_writer = false;
 
-static bool ckpt_active = false;
-
-/* these values are valid when ckpt_active is true: */
-static pg_time_t ckpt_start_time;
-static XLogRecPtr ckpt_start_recptr;
-static double ckpt_cached_elapsed;
-
-static pg_time_t last_checkpoint_time;
-static pg_time_t last_xlog_switch_time;
-
 /* Prototypes for private functions */
 
-static void CheckArchiveTimeout(void);
 static void BgWriterNap(void);
-static bool IsCheckpointOnSchedule(double progress);
-static bool ImmediateCheckpointRequested(void);
-static bool CompactBgwriterRequestQueue(void);
 
 /* Signal handlers */
 
 static void bg_quickdie(SIGNAL_ARGS);
 static void BgSigHupHandler(SIGNAL_ARGS);
-static void ReqCheckpointHandler(SIGNAL_ARGS);
 static void ReqShutdownHandler(SIGNAL_ARGS);
 
 
@@ -202,7 +98,6 @@ BackgroundWriterMain(void)
 	sigjmp_buf	local_sigjmp_buf;
 	MemoryContext bgwriter_context;
 
-	BgWriterShmem->bgwriter_pid = MyProcPid;
 	am_bg_writer = true;
 
 	/*
@@ -228,13 +123,13 @@ BackgroundWriterMain(void)
 	 * process to participate in ProcSignal signalling.
 	 */
 	pqsignal(SIGHUP, BgSigHupHandler);	/* set flag to read config file */
-	pqsignal(SIGINT, ReqCheckpointHandler);		/* request checkpoint */
-	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
+	pqsignal(SIGINT, SIG_IGN);			/* as of 9.2 no longer requests checkpoint */
+	pqsignal(SIGTERM, ReqShutdownHandler);	/* shutdown */
 	pqsignal(SIGQUIT, bg_quickdie);		/* hard crash time */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
-	pqsignal(SIGUSR2, ReqShutdownHandler);		/* request shutdown */
+	pqsignal(SIGUSR1, SIG_IGN);			/* reserve for ProcSignal */
+	pqsignal(SIGUSR2, SIG_IGN);			/* not used */
 
 	/*
 	 * Reset some signals that are accepted by postmaster but not here
@@ -249,11 +144,6 @@ BackgroundWriterMain(void)
 	sigdelset(&BlockSig, SIGQUIT);
 
 	/*
-	 * Initialize so that first time-driven event happens at the correct time.
-	 */
-	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
-
-	/*
 	 * Create a resource owner to keep track of our resources (currently only
 	 * buffer pins).
 	 */
@@ -305,20 +195,6 @@ BackgroundWriterMain(void)
 		AtEOXact_Files();
 		AtEOXact_HashTables(false);
 
-		/* Warn any waiting backends that the checkpoint failed. */
-		if (ckpt_active)
-		{
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_failed++;
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			ckpt_active = false;
-		}
-
 		/*
 		 * Now return to normal top-level context and clear ErrorContext for
 		 * next time.
@@ -361,19 +237,11 @@ BackgroundWriterMain(void)
 	if (RecoveryInProgress())
 		ThisTimeLineID = GetRecoveryTargetTLI();
 
-	/* Do this once before starting the loop, then just at SIGHUP time. */
-	SyncRepUpdateSyncStandbysDefined();
-
 	/*
 	 * Loop forever
 	 */
 	for (;;)
 	{
-		bool		do_checkpoint = false;
-		int			flags = 0;
-		pg_time_t	now;
-		int			elapsed_secs;
-
 		/*
 		 * Emergency bailout if postmaster has died.  This is to avoid the
 		 * necessity for manual cleanup of all postmaster children.
@@ -381,23 +249,11 @@ BackgroundWriterMain(void)
 		if (!PostmasterIsAlive())
 			exit(1);
 
-		/*
-		 * Process any requests or signals received recently.
-		 */
-		AbsorbFsyncRequests();
-
 		if (got_SIGHUP)
 		{
 			got_SIGHUP = false;
 			ProcessConfigFile(PGC_SIGHUP);
 			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-		if (checkpoint_requested)
-		{
-			checkpoint_requested = false;
-			do_checkpoint = true;
-			BgWriterStats.m_requested_checkpoints++;
 		}
 		if (shutdown_requested)
 		{
@@ -406,142 +262,14 @@ BackgroundWriterMain(void)
 			 * control back to the sigsetjmp block above
 			 */
 			ExitOnAnyError = true;
-			/* Close down the database */
-			ShutdownXLOG(0, 0);
 			/* Normal exit from the bgwriter is here */
 			proc_exit(0);		/* done */
 		}
 
 		/*
-		 * Force a checkpoint if too much time has elapsed since the last one.
-		 * Note that we count a timed checkpoint in stats only when this
-		 * occurs without an external request, but we set the CAUSE_TIME flag
-		 * bit even if there is also an external request.
+		 * Do one cycle of dirty-buffer writing.
 		 */
-		now = (pg_time_t) time(NULL);
-		elapsed_secs = now - last_checkpoint_time;
-		if (elapsed_secs >= CheckPointTimeout)
-		{
-			if (!do_checkpoint)
-				BgWriterStats.m_timed_checkpoints++;
-			do_checkpoint = true;
-			flags |= CHECKPOINT_CAUSE_TIME;
-		}
-
-		/*
-		 * Do a checkpoint if requested, otherwise do one cycle of
-		 * dirty-buffer writing.
-		 */
-		if (do_checkpoint)
-		{
-			bool		ckpt_performed = false;
-			bool		do_restartpoint;
-
-			/* use volatile pointer to prevent code rearrangement */
-			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-			/*
-			 * Check if we should perform a checkpoint or a restartpoint. As a
-			 * side-effect, RecoveryInProgress() initializes TimeLineID if
-			 * it's not set yet.
-			 */
-			do_restartpoint = RecoveryInProgress();
-
-			/*
-			 * Atomically fetch the request flags to figure out what kind of a
-			 * checkpoint we should perform, and increase the started-counter
-			 * to acknowledge that we've started a new checkpoint.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			flags |= bgs->ckpt_flags;
-			bgs->ckpt_flags = 0;
-			bgs->ckpt_started++;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			/*
-			 * The end-of-recovery checkpoint is a real checkpoint that's
-			 * performed while we're still in recovery.
-			 */
-			if (flags & CHECKPOINT_END_OF_RECOVERY)
-				do_restartpoint = false;
-
-			/*
-			 * We will warn if (a) too soon since last checkpoint (whatever
-			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
-			 * since the last checkpoint start.  Note in particular that this
-			 * implementation will not generate warnings caused by
-			 * CheckPointTimeout < CheckPointWarning.
-			 */
-			if (!do_restartpoint &&
-				(flags & CHECKPOINT_CAUSE_XLOG) &&
-				elapsed_secs < CheckPointWarning)
-				ereport(LOG,
-						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
-				"checkpoints are occurring too frequently (%d seconds apart)",
-									   elapsed_secs,
-									   elapsed_secs),
-						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
-
-			/*
-			 * Initialize bgwriter-private variables used during checkpoint.
-			 */
-			ckpt_active = true;
-			if (!do_restartpoint)
-				ckpt_start_recptr = GetInsertRecPtr();
-			ckpt_start_time = now;
-			ckpt_cached_elapsed = 0;
-
-			/*
-			 * Do the checkpoint.
-			 */
-			if (!do_restartpoint)
-			{
-				CreateCheckPoint(flags);
-				ckpt_performed = true;
-			}
-			else
-				ckpt_performed = CreateRestartPoint(flags);
-
-			/*
-			 * After any checkpoint, close all smgr files.	This is so we
-			 * won't hang onto smgr references to deleted files indefinitely.
-			 */
-			smgrcloseall();
-
-			/*
-			 * Indicate checkpoint completion to any waiting backends.
-			 */
-			SpinLockAcquire(&bgs->ckpt_lck);
-			bgs->ckpt_done = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (ckpt_performed)
-			{
-				/*
-				 * Note we record the checkpoint start time not end time as
-				 * last_checkpoint_time.  This is so that time-driven
-				 * checkpoints happen at a predictable spacing.
-				 */
-				last_checkpoint_time = now;
-			}
-			else
-			{
-				/*
-				 * We were not able to perform the restartpoint (checkpoints
-				 * throw an ERROR in case of error).  Most likely because we
-				 * have not received any new checkpoint WAL records since the
-				 * last restartpoint. Try again in 15 s.
-				 */
-				last_checkpoint_time = now - CheckPointTimeout + 15;
-			}
-
-			ckpt_active = false;
-		}
-		else
-			BgBufferSync();
-
-		/* Check for archive_timeout and switch xlog files if necessary. */
-		CheckArchiveTimeout();
+		BgBufferSync();
 
 		/* Nap for the configured time. */
 		BgWriterNap();
@@ -549,61 +277,6 @@ BackgroundWriterMain(void)
 }
 
 /*
- * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
- *
- * This will switch to a new WAL file and force an archive file write
- * if any activity is recorded in the current WAL file, including just
- * a single checkpoint record.
- */
-static void
-CheckArchiveTimeout(void)
-{
-	pg_time_t	now;
-	pg_time_t	last_time;
-
-	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
-		return;
-
-	now = (pg_time_t) time(NULL);
-
-	/* First we do a quick check using possibly-stale local state. */
-	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
-		return;
-
-	/*
-	 * Update local state ... note that last_xlog_switch_time is the last time
-	 * a switch was performed *or requested*.
-	 */
-	last_time = GetLastSegSwitchTime();
-
-	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
-
-	/* Now we can do the real check */
-	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
-	{
-		XLogRecPtr	switchpoint;
-
-		/* OK, it's time to switch */
-		switchpoint = RequestXLogSwitch();
-
-		/*
-		 * If the returned pointer points exactly to a segment boundary,
-		 * assume nothing happened.
-		 */
-		if ((switchpoint.xrecoff % XLogSegSize) != 0)
-			ereport(DEBUG1,
-				(errmsg("transaction log switch forced (archive_timeout=%d)",
-						XLogArchiveTimeout)));
-
-		/*
-		 * Update state in any case, so we don't retry constantly when the
-		 * system is idle.
-		 */
-		last_xlog_switch_time = now;
-	}
-}
-
-/*
  * BgWriterNap -- Nap for the configured time or until a signal is received.
  */
 static void
@@ -624,185 +297,24 @@ BgWriterNap(void)
 	 * respond reasonably promptly when someone signals us, break down the
 	 * sleep into 1-second increments, and check for interrupts after each
 	 * nap.
-	 *
-	 * We absorb pending requests after each short sleep.
 	 */
-	if (bgwriter_lru_maxpages > 0 || ckpt_active)
+	if (bgwriter_lru_maxpages > 0)
 		udelay = BgWriterDelay * 1000L;
-	else if (XLogArchiveTimeout > 0)
-		udelay = 1000000L;		/* One second */
 	else
 		udelay = 10000000L;		/* Ten seconds */
 
 	while (udelay > 999999L)
 	{
-		if (got_SIGHUP || shutdown_requested ||
-		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
+		if (got_SIGHUP || shutdown_requested)
 			break;
 		pg_usleep(1000000L);
-		AbsorbFsyncRequests();
 		udelay -= 1000000L;
 	}
 
-	if (!(got_SIGHUP || shutdown_requested ||
-	  (ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested)))
+	if (!(got_SIGHUP || shutdown_requested))
 		pg_usleep(udelay);
 }
 
-/*
- * Returns true if an immediate checkpoint request is pending.	(Note that
- * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
- * there is one pending behind it.)
- */
-static bool
-ImmediateCheckpointRequested(void)
-{
-	if (checkpoint_requested)
-	{
-		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-
-		/*
-		 * We don't need to acquire the ckpt_lck in this case because we're
-		 * only looking at a single flag bit.
-		 */
-		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
-			return true;
-	}
-	return false;
-}
-
-/*
- * CheckpointWriteDelay -- yield control to bgwriter during a checkpoint
- *
- * This function is called after each page write performed by BufferSync().
- * It is responsible for keeping the bgwriter's normal activities in
- * progress during a long checkpoint, and for throttling BufferSync()'s
- * write rate to hit checkpoint_completion_target.
- *
- * The checkpoint request flags should be passed in; currently the only one
- * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
- *
- * 'progress' is an estimate of how much of the work has been done, as a
- * fraction between 0.0 meaning none, and 1.0 meaning all done.
- */
-void
-CheckpointWriteDelay(int flags, double progress)
-{
-	static int	absorb_counter = WRITES_PER_ABSORB;
-
-	/* Do nothing if checkpoint is being executed by non-bgwriter process */
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * Perform the usual bgwriter duties and take a nap, unless we're behind
-	 * schedule, in which case we just try to catch up as quickly as possible.
-	 */
-	if (!(flags & CHECKPOINT_IMMEDIATE) &&
-		!shutdown_requested &&
-		!ImmediateCheckpointRequested() &&
-		IsCheckpointOnSchedule(progress))
-	{
-		if (got_SIGHUP)
-		{
-			got_SIGHUP = false;
-			ProcessConfigFile(PGC_SIGHUP);
-			/* update global shmem state for sync rep */
-			SyncRepUpdateSyncStandbysDefined();
-		}
-
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-
-		BgBufferSync();
-		CheckArchiveTimeout();
-		BgWriterNap();
-	}
-	else if (--absorb_counter <= 0)
-	{
-		/*
-		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
-		 * operations even when we don't sleep, to prevent overflow of the
-		 * fsync request queue.
-		 */
-		AbsorbFsyncRequests();
-		absorb_counter = WRITES_PER_ABSORB;
-	}
-}
-
-/*
- * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
- *		 in time?
- *
- * Compares the current progress against the time/segments elapsed since last
- * checkpoint, and returns true if the progress we've made this far is greater
- * than the elapsed time/segments.
- */
-static bool
-IsCheckpointOnSchedule(double progress)
-{
-	XLogRecPtr	recptr;
-	struct timeval now;
-	double		elapsed_xlogs,
-				elapsed_time;
-
-	Assert(ckpt_active);
-
-	/* Scale progress according to checkpoint_completion_target. */
-	progress *= CheckPointCompletionTarget;
-
-	/*
-	 * Check against the cached value first. Only do the more expensive
-	 * calculations once we reach the target previously calculated. Since
-	 * neither time or WAL insert pointer moves backwards, a freshly
-	 * calculated value can only be greater than or equal to the cached value.
-	 */
-	if (progress < ckpt_cached_elapsed)
-		return false;
-
-	/*
-	 * Check progress against WAL segments written and checkpoint_segments.
-	 *
-	 * We compare the current WAL insert location against the location
-	 * computed before calling CreateCheckPoint. The code in XLogInsert that
-	 * actually triggers a checkpoint when checkpoint_segments is exceeded
-	 * compares against RedoRecptr, so this is not completely accurate.
-	 * However, it's good enough for our purposes, we're only calculating an
-	 * estimate anyway.
-	 */
-	if (!RecoveryInProgress())
-	{
-		recptr = GetInsertRecPtr();
-		elapsed_xlogs =
-			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
-			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
-			CheckPointSegments;
-
-		if (progress < elapsed_xlogs)
-		{
-			ckpt_cached_elapsed = elapsed_xlogs;
-			return false;
-		}
-	}
-
-	/*
-	 * Check progress against time elapsed and checkpoint_timeout.
-	 */
-	gettimeofday(&now, NULL);
-	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
-					now.tv_usec / 1000000.0) / CheckPointTimeout;
-
-	if (progress < elapsed_time)
-	{
-		ckpt_cached_elapsed = elapsed_time;
-		return false;
-	}
-
-	/* It looks like we're on schedule. */
-	return true;
-}
-
-
 /* --------------------------------
  *		signal handler routines
  * --------------------------------
@@ -847,441 +359,9 @@ BgSigHupHandler(SIGNAL_ARGS)
 	got_SIGHUP = true;
 }
 
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	checkpoint_requested = true;
-}
-
 /* SIGUSR2: set flag to run a shutdown checkpoint and exit */
 static void
 ReqShutdownHandler(SIGNAL_ARGS)
 {
 	shutdown_requested = true;
 }
-
-
-/* --------------------------------
- *		communication with backends
- * --------------------------------
- */
-
-/*
- * BgWriterShmemSize
- *		Compute space needed for bgwriter-related shared memory
- */
-Size
-BgWriterShmemSize(void)
-{
-	Size		size;
-
-	/*
-	 * Currently, the size of the requests[] array is arbitrarily set equal to
-	 * NBuffers.  This may prove too large or small ...
-	 */
-	size = offsetof(BgWriterShmemStruct, requests);
-	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
-
-	return size;
-}
-
-/*
- * BgWriterShmemInit
- *		Allocate and initialize bgwriter-related shared memory
- */
-void
-BgWriterShmemInit(void)
-{
-	bool		found;
-
-	BgWriterShmem = (BgWriterShmemStruct *)
-		ShmemInitStruct("Background Writer Data",
-						BgWriterShmemSize(),
-						&found);
-
-	if (!found)
-	{
-		/* First time through, so initialize */
-		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
-		SpinLockInit(&BgWriterShmem->ckpt_lck);
-		BgWriterShmem->max_requests = NBuffers;
-	}
-}
-
-/*
- * RequestCheckpoint
- *		Called in backend processes to request a checkpoint
- *
- * flags is a bitwise OR of the following:
- *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
- *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
- *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
- *		ignoring checkpoint_completion_target parameter.
- *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occured
- *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
- *		CHECKPOINT_END_OF_RECOVERY).
- *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
- *		just signal bgwriter to do it, and return).
- *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
- *		(This affects logging, and in particular enables CheckPointWarning.)
- */
-void
-RequestCheckpoint(int flags)
-{
-	/* use volatile pointer to prevent code rearrangement */
-	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
-	int			ntries;
-	int			old_failed,
-				old_started;
-
-	/*
-	 * If in a standalone backend, just do it ourselves.
-	 */
-	if (!IsPostmasterEnvironment)
-	{
-		/*
-		 * There's no point in doing slow checkpoints in a standalone backend,
-		 * because there's no other backends the checkpoint could disrupt.
-		 */
-		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
-
-		/*
-		 * After any checkpoint, close all smgr files.	This is so we won't
-		 * hang onto smgr references to deleted files indefinitely.
-		 */
-		smgrcloseall();
-
-		return;
-	}
-
-	/*
-	 * Atomically set the request flags, and take a snapshot of the counters.
-	 * When we see ckpt_started > old_started, we know the flags we set here
-	 * have been seen by bgwriter.
-	 *
-	 * Note that we OR the flags with any existing flags, to avoid overriding
-	 * a "stronger" request by another backend.  The flag senses must be
-	 * chosen to make this work!
-	 */
-	SpinLockAcquire(&bgs->ckpt_lck);
-
-	old_failed = bgs->ckpt_failed;
-	old_started = bgs->ckpt_started;
-	bgs->ckpt_flags |= flags;
-
-	SpinLockRelease(&bgs->ckpt_lck);
-
-	/*
-	 * Send signal to request checkpoint.  It's possible that the bgwriter
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  Also, if not told to wait for the checkpoint to
-	 * occur, we consider failure to send the signal to be nonfatal and merely
-	 * LOG it.
-	 */
-	for (ntries = 0;; ntries++)
-	{
-		if (BgWriterShmem->bgwriter_pid == 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-				"could not request checkpoint because bgwriter not running");
-				break;
-			}
-		}
-		else if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
-		{
-			if (ntries >= 20)	/* max wait 2.0 sec */
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
-				break;
-			}
-		}
-		else
-			break;				/* signal sent successfully */
-
-		CHECK_FOR_INTERRUPTS();
-		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-	}
-
-	/*
-	 * If requested, wait for completion.  We detect completion according to
-	 * the algorithm given above.
-	 */
-	if (flags & CHECKPOINT_WAIT)
-	{
-		int			new_started,
-					new_failed;
-
-		/* Wait for a new checkpoint to start. */
-		for (;;)
-		{
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_started = bgs->ckpt_started;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_started != old_started)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		/*
-		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
-		 */
-		for (;;)
-		{
-			int			new_done;
-
-			SpinLockAcquire(&bgs->ckpt_lck);
-			new_done = bgs->ckpt_done;
-			new_failed = bgs->ckpt_failed;
-			SpinLockRelease(&bgs->ckpt_lck);
-
-			if (new_done - new_started >= 0)
-				break;
-
-			CHECK_FOR_INTERRUPTS();
-			pg_usleep(100000L);
-		}
-
-		if (new_failed != old_failed)
-			ereport(ERROR,
-					(errmsg("checkpoint request failed"),
-					 errhint("Consult recent messages in the server log for details.")));
-	}
-}
-
-/*
- * ForwardFsyncRequest
- *		Forward a file-fsync request from a backend to the bgwriter
- *
- * Whenever a backend is compelled to write directly to a relation
- * (which should be seldom, if the bgwriter is getting its job done),
- * the backend calls this routine to pass over knowledge that the relation
- * is dirty and must be fsync'd before next checkpoint.  We also use this
- * opportunity to count such writes for statistical purposes.
- *
- * segno specifies which segment (not block!) of the relation needs to be
- * fsync'd.  (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
- *
- * To avoid holding the lock for longer than necessary, we normally write
- * to the requests[] queue without checking for duplicates.  The bgwriter
- * will have to eliminate dups internally anyway.  However, if we discover
- * that the queue is full, we make a pass over the entire queue to compact
- * it.	This is somewhat expensive, but the alternative is for the backend
- * to perform its own fsync, which is far more expensive in practice.  It
- * is theoretically possible a backend fsync might still be necessary, if
- * the queue is full and contains no duplicate entries.  In that case, we
- * let the backend know by returning false.
- */
-bool
-ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
-					BlockNumber segno)
-{
-	BgWriterRequest *request;
-
-	if (!IsUnderPostmaster)
-		return false;			/* probably shouldn't even get here */
-
-	if (am_bg_writer)
-		elog(ERROR, "ForwardFsyncRequest must not be called in bgwriter");
-
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Count all backend writes regardless of if they fit in the queue */
-	BgWriterShmem->num_backend_writes++;
-
-	/*
-	 * If the background writer isn't running or the request queue is full,
-	 * the backend will have to perform its own fsync request.	But before
-	 * forcing that to happen, we can try to compact the background writer
-	 * request queue.
-	 */
-	if (BgWriterShmem->bgwriter_pid == 0 ||
-		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
-		 && !CompactBgwriterRequestQueue()))
-	{
-		/*
-		 * Count the subset of writes where backends have to do their own
-		 * fsync
-		 */
-		BgWriterShmem->num_backend_fsync++;
-		LWLockRelease(BgWriterCommLock);
-		return false;
-	}
-	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
-	request->rnode = rnode;
-	request->forknum = forknum;
-	request->segno = segno;
-	LWLockRelease(BgWriterCommLock);
-	return true;
-}
-
-/*
- * CompactBgwriterRequestQueue
- *		Remove duplicates from the request queue to avoid backend fsyncs.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen.  So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint.	Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
- *
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries.  But should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
- */
-static bool
-CompactBgwriterRequestQueue()
-{
-	struct BgWriterSlotMapping
-	{
-		BgWriterRequest request;
-		int			slot;
-	};
-
-	int			n,
-				preserve_count;
-	int			num_skipped = 0;
-	HASHCTL		ctl;
-	HTAB	   *htab;
-	bool	   *skip_slot;
-
-	/* must hold BgWriterCommLock in exclusive mode */
-	Assert(LWLockHeldByMe(BgWriterCommLock));
-
-	/* Initialize temporary hash table */
-	MemSet(&ctl, 0, sizeof(ctl));
-	ctl.keysize = sizeof(BgWriterRequest);
-	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
-	ctl.hash = tag_hash;
-	htab = hash_create("CompactBgwriterRequestQueue",
-					   BgWriterShmem->num_requests,
-					   &ctl,
-					   HASH_ELEM | HASH_FUNCTION);
-
-	/* Initialize skip_slot array */
-	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
-
-	/*
-	 * The basic idea here is that a request can be skipped if it's followed
-	 * by a later, identical request.  It might seem more sensible to work
-	 * backwards from the end of the queue and check whether a request is
-	 * *preceded* by an earlier, identical request, in the hopes of doing less
-	 * copying.  But that might change the semantics, if there's an
-	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
-	 * we do it this way.  It would be possible to be even smarter if we made
-	 * the code below understand the specific semantics of such requests (it
-	 * could blow away preceding entries that would end up being canceled
-	 * anyhow), but it's not clear that the extra complexity would buy us
-	 * anything.
-	 */
-	for (n = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		BgWriterRequest *request;
-		struct BgWriterSlotMapping *slotmap;
-		bool		found;
-
-		request = &BgWriterShmem->requests[n];
-		slotmap = hash_search(htab, request, HASH_ENTER, &found);
-		if (found)
-		{
-			skip_slot[slotmap->slot] = true;
-			++num_skipped;
-		}
-		slotmap->slot = n;
-	}
-
-	/* Done with the hash table. */
-	hash_destroy(htab);
-
-	/* If no duplicates, we're out of luck. */
-	if (!num_skipped)
-	{
-		pfree(skip_slot);
-		return false;
-	}
-
-	/* We found some duplicates; remove them. */
-	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
-	{
-		if (skip_slot[n])
-			continue;
-		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
-	}
-	ereport(DEBUG1,
-	   (errmsg("compacted fsync request queue from %d entries to %d entries",
-			   BgWriterShmem->num_requests, preserve_count)));
-	BgWriterShmem->num_requests = preserve_count;
-
-	/* Cleanup. */
-	pfree(skip_slot);
-	return true;
-}
-
-/*
- * AbsorbFsyncRequests
- *		Retrieve queued fsync requests and pass them to local smgr.
- *
- * This is exported because it must be called during CreateCheckPoint;
- * we have to be sure we have accepted all pending requests just before
- * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
- * non-bgwriter processes, do nothing if not bgwriter.
- */
-void
-AbsorbFsyncRequests(void)
-{
-	BgWriterRequest *requests = NULL;
-	BgWriterRequest *request;
-	int			n;
-
-	if (!am_bg_writer)
-		return;
-
-	/*
-	 * We have to PANIC if we fail to absorb all the pending requests (eg,
-	 * because our hashtable runs out of memory).  This is because the system
-	 * cannot run safely if we are unable to fsync what we have been told to
-	 * fsync.  Fortunately, the hashtable is so small that the problem is
-	 * quite unlikely to arise in practice.
-	 */
-	START_CRIT_SECTION();
-
-	/*
-	 * We try to avoid holding the lock for a long time by copying the request
-	 * array.
-	 */
-	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
-
-	/* Transfer write count into pending pgstats message */
-	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
-	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
-
-	BgWriterShmem->num_backend_writes = 0;
-	BgWriterShmem->num_backend_fsync = 0;
-
-	n = BgWriterShmem->num_requests;
-	if (n > 0)
-	{
-		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
-		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
-	}
-	BgWriterShmem->num_requests = 0;
-
-	LWLockRelease(BgWriterCommLock);
-
-	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
-
-	if (requests)
-		pfree(requests);
-
-	END_CRIT_SECTION();
-}
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
new file mode 100644
index 0000000..6c17c26
--- /dev/null
+++ b/src/backend/postmaster/checkpointer.c
@@ -0,0 +1,1236 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.c
+ *
+ * The checkpointer is new as of Postgres 9.2.  It handles all checkpoints.
+ * Checkpoints are automatically dispatched after a certain amount of time has
+ * elapsed since the last one, and it can be signaled to perform requested
+ * checkpoints as well.  (The GUC parameter that mandates a checkpoint every
+ * so many WAL segments is implemented by having backends signal when they
+ * fill WAL segments; the checkpointer itself doesn't watch for the
+ * condition.)
+ *
+ * The checkpointer is started by the postmaster as soon as the startup
+ * subprocess finishes, or as soon as recovery begins if we are doing archive
+ * recovery.  It remains alive until the postmaster commands it to terminate.
+ * Normal termination is by SIGUSR2, which instructs the checkpointer to
+ * execute a shutdown checkpoint and then exit(0).  (All backends must be
+ * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
+ * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ *
+ * If the checkpointer exits unexpectedly, the postmaster treats that the same
+ * as a backend crash: shared memory may be corrupted, so remaining backends
+ * should be killed by SIGQUIT and then a recovery cycle started.  (Even if
+ * shared memory isn't corrupted, we have lost information about which
+ * files need to be fsync'd for the next checkpoint, and so a system
+ * restart needs to be forced.)
+ *
+ *
+ * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/postmaster/checkpointer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <signal.h>
+#include <sys/time.h>
+#include <time.h>
+#include <unistd.h>
+
+#include "access/xlog_internal.h"
+#include "libpq/pqsignal.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "replication/syncrep.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/pmsignal.h"
+#include "storage/shmem.h"
+#include "storage/smgr.h"
+#include "storage/spin.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/resowner.h"
+
+
+/*----------
+ * Shared memory area for communication between checkpointer and backends
+ *
+ * The ckpt counters allow backends to watch for completion of a checkpoint
+ * request they send.  Here's how it works:
+ *	* At start of a checkpoint, checkpointer reads (and clears) the request flags
+ *	  and increments ckpt_started, while holding ckpt_lck.
+ *	* On completion of a checkpoint, checkpointer sets ckpt_done to
+ *	  equal ckpt_started.
+ *	* On failure of a checkpoint, checkpointer increments ckpt_failed
+ *	  and sets ckpt_done to equal ckpt_started.
+ *
+ * The algorithm for backends is:
+ *	1. Record current values of ckpt_failed and ckpt_started, and
+ *	   set request flags, while holding ckpt_lck.
+ *	2. Send signal to request checkpoint.
+ *	3. Sleep until ckpt_started changes.  Now you know a checkpoint has
+ *	   begun since you started this algorithm (although *not* that it was
+ *	   specifically initiated by your signal), and that it is using your flags.
+ *	4. Record new value of ckpt_started.
+ *	5. Sleep until ckpt_done >= saved value of ckpt_started.  (Use modulo
+ *	   arithmetic here in case counters wrap around.)  Now you know a
+ *	   checkpoint has started and completed, but not whether it was
+ *	   successful.
+ *	6. If ckpt_failed is different from the originally saved value,
+ *	   assume request failed; otherwise it was definitely successful.
+ *
+ * ckpt_flags holds the OR of the checkpoint request flags sent by all
+ * requesting backends since the last checkpoint start.  The flags are
+ * chosen so that OR'ing is the correct way to combine multiple requests.
+ *
+ * num_backend_writes is used to count the number of buffer writes performed
+ * by non-bgwriter processes.  This counter should be wide enough that it
+ * can't overflow during a single checkpointer cycle.  num_backend_fsync
+ * counts the subset of those writes that also had to do their own fsync,
+ * because the checkpointer failed to absorb their request.
+ *
+ * The requests array holds fsync requests sent by backends and not yet
+ * absorbed by the checkpointer.
+ *
+ * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
+ * the requests fields are protected by BgWriterCommLock.
+ *----------
+ */
+typedef struct
+{
+	RelFileNodeBackend rnode;
+	ForkNumber	forknum;
+	BlockNumber segno;			/* see md.c for special values */
+	/* might add a real request-type field later; not needed yet */
+} BgWriterRequest;
+
+typedef struct
+{
+	pid_t		checkpointer_pid;	/* PID of checkpointer (0 if not started) */
+
+	slock_t		ckpt_lck;		/* protects all the ckpt_* fields */
+
+	int			ckpt_started;	/* advances when checkpoint starts */
+	int			ckpt_done;		/* advances when checkpoint done */
+	int			ckpt_failed;	/* advances when checkpoint fails */
+
+	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
+
+	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
+	uint32		num_backend_fsync;		/* counts non-bgwriter fsync calls */
+
+	int			num_requests;	/* current # of requests */
+	int			max_requests;	/* allocated array size */
+	BgWriterRequest requests[1];	/* VARIABLE LENGTH ARRAY */
+} BgWriterShmemStruct;
+
+static BgWriterShmemStruct *BgWriterShmem;
+
+/* interval for calling AbsorbFsyncRequests in CheckpointWriteDelay */
+#define WRITES_PER_ABSORB		1000
+
+/*
+ * GUC parameters
+ */
+int			CheckPointTimeout = 300;
+int			CheckPointWarning = 30;
+double		CheckPointCompletionTarget = 0.5;
+
+/*
+ * Flags set by interrupt handlers for later service in the main loop.
+ */
+static volatile sig_atomic_t got_SIGHUP = false;
+static volatile sig_atomic_t checkpoint_requested = false;
+static volatile sig_atomic_t shutdown_requested = false;
+
+/*
+ * Private state
+ */
+static bool am_checkpointer = false;
+
+static bool ckpt_active = false;
+
+/* these values are valid when ckpt_active is true: */
+static pg_time_t ckpt_start_time;
+static XLogRecPtr ckpt_start_recptr;
+static double ckpt_cached_elapsed;
+
+static pg_time_t last_checkpoint_time;
+static pg_time_t last_xlog_switch_time;
+
+/* Prototypes for private functions */
+
+static void CheckArchiveTimeout(void);
+static bool IsCheckpointOnSchedule(double progress);
+static bool ImmediateCheckpointRequested(void);
+static bool CompactCheckpointerRequestQueue(void);
+
+/* Signal handlers */
+
+static void chkpt_quickdie(SIGNAL_ARGS);
+static void ChkptSigHupHandler(SIGNAL_ARGS);
+static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownHandler(SIGNAL_ARGS);
+
+
+/*
+ * Main entry point for checkpointer process
+ *
+ * This is invoked from AuxiliaryProcessMain, which has already created the basic
+ * execution environment, but not enabled signals yet.
+ */
+void
+CheckpointerMain(void)
+{
+	sigjmp_buf	local_sigjmp_buf;
+	MemoryContext checkpointer_context;
+
+	BgWriterShmem->checkpointer_pid = MyProcPid;
+	am_checkpointer = true;
+
+	/*
+	 * If possible, make this process a group leader, so that the postmaster
+	 * can signal any child processes too.	(checkpointer probably never has any
+	 * child processes, but for consistency we make all postmaster child
+	 * processes do this.)
+	 */
+#ifdef HAVE_SETSID
+	if (setsid() < 0)
+		elog(FATAL, "setsid() failed: %m");
+#endif
+
+	/*
+	 * Properly accept or ignore signals the postmaster might send us
+	 *
+	 * Note: we deliberately ignore SIGTERM, because during a standard Unix
+	 * system shutdown cycle, init will SIGTERM all processes at once.	We
+	 * want to wait for the backends to exit, whereupon the postmaster will
+	 * tell us it's okay to shut down (via SIGUSR2).
+	 *
+	 * SIGUSR1 is presently unused; keep it spare in case someday we want this
+	 * process to participate in ProcSignal signalling.
+	 */
+	pqsignal(SIGHUP, ChkptSigHupHandler);	/* set flag to read config file */
+	pqsignal(SIGINT, ReqCheckpointHandler);	/* request checkpoint */
+	pqsignal(SIGTERM, SIG_IGN);				/* ignore SIGTERM */
+	pqsignal(SIGQUIT, chkpt_quickdie);		/* hard crash time */
+	pqsignal(SIGALRM, SIG_IGN);
+	pqsignal(SIGPIPE, SIG_IGN);
+	pqsignal(SIGUSR1, SIG_IGN); /* reserve for ProcSignal */
+	pqsignal(SIGUSR2, ReqShutdownHandler);		/* request shutdown */
+
+	/*
+	 * Reset some signals that are accepted by postmaster but not here
+	 */
+	pqsignal(SIGCHLD, SIG_DFL);
+	pqsignal(SIGTTIN, SIG_DFL);
+	pqsignal(SIGTTOU, SIG_DFL);
+	pqsignal(SIGCONT, SIG_DFL);
+	pqsignal(SIGWINCH, SIG_DFL);
+
+	/* We allow SIGQUIT (quickdie) at all times */
+	sigdelset(&BlockSig, SIGQUIT);
+
+	/*
+	 * Initialize so that first time-driven event happens at the correct time.
+	 */
+	last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
+
+	/*
+	 * Create a resource owner to keep track of our resources (currently only
+	 * buffer pins).
+	 */
+	CurrentResourceOwner = ResourceOwnerCreate(NULL, "Checkpointer");
+
+	/*
+	 * Create a memory context that we will do all our work in.  We do this so
+	 * that we can reset the context during error recovery and thereby avoid
+	 * possible memory leaks.  Formerly this code just ran in
+	 * TopMemoryContext, but resetting that would be a really bad idea.
+	 */
+	checkpointer_context = AllocSetContextCreate(TopMemoryContext,
+											 "Checkpointer",
+											 ALLOCSET_DEFAULT_MINSIZE,
+											 ALLOCSET_DEFAULT_INITSIZE,
+											 ALLOCSET_DEFAULT_MAXSIZE);
+	MemoryContextSwitchTo(checkpointer_context);
+
+	/*
+	 * If an exception is encountered, processing resumes here.
+	 *
+	 * See notes in postgres.c about the design of this coding.
+	 */
+	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
+	{
+		/* Since not using PG_TRY, must reset error stack by hand */
+		error_context_stack = NULL;
+
+		/* Prevent interrupts while cleaning up */
+		HOLD_INTERRUPTS();
+
+		/* Report the error to the server log */
+		EmitErrorReport();
+
+		/*
+		 * These operations are really just a minimal subset of
+		 * AbortTransaction().	We don't have very many resources to worry
+		 * about in checkpointer, but we do have LWLocks, buffers, and temp files.
+		 */
+		LWLockReleaseAll();
+		AbortBufferIO();
+		UnlockBuffers();
+		/* buffer pins are released here: */
+		ResourceOwnerRelease(CurrentResourceOwner,
+							 RESOURCE_RELEASE_BEFORE_LOCKS,
+							 false, true);
+		/* we needn't bother with the other ResourceOwnerRelease phases */
+		AtEOXact_Buffers(false);
+		AtEOXact_Files();
+		AtEOXact_HashTables(false);
+
+		/* Warn any waiting backends that the checkpoint failed. */
+		if (ckpt_active)
+		{
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_failed++;
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			ckpt_active = false;
+		}
+
+		/*
+		 * Now return to normal top-level context and clear ErrorContext for
+		 * next time.
+		 */
+		MemoryContextSwitchTo(checkpointer_context);
+		FlushErrorState();
+
+		/* Flush any leaked data in the top-level context */
+		MemoryContextResetAndDeleteChildren(checkpointer_context);
+
+		/* Now we can allow interrupts again */
+		RESUME_INTERRUPTS();
+
+		/*
+		 * Sleep at least 1 second after any error.  A write error is likely
+		 * to be repeated, and we don't want to be filling the error logs as
+		 * fast as we can.
+		 */
+		pg_usleep(1000000L);
+
+		/*
+		 * Close all open files after any error.  This is helpful on Windows,
+		 * where holding deleted files open causes various strange errors.
+		 * It's not clear we need it elsewhere, but shouldn't hurt.
+		 */
+		smgrcloseall();
+	}
+
+	/* We can now handle ereport(ERROR) */
+	PG_exception_stack = &local_sigjmp_buf;
+
+	/*
+	 * Unblock signals (they were blocked when the postmaster forked us)
+	 */
+	PG_SETMASK(&UnBlockSig);
+
+	/*
+	 * Use the recovery target timeline ID during recovery
+	 */
+	if (RecoveryInProgress())
+		ThisTimeLineID = GetRecoveryTargetTLI();
+
+	/* Do this once before starting the loop, then just at SIGHUP time. */
+	SyncRepUpdateSyncStandbysDefined();
+
+	/*
+	 * Loop forever
+	 */
+	for (;;)
+	{
+		bool		do_checkpoint = false;
+		int			flags = 0;
+		pg_time_t	now;
+		int			elapsed_secs;
+
+		/*
+		 * Emergency bailout if postmaster has died.  This is to avoid the
+		 * necessity for manual cleanup of all postmaster children.
+		 */
+		if (!PostmasterIsAlive())
+			exit(1);
+
+		/*
+		 * Process any requests or signals received recently.
+		 */
+		AbsorbFsyncRequests();
+
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+		if (checkpoint_requested)
+		{
+			checkpoint_requested = false;
+			do_checkpoint = true;
+			BgWriterStats.m_requested_checkpoints++;
+		}
+		if (shutdown_requested)
+		{
+			/*
+			 * From here on, elog(ERROR) should end with exit(1), not send
+			 * control back to the sigsetjmp block above
+			 */
+			ExitOnAnyError = true;
+			/* Close down the database */
+			ShutdownXLOG(0, 0);
+			/* Normal exit from the checkpointer is here */
+			proc_exit(0);		/* done */
+		}
+
+		/*
+		 * Force a checkpoint if too much time has elapsed since the last one.
+		 * Note that we count a timed checkpoint in stats only when this
+		 * occurs without an external request, but we set the CAUSE_TIME flag
+		 * bit even if there is also an external request.
+		 */
+		now = (pg_time_t) time(NULL);
+		elapsed_secs = now - last_checkpoint_time;
+		if (elapsed_secs >= CheckPointTimeout)
+		{
+			if (!do_checkpoint)
+				BgWriterStats.m_timed_checkpoints++;
+			do_checkpoint = true;
+			flags |= CHECKPOINT_CAUSE_TIME;
+		}
+
+		/*
+		 * Do a checkpoint if requested.
+		 */
+		if (do_checkpoint)
+		{
+			bool		ckpt_performed = false;
+			bool		do_restartpoint;
+
+			/* use volatile pointer to prevent code rearrangement */
+			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+			/*
+			 * Check if we should perform a checkpoint or a restartpoint. As a
+			 * side-effect, RecoveryInProgress() initializes TimeLineID if
+			 * it's not set yet.
+			 */
+			do_restartpoint = RecoveryInProgress();
+
+			/*
+			 * Atomically fetch the request flags to figure out what kind of a
+			 * checkpoint we should perform, and increase the started-counter
+			 * to acknowledge that we've started a new checkpoint.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			flags |= bgs->ckpt_flags;
+			bgs->ckpt_flags = 0;
+			bgs->ckpt_started++;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			/*
+			 * The end-of-recovery checkpoint is a real checkpoint that's
+			 * performed while we're still in recovery.
+			 */
+			if (flags & CHECKPOINT_END_OF_RECOVERY)
+				do_restartpoint = false;
+
+			/*
+			 * We will warn if (a) too soon since last checkpoint (whatever
+			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
+			 * since the last checkpoint start.  Note in particular that this
+			 * implementation will not generate warnings caused by
+			 * CheckPointTimeout < CheckPointWarning.
+			 */
+			if (!do_restartpoint &&
+				(flags & CHECKPOINT_CAUSE_XLOG) &&
+				elapsed_secs < CheckPointWarning)
+				ereport(LOG,
+						(errmsg_plural("checkpoints are occurring too frequently (%d second apart)",
+				"checkpoints are occurring too frequently (%d seconds apart)",
+									   elapsed_secs,
+									   elapsed_secs),
+						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
+
+			/*
+			 * Initialize checkpointer-private variables used during checkpoint.
+			 */
+			ckpt_active = true;
+			if (!do_restartpoint)
+				ckpt_start_recptr = GetInsertRecPtr();
+			ckpt_start_time = now;
+			ckpt_cached_elapsed = 0;
+
+			/*
+			 * Do the checkpoint.
+			 */
+			if (!do_restartpoint)
+			{
+				CreateCheckPoint(flags);
+				ckpt_performed = true;
+			}
+			else
+				ckpt_performed = CreateRestartPoint(flags);
+
+			/*
+			 * After any checkpoint, close all smgr files.	This is so we
+			 * won't hang onto smgr references to deleted files indefinitely.
+			 */
+			smgrcloseall();
+
+			/*
+			 * Indicate checkpoint completion to any waiting backends.
+			 */
+			SpinLockAcquire(&bgs->ckpt_lck);
+			bgs->ckpt_done = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (ckpt_performed)
+			{
+				/*
+				 * Note we record the checkpoint start time not end time as
+				 * last_checkpoint_time.  This is so that time-driven
+				 * checkpoints happen at a predictable spacing.
+				 */
+				last_checkpoint_time = now;
+			}
+			else
+			{
+				/*
+				 * We were not able to perform the restartpoint (checkpoints
+				 * throw an ERROR in case of error).  Most likely because we
+				 * have not received any new checkpoint WAL records since the
+				 * last restartpoint. Try again in 15 s.
+				 */
+				last_checkpoint_time = now - CheckPointTimeout + 15;
+			}
+
+			ckpt_active = false;
+		}
+
+		/*
+		 * Nap for a while and then loop again. Later patches will replace
+		 * this with a latch loop. Keep it simple now for clarity.
+		 * Relatively long sleep because the bgwriter does cleanup now.
+		 */
+		pg_usleep(500000L);
+
+		/* Check for archive_timeout and switch xlog files if necessary. */
+		CheckArchiveTimeout();
+	}
+}
+
+/*
+ * CheckArchiveTimeout -- check for archive_timeout and switch xlog files
+ *
+ * This will switch to a new WAL file and force an archive file write
+ * if any activity is recorded in the current WAL file, including just
+ * a single checkpoint record.
+ */
+static void
+CheckArchiveTimeout(void)
+{
+	pg_time_t	now;
+	pg_time_t	last_time;
+
+	if (XLogArchiveTimeout <= 0 || RecoveryInProgress())
+		return;
+
+	now = (pg_time_t) time(NULL);
+
+	/* First we do a quick check using possibly-stale local state. */
+	if ((int) (now - last_xlog_switch_time) < XLogArchiveTimeout)
+		return;
+
+	/*
+	 * Update local state ... note that last_xlog_switch_time is the last time
+	 * a switch was performed *or requested*.
+	 */
+	last_time = GetLastSegSwitchTime();
+
+	last_xlog_switch_time = Max(last_xlog_switch_time, last_time);
+
+	/* Now we can do the real check */
+	if ((int) (now - last_xlog_switch_time) >= XLogArchiveTimeout)
+	{
+		XLogRecPtr	switchpoint;
+
+		/* OK, it's time to switch */
+		switchpoint = RequestXLogSwitch();
+
+		/*
+		 * If the returned pointer points exactly to a segment boundary,
+		 * assume nothing happened.
+		 */
+		if ((switchpoint.xrecoff % XLogSegSize) != 0)
+			ereport(DEBUG1,
+				(errmsg("transaction log switch forced (archive_timeout=%d)",
+						XLogArchiveTimeout)));
+
+		/*
+		 * Update state in any case, so we don't retry constantly when the
+		 * system is idle.
+		 */
+		last_xlog_switch_time = now;
+	}
+}
+
+/*
+ * Returns true if an immediate checkpoint request is pending.	(Note that
+ * this does not check the *current* checkpoint's IMMEDIATE flag, but whether
+ * there is one pending behind it.)
+ */
+static bool
+ImmediateCheckpointRequested(void)
+{
+	if (checkpoint_requested)
+	{
+		volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+
+		/*
+		 * We don't need to acquire the ckpt_lck in this case because we're
+		 * only looking at a single flag bit.
+		 */
+		if (bgs->ckpt_flags & CHECKPOINT_IMMEDIATE)
+			return true;
+	}
+	return false;
+}
+
+/*
+ * CheckpointWriteDelay -- control rate of checkpoint
+ *
+ * This function is called after each page write performed by BufferSync().
+ * It is responsible for throttling BufferSync()'s write rate to hit
+ * checkpoint_completion_target.
+ *
+ * The checkpoint request flags should be passed in; currently the only one
+ * examined is CHECKPOINT_IMMEDIATE, which disables delays between writes.
+ *
+ * 'progress' is an estimate of how much of the work has been done, as a
+ * fraction between 0.0 meaning none, and 1.0 meaning all done.
+ */
+void
+CheckpointWriteDelay(int flags, double progress)
+{
+	static int	absorb_counter = WRITES_PER_ABSORB;
+
+	/* Do nothing if checkpoint is being executed by non-checkpointer process */
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * Perform the usual duties and take a nap, unless we're behind
+	 * schedule, in which case we just try to catch up as quickly as possible.
+	 */
+	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!shutdown_requested &&
+		!ImmediateCheckpointRequested() &&
+		IsCheckpointOnSchedule(progress))
+	{
+		if (got_SIGHUP)
+		{
+			got_SIGHUP = false;
+			ProcessConfigFile(PGC_SIGHUP);
+			/* update global shmem state for sync rep */
+			SyncRepUpdateSyncStandbysDefined();
+		}
+
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+
+		CheckArchiveTimeout();
+
+		/*
+		 * Checkpoint sleep used to be connected to bgwriter_delay at 200ms.
+		 * That resulted in more frequent wakeups if not much work to do.
+		 * Checkpointer and bgwriter are no longer related so take the Big Sleep.
+		 */
+		pg_usleep(100000L);
+	}
+	else if (--absorb_counter <= 0)
+	{
+		/*
+		 * Absorb pending fsync requests after each WRITES_PER_ABSORB write
+		 * operations even when we don't sleep, to prevent overflow of the
+		 * fsync request queue.
+		 */
+		AbsorbFsyncRequests();
+		absorb_counter = WRITES_PER_ABSORB;
+	}
+}
+
+/*
+ * IsCheckpointOnSchedule -- are we on schedule to finish this checkpoint
+ *		 in time?
+ *
+ * Compares the current progress against the time/segments elapsed since last
+ * checkpoint, and returns true if the progress we've made so far is greater
+ * than the elapsed time/segments.
+ */
+static bool
+IsCheckpointOnSchedule(double progress)
+{
+	XLogRecPtr	recptr;
+	struct timeval now;
+	double		elapsed_xlogs,
+				elapsed_time;
+
+	Assert(ckpt_active);
+
+	/* Scale progress according to checkpoint_completion_target. */
+	progress *= CheckPointCompletionTarget;
+
+	/*
+	 * Check against the cached value first. Only do the more expensive
+	 * calculations once we reach the target previously calculated. Since
+	 * neither time nor WAL insert pointer moves backwards, a freshly
+	 * calculated value can only be greater than or equal to the cached value.
+	 */
+	if (progress < ckpt_cached_elapsed)
+		return false;
+
+	/*
+	 * Check progress against WAL segments written and checkpoint_segments.
+	 *
+	 * We compare the current WAL insert location against the location
+	 * computed before calling CreateCheckPoint. The code in XLogInsert that
+	 * actually triggers a checkpoint when checkpoint_segments is exceeded
+	 * compares against RedoRecptr, so this is not completely accurate.
+	 * However, it's good enough for our purposes, we're only calculating an
+	 * estimate anyway.
+	 */
+	if (!RecoveryInProgress())
+	{
+		recptr = GetInsertRecPtr();
+		elapsed_xlogs =
+			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
+			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
+			CheckPointSegments;
+
+		if (progress < elapsed_xlogs)
+		{
+			ckpt_cached_elapsed = elapsed_xlogs;
+			return false;
+		}
+	}
+
+	/*
+	 * Check progress against time elapsed and checkpoint_timeout.
+	 */
+	gettimeofday(&now, NULL);
+	elapsed_time = ((double) ((pg_time_t) now.tv_sec - ckpt_start_time) +
+					now.tv_usec / 1000000.0) / CheckPointTimeout;
+
+	if (progress < elapsed_time)
+	{
+		ckpt_cached_elapsed = elapsed_time;
+		return false;
+	}
+
+	/* It looks like we're on schedule. */
+	return true;
+}
+
+
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/*
+ * chkpt_quickdie() occurs when signalled SIGQUIT by the postmaster.
+ *
+ * Some backend has bought the farm,
+ * so we need to stop what we're doing and exit.
+ */
+static void
+chkpt_quickdie(SIGNAL_ARGS)
+{
+	PG_SETMASK(&BlockSig);
+
+	/*
+	 * We DO NOT want to run proc_exit() callbacks -- we're here because
+	 * shared memory may be corrupted, so we don't want to try to clean up our
+	 * transaction.  Just nail the windows shut and get out of town.  Now that
+	 * there's an atexit callback to prevent third-party code from breaking
+	 * things by calling exit() directly, we have to reset the callbacks
+	 * explicitly to make this work as intended.
+	 */
+	on_exit_reset();
+
+	/*
+	 * Note we do exit(2) not exit(0).	This is to force the postmaster into a
+	 * system reset cycle if some idiot DBA sends a manual SIGQUIT to a random
+	 * backend.  This is necessary precisely because we don't clean up our
+	 * shared memory state.  (The "dead man switch" mechanism in pmsignal.c
+	 * should ensure the postmaster sees this as a crash, too, but no harm in
+	 * being doubly sure.)
+	 */
+	exit(2);
+}
+
+/* SIGHUP: set flag to re-read config file at next convenient time */
+static void
+ChkptSigHupHandler(SIGNAL_ARGS)
+{
+	got_SIGHUP = true;
+}
+
+/* SIGINT: set flag to run a normal checkpoint right away */
+static void
+ReqCheckpointHandler(SIGNAL_ARGS)
+{
+	checkpoint_requested = true;
+}
+
+/* SIGUSR2: set flag to run a shutdown checkpoint and exit */
+static void
+ReqShutdownHandler(SIGNAL_ARGS)
+{
+	shutdown_requested = true;
+}
+
+
+/* --------------------------------
+ *		communication with backends
+ * --------------------------------
+ */
+
+/*
+ * BgWriterShmemSize
+ *		Compute space needed for bgwriter-related shared memory
+ */
+Size
+BgWriterShmemSize(void)
+{
+	Size		size;
+
+	/*
+	 * Currently, the size of the requests[] array is arbitrarily set equal to
+	 * NBuffers.  This may prove too large or small ...
+	 */
+	size = offsetof(BgWriterShmemStruct, requests);
+	size = add_size(size, mul_size(NBuffers, sizeof(BgWriterRequest)));
+
+	return size;
+}
+
+/*
+ * BgWriterShmemInit
+ *		Allocate and initialize bgwriter-related shared memory
+ */
+void
+BgWriterShmemInit(void)
+{
+	bool		found;
+
+	BgWriterShmem = (BgWriterShmemStruct *)
+		ShmemInitStruct("Background Writer Data",
+						BgWriterShmemSize(),
+						&found);
+
+	if (!found)
+	{
+		/* First time through, so initialize */
+		MemSet(BgWriterShmem, 0, sizeof(BgWriterShmemStruct));
+		SpinLockInit(&BgWriterShmem->ckpt_lck);
+		BgWriterShmem->max_requests = NBuffers;
+	}
+}
+
+/*
+ * RequestCheckpoint
+ *		Called in backend processes to request a checkpoint
+ *
+ * flags is a bitwise OR of the following:
+ *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
+ *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
+ *		ignoring checkpoint_completion_target parameter.
+ *	CHECKPOINT_FORCE: force a checkpoint even if no XLOG activity has occurred
+ *		since the last one (implied by CHECKPOINT_IS_SHUTDOWN or
+ *		CHECKPOINT_END_OF_RECOVERY).
+ *	CHECKPOINT_WAIT: wait for completion before returning (otherwise,
+ *		just signal checkpointer to do it, and return).
+ *	CHECKPOINT_CAUSE_XLOG: checkpoint is requested due to xlog filling.
+ *		(This affects logging, and in particular enables CheckPointWarning.)
+ */
+void
+RequestCheckpoint(int flags)
+{
+	/* use volatile pointer to prevent code rearrangement */
+	volatile BgWriterShmemStruct *bgs = BgWriterShmem;
+	int			ntries;
+	int			old_failed,
+				old_started;
+
+	/*
+	 * If in a standalone backend, just do it ourselves.
+	 */
+	if (!IsPostmasterEnvironment)
+	{
+		/*
+		 * There's no point in doing slow checkpoints in a standalone backend,
+		 * because there are no other backends the checkpoint could disrupt.
+		 */
+		CreateCheckPoint(flags | CHECKPOINT_IMMEDIATE);
+
+		/*
+		 * After any checkpoint, close all smgr files.	This is so we won't
+		 * hang onto smgr references to deleted files indefinitely.
+		 */
+		smgrcloseall();
+
+		return;
+	}
+
+	/*
+	 * Atomically set the request flags, and take a snapshot of the counters.
+	 * When we see ckpt_started > old_started, we know the flags we set here
+	 * have been seen by checkpointer.
+	 *
+	 * Note that we OR the flags with any existing flags, to avoid overriding
+	 * a "stronger" request by another backend.  The flag senses must be
+	 * chosen to make this work!
+	 */
+	SpinLockAcquire(&bgs->ckpt_lck);
+
+	old_failed = bgs->ckpt_failed;
+	old_started = bgs->ckpt_started;
+	bgs->ckpt_flags |= flags;
+
+	SpinLockRelease(&bgs->ckpt_lck);
+
+	/*
+	 * Send signal to request checkpoint.  It's possible that the checkpointer
+	 * hasn't started yet, or is in process of restarting, so we will retry a
+	 * few times if needed.  Also, if not told to wait for the checkpoint to
+	 * occur, we consider failure to send the signal to be nonfatal and merely
+	 * LOG it.
+	 */
+	for (ntries = 0;; ntries++)
+	{
+		if (BgWriterShmem->checkpointer_pid == 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+				"could not request checkpoint because checkpointer not running");
+				break;
+			}
+		}
+		else if (kill(BgWriterShmem->checkpointer_pid, SIGINT) != 0)
+		{
+			if (ntries >= 20)	/* max wait 2.0 sec */
+			{
+				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
+					 "could not signal for checkpoint: %m");
+				break;
+			}
+		}
+		else
+			break;				/* signal sent successfully */
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
+	}
+
+	/*
+	 * If requested, wait for completion.  We detect completion according to
+	 * the algorithm given above.
+	 */
+	if (flags & CHECKPOINT_WAIT)
+	{
+		int			new_started,
+					new_failed;
+
+		/* Wait for a new checkpoint to start. */
+		for (;;)
+		{
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_started = bgs->ckpt_started;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_started != old_started)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		/*
+		 * We are waiting for ckpt_done >= new_started, in a modulo sense.
+		 */
+		for (;;)
+		{
+			int			new_done;
+
+			SpinLockAcquire(&bgs->ckpt_lck);
+			new_done = bgs->ckpt_done;
+			new_failed = bgs->ckpt_failed;
+			SpinLockRelease(&bgs->ckpt_lck);
+
+			if (new_done - new_started >= 0)
+				break;
+
+			CHECK_FOR_INTERRUPTS();
+			pg_usleep(100000L);
+		}
+
+		if (new_failed != old_failed)
+			ereport(ERROR,
+					(errmsg("checkpoint request failed"),
+					 errhint("Consult recent messages in the server log for details.")));
+	}
+}
+
+/*
+ * ForwardFsyncRequest
+ *		Forward a file-fsync request from a backend to the checkpointer
+ *
+ * Whenever a backend is compelled to write directly to a relation
+ * (which should be seldom, if the bgwriter is getting its job done),
+ * the backend calls this routine to pass over knowledge that the relation
+ * is dirty and must be fsync'd before next checkpoint.  We also use this
+ * opportunity to count such writes for statistical purposes.
+ *
+ * segno specifies which segment (not block!) of the relation needs to be
+ * fsync'd.  (Since the valid range is much less than BlockNumber, we can
+ * use high values for special flags; that's all internal to md.c, which
+ * see for details.)
+ *
+ * To avoid holding the lock for longer than necessary, we normally write
+ * to the requests[] queue without checking for duplicates.  The checkpointer
+ * will have to eliminate dups internally anyway.  However, if we discover
+ * that the queue is full, we make a pass over the entire queue to compact
+ * it.	This is somewhat expensive, but the alternative is for the backend
+ * to perform its own fsync, which is far more expensive in practice.  It
+ * is theoretically possible a backend fsync might still be necessary, if
+ * the queue is full and contains no duplicate entries.  In that case, we
+ * let the backend know by returning false.
+ */
+bool
+ForwardFsyncRequest(RelFileNodeBackend rnode, ForkNumber forknum,
+					BlockNumber segno)
+{
+	BgWriterRequest *request;
+
+	if (!IsUnderPostmaster)
+		return false;			/* probably shouldn't even get here */
+
+	if (am_checkpointer)
+		elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Count all backend writes regardless of whether they fit in the queue */
+	BgWriterShmem->num_backend_writes++;
+
+	/*
+	 * If the checkpointer isn't running or the request queue is full,
+	 * the backend will have to perform its own fsync request.	But before
+	 * forcing that to happen, we can try to compact the checkpointer's
+	 * request queue.
+	 */
+	if (BgWriterShmem->checkpointer_pid == 0 ||
+		(BgWriterShmem->num_requests >= BgWriterShmem->max_requests
+		 && !CompactCheckpointerRequestQueue()))
+	{
+		/*
+		 * Count the subset of writes where backends have to do their own
+		 * fsync
+		 */
+		BgWriterShmem->num_backend_fsync++;
+		LWLockRelease(BgWriterCommLock);
+		return false;
+	}
+	request = &BgWriterShmem->requests[BgWriterShmem->num_requests++];
+	request->rnode = rnode;
+	request->forknum = forknum;
+	request->segno = segno;
+	LWLockRelease(BgWriterCommLock);
+	return true;
+}
+
+/*
+ * CompactCheckpointerRequestQueue
+ *		Remove duplicates from the request queue to avoid backend fsyncs.
+ *
+ * Although a full fsync request queue is not common, it can lead to severe
+ * performance problems when it does happen.  So far, this situation has
+ * only been observed to occur when the system is under heavy write load,
+ * and especially during the "sync" phase of a checkpoint.	Without this
+ * logic, each backend begins doing an fsync for every block written, which
+ * gets very expensive and can slow down the whole system.
+ *
+ * Trying to do this every time the queue is full could lose if there
+ * aren't any removable entries.  But that should be vanishingly rare in
+ * practice: there's one queue entry per shared buffer.
+ */
+static bool
+CompactCheckpointerRequestQueue()
+{
+	struct BgWriterSlotMapping
+	{
+		BgWriterRequest request;
+		int			slot;
+	};
+
+	int			n,
+				preserve_count;
+	int			num_skipped = 0;
+	HASHCTL		ctl;
+	HTAB	   *htab;
+	bool	   *skip_slot;
+
+	/* must hold BgWriterCommLock in exclusive mode */
+	Assert(LWLockHeldByMe(BgWriterCommLock));
+
+	/* Initialize temporary hash table */
+	MemSet(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(BgWriterRequest);
+	ctl.entrysize = sizeof(struct BgWriterSlotMapping);
+	ctl.hash = tag_hash;
+	htab = hash_create("CompactCheckpointerRequestQueue",
+					   BgWriterShmem->num_requests,
+					   &ctl,
+					   HASH_ELEM | HASH_FUNCTION);
+
+	/* Initialize skip_slot array */
+	skip_slot = palloc0(sizeof(bool) * BgWriterShmem->num_requests);
+
+	/*
+	 * The basic idea here is that a request can be skipped if it's followed
+	 * by a later, identical request.  It might seem more sensible to work
+	 * backwards from the end of the queue and check whether a request is
+	 * *preceded* by an earlier, identical request, in the hopes of doing less
+	 * copying.  But that might change the semantics, if there's an
+	 * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
+	 * we do it this way.  It would be possible to be even smarter if we made
+	 * the code below understand the specific semantics of such requests (it
+	 * could blow away preceding entries that would end up being canceled
+	 * anyhow), but it's not clear that the extra complexity would buy us
+	 * anything.
+	 */
+	for (n = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		BgWriterRequest *request;
+		struct BgWriterSlotMapping *slotmap;
+		bool		found;
+
+		request = &BgWriterShmem->requests[n];
+		slotmap = hash_search(htab, request, HASH_ENTER, &found);
+		if (found)
+		{
+			skip_slot[slotmap->slot] = true;
+			++num_skipped;
+		}
+		slotmap->slot = n;
+	}
+
+	/* Done with the hash table. */
+	hash_destroy(htab);
+
+	/* If no duplicates, we're out of luck. */
+	if (!num_skipped)
+	{
+		pfree(skip_slot);
+		return false;
+	}
+
+	/* We found some duplicates; remove them. */
+	for (n = 0, preserve_count = 0; n < BgWriterShmem->num_requests; ++n)
+	{
+		if (skip_slot[n])
+			continue;
+		BgWriterShmem->requests[preserve_count++] = BgWriterShmem->requests[n];
+	}
+	ereport(DEBUG1,
+	   (errmsg("compacted fsync request queue from %d entries to %d entries",
+			   BgWriterShmem->num_requests, preserve_count)));
+	BgWriterShmem->num_requests = preserve_count;
+
+	/* Cleanup. */
+	pfree(skip_slot);
+	return true;
+}
+
+/*
+ * AbsorbFsyncRequests
+ *		Retrieve queued fsync requests and pass them to local smgr.
+ *
+ * This is exported because it must be called during CreateCheckPoint;
+ * we have to be sure we have accepted all pending requests just before
+ * we start fsync'ing.  Since CreateCheckPoint sometimes runs in
+ * non-checkpointer processes, do nothing if not checkpointer.
+ */
+void
+AbsorbFsyncRequests(void)
+{
+	BgWriterRequest *requests = NULL;
+	BgWriterRequest *request;
+	int			n;
+
+	if (!am_checkpointer)
+		return;
+
+	/*
+	 * We have to PANIC if we fail to absorb all the pending requests (eg,
+	 * because our hashtable runs out of memory).  This is because the system
+	 * cannot run safely if we are unable to fsync what we have been told to
+	 * fsync.  Fortunately, the hashtable is so small that the problem is
+	 * quite unlikely to arise in practice.
+	 */
+	START_CRIT_SECTION();
+
+	/*
+	 * We try to avoid holding the lock for a long time by copying the request
+	 * array.
+	 */
+	LWLockAcquire(BgWriterCommLock, LW_EXCLUSIVE);
+
+	/* Transfer write count into pending pgstats message */
+	BgWriterStats.m_buf_written_backend += BgWriterShmem->num_backend_writes;
+	BgWriterStats.m_buf_fsync_backend += BgWriterShmem->num_backend_fsync;
+
+	BgWriterShmem->num_backend_writes = 0;
+	BgWriterShmem->num_backend_fsync = 0;
+
+	n = BgWriterShmem->num_requests;
+	if (n > 0)
+	{
+		requests = (BgWriterRequest *) palloc(n * sizeof(BgWriterRequest));
+		memcpy(requests, BgWriterShmem->requests, n * sizeof(BgWriterRequest));
+	}
+	BgWriterShmem->num_requests = 0;
+
+	LWLockRelease(BgWriterCommLock);
+
+	for (request = requests; n > 0; request++, n--)
+		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+
+	if (requests)
+		pfree(requests);
+
+	END_CRIT_SECTION();
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a84d97..d057942 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -206,6 +206,7 @@ bool		restart_after_crash = true;
 /* PIDs of special child processes; 0 when not running */
 static pid_t StartupPID = 0,
 			BgWriterPID = 0,
+			CheckpointerPID = 0,
 			WalWriterPID = 0,
 			WalReceiverPID = 0,
 			AutoVacPID = 0,
@@ -277,7 +278,7 @@ typedef enum
 	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
 	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
-	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
+	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown ckpt */
 	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
@@ -463,6 +464,7 @@ static void ShmemBackendArrayRemove(Backend *bn);
 
 #define StartupDataBase()		StartChildProcess(StartupProcess)
 #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+#define StartCheckpointer()		StartChildProcess(CheckpointerProcess)
 #define StartWalWriter()		StartChildProcess(WalWriterProcess)
 #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
 
@@ -1015,8 +1017,8 @@ PostmasterMain(int argc, char *argv[])
 	 * CAUTION: when changing this list, check for side-effects on the signal
 	 * handling setup of child processes.  See tcop/postgres.c,
 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
-	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, and
-	 * postmaster/syslogger.c.
+	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c,
+	 * postmaster/syslogger.c and postmaster/checkpointer.c.
 	 */
 	pqinitmask();
 	PG_SETMASK(&BlockSig);
@@ -1353,10 +1355,14 @@ ServerLoop(void)
 		 * state that prevents it, start one.  It doesn't matter if this
 		 * fails, we'll just try again later.
 		 */
-		if (BgWriterPID == 0 &&
-			(pmState == PM_RUN || pmState == PM_RECOVERY ||
-			 pmState == PM_HOT_STANDBY))
-			BgWriterPID = StartBackgroundWriter();
+		if (pmState == PM_RUN || pmState == PM_RECOVERY ||
+			 pmState == PM_HOT_STANDBY)
+		{
+			if (BgWriterPID == 0)
+				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
+		}
 
 		/*
 		 * Likewise, if we have lost the walwriter process, try to start a new
@@ -2034,6 +2040,8 @@ SIGHUP_handler(SIGNAL_ARGS)
 			signal_child(StartupPID, SIGHUP);
 		if (BgWriterPID != 0)
 			signal_child(BgWriterPID, SIGHUP);
+		if (CheckpointerPID != 0)
+			signal_child(CheckpointerPID, SIGHUP);
 		if (WalWriterPID != 0)
 			signal_child(WalWriterPID, SIGHUP);
 		if (WalReceiverPID != 0)
@@ -2106,6 +2114,8 @@ pmdie(SIGNAL_ARGS)
 				/* and the walwriter too */
 				if (WalWriterPID != 0)
 					signal_child(WalWriterPID, SIGTERM);
+				if (BgWriterPID != 0)
+					signal_child(BgWriterPID, SIGTERM);
 
 				/*
 				 * If we're in recovery, we can't kill the startup process
@@ -2146,9 +2156,11 @@ pmdie(SIGNAL_ARGS)
 				signal_child(StartupPID, SIGTERM);
 			if (WalReceiverPID != 0)
 				signal_child(WalReceiverPID, SIGTERM);
+			if (BgWriterPID != 0)
+				signal_child(BgWriterPID, SIGTERM);
 			if (pmState == PM_RECOVERY)
 			{
-				/* only bgwriter is active in this state */
+				/* only checkpointer is active in this state */
 				pmState = PM_WAIT_BACKENDS;
 			}
 			else if (pmState == PM_RUN ||
@@ -2193,6 +2205,8 @@ pmdie(SIGNAL_ARGS)
 				signal_child(StartupPID, SIGQUIT);
 			if (BgWriterPID != 0)
 				signal_child(BgWriterPID, SIGQUIT);
+			if (CheckpointerPID != 0)
+				signal_child(CheckpointerPID, SIGQUIT);
 			if (WalWriterPID != 0)
 				signal_child(WalWriterPID, SIGQUIT);
 			if (WalReceiverPID != 0)
@@ -2323,12 +2337,14 @@ reaper(SIGNAL_ARGS)
 			}
 
 			/*
-			 * Crank up the background writer, if we didn't do that already
+			 * Crank up background tasks, if we didn't do that already
 			 * when we entered consistent recovery state.  It doesn't matter
 			 * if this fails, we'll just try again later.
 			 */
 			if (BgWriterPID == 0)
 				BgWriterPID = StartBackgroundWriter();
+			if (CheckpointerPID == 0)
+				CheckpointerPID = StartCheckpointer();
 
 			/*
 			 * Likewise, start other special children as needed.  In a restart
@@ -2356,10 +2372,22 @@ reaper(SIGNAL_ARGS)
 		if (pid == BgWriterPID)
 		{
 			BgWriterPID = 0;
+			if (!EXIT_STATUS_0(exitstatus))
+				HandleChildCrash(pid, exitstatus,
+								 _("background writer process"));
+			continue;
+		}
+
+		/*
+		 * Was it the checkpointer?
+		 */
+		if (pid == CheckpointerPID)
+		{
+			CheckpointerPID = 0;
 			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
 			{
 				/*
-				 * OK, we saw normal exit of the bgwriter after it's been told
+				 * OK, we saw normal exit of the checkpointer after it's been told
 				 * to shut down.  We expect that it wrote a shutdown
 				 * checkpoint.	(If for some reason it didn't, recovery will
 				 * occur on next postmaster start.)
@@ -2396,11 +2424,11 @@ reaper(SIGNAL_ARGS)
 			else
 			{
 				/*
-				 * Any unexpected exit of the bgwriter (including FATAL exit)
+				 * Any unexpected exit of the checkpointer (including FATAL exit)
 				 * is treated as a crash.
 				 */
 				HandleChildCrash(pid, exitstatus,
-								 _("background writer process"));
+								 _("checkpointer process"));
 			}
 
 			continue;
@@ -2584,8 +2612,8 @@ CleanupBackend(int pid,
 }
 
 /*
- * HandleChildCrash -- cleanup after failed backend, bgwriter, walwriter,
- * or autovacuum.
+ * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
+ * walwriter or autovacuum.
  *
  * The objectives here are to clean up our local state about the child
  * process, and to signal all other remaining children to quickdie.
@@ -2678,6 +2706,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
 	}
 
+	/* Take care of the checkpointer too */
+	if (pid == CheckpointerPID)
+		CheckpointerPID = 0;
+	else if (CheckpointerPID != 0 && !FatalError)
+	{
+		ereport(DEBUG2,
+				(errmsg_internal("sending %s to process %d",
+								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+								 (int) CheckpointerPID)));
+		signal_child(CheckpointerPID, (SendStop ? SIGSTOP : SIGQUIT));
+	}
+
 	/* Take care of the walwriter too */
 	if (pid == WalWriterPID)
 		WalWriterPID = 0;
@@ -2857,9 +2897,10 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
-		 * (including autovac workers) and no walwriter or autovac launcher.
-		 * If we are doing crash recovery then we expect the bgwriter to exit
-		 * too, otherwise not.	The archiver, stats, and syslogger processes
+		 * (including autovac workers) and no walwriter, autovac launcher
+		 * or bgwriter.  If we are doing crash recovery then we expect the
+		 * checkpointer to exit as well, otherwise not.
+		 * The archiver, stats, and syslogger processes
 		 * are disregarded since they are not connected to shared memory; we
 		 * also disregard dead_end children here. Walsenders are also
 		 * disregarded, they will be terminated later after writing the
@@ -2868,7 +2909,8 @@ PostmasterStateMachine(void)
 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
 			StartupPID == 0 &&
 			WalReceiverPID == 0 &&
-			(BgWriterPID == 0 || !FatalError) &&
+			BgWriterPID == 0 &&
+			(CheckpointerPID == 0 || !FatalError) &&
 			WalWriterPID == 0 &&
 			AutoVacPID == 0)
 		{
@@ -2890,22 +2932,22 @@ PostmasterStateMachine(void)
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
 				 * the regular children are gone, and it's time to tell the
-				 * bgwriter to do a shutdown checkpoint.
+				 * checkpointer to do a shutdown checkpoint.
 				 */
 				Assert(Shutdown > NoShutdown);
-				/* Start the bgwriter if not running */
-				if (BgWriterPID == 0)
-					BgWriterPID = StartBackgroundWriter();
+				/* Start the checkpointer if not running */
+				if (CheckpointerPID == 0)
+					CheckpointerPID = StartCheckpointer();
 				/* And tell it to shut down */
-				if (BgWriterPID != 0)
+				if (CheckpointerPID != 0)
 				{
-					signal_child(BgWriterPID, SIGUSR2);
+					signal_child(CheckpointerPID, SIGUSR2);
 					pmState = PM_SHUTDOWN;
 				}
 				else
 				{
 					/*
-					 * If we failed to fork a bgwriter, just shut down. Any
+					 * If we failed to fork a checkpointer, just shut down. Any
 					 * required cleanup will happen at next restart. We set
 					 * FatalError so that an "abnormal shutdown" message gets
 					 * logged when we exit.
@@ -2964,6 +3006,7 @@ PostmasterStateMachine(void)
 			Assert(StartupPID == 0);
 			Assert(WalReceiverPID == 0);
 			Assert(BgWriterPID == 0);
+			Assert(CheckpointerPID == 0);
 			Assert(WalWriterPID == 0);
 			Assert(AutoVacPID == 0);
 			/* syslogger is not considered here */
@@ -4143,6 +4186,8 @@ sigusr1_handler(SIGNAL_ARGS)
 		 */
 		Assert(BgWriterPID == 0);
 		BgWriterPID = StartBackgroundWriter();
+		Assert(CheckpointerPID == 0);
+		CheckpointerPID = StartCheckpointer();
 
 		pmState = PM_RECOVERY;
 	}
@@ -4429,6 +4474,10 @@ StartChildProcess(AuxProcType type)
 				ereport(LOG,
 				   (errmsg("could not fork background writer process: %m")));
 				break;
+			case CheckpointerProcess:
+				ereport(LOG,
+				   (errmsg("could not fork checkpointer process: %m")));
+				break;
 			case WalWriterProcess:
 				ereport(LOG,
 						(errmsg("could not fork WAL writer process: %m")));
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8647edd..184e820 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1278,11 +1278,9 @@ BufferSync(int flags)
 					break;
 
 				/*
-				 * Perform normal bgwriter duties and sleep to throttle our
-				 * I/O rate.
+				 * Sleep to throttle our I/O rate.
 				 */
-				CheckpointWriteDelay(flags,
-									 (double) num_written / num_to_write);
+				CheckpointWriteDelay(flags, (double) num_written / num_to_write);
 			}
 		}
 
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 3015885..a761369 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -38,7 +38,7 @@
 /*
  * Special values for the segno arg to RememberFsyncRequest.
  *
- * Note that CompactBgwriterRequestQueue assumes that it's OK to remove an
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
  * See comments there before making changes here.
  */
@@ -77,7 +77,7 @@
  *	Inactive segments are those that once contained data but are currently
  *	not needed because of an mdtruncate() operation.  The reason for leaving
  *	them present at size zero, rather than unlinking them, is that other
- *	backends and/or the bgwriter might be holding open file references to
+ *	backends and/or the checkpointer might be holding open file references to
  *	such segments.	If the relation expands again after mdtruncate(), such
  *	that a deactivated segment becomes active again, it is important that
  *	such file references still be valid --- else data might get written
@@ -111,7 +111,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
 
 
 /*
- * In some contexts (currently, standalone backends and the bgwriter process)
+ * In some contexts (currently, standalone backends and the checkpointer process)
  * we keep track of pending fsync operations: we need to remember all relation
  * segments that have been written since the last checkpoint, so that we can
  * fsync them down to disk before completing the next checkpoint.  This hash
@@ -123,7 +123,7 @@ static MemoryContext MdCxt;		/* context for all md.c allocations */
  * a hash table, because we don't expect there to be any duplicate requests.
  *
  * (Regular backends do not track pending operations locally, but forward
- * them to the bgwriter.)
+ * them to the checkpointer.)
  */
 typedef struct
 {
@@ -194,7 +194,7 @@ mdinit(void)
 	 * Create pending-operations hashtable if we need it.  Currently, we need
 	 * it if we are standalone (not under a postmaster) OR if we are a
 	 * bootstrap-mode subprocess of a postmaster (that is, a startup or
-	 * bgwriter process).
+	 * checkpointer process).
 	 */
 	if (!IsUnderPostmaster || IsBootstrapProcessingMode())
 	{
@@ -214,10 +214,10 @@ mdinit(void)
 }
 
 /*
- * In archive recovery, we rely on bgwriter to do fsyncs, but we will have
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
  * already created the pendingOpsTable during initialization of the startup
  * process.  Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to bgwriter.
+ * subsequent requests will be forwarded to checkpointer.
  */
 void
 SetForwardFsyncRequests(void)
@@ -765,9 +765,9 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
 	 * NOTE: this assumption could only be wrong if another backend has
 	 * truncated the relation.	We rely on higher code levels to handle that
 	 * scenario by closing and re-opening the md fd, which is handled via
-	 * relcache flush.	(Since the bgwriter doesn't participate in relcache
+	 * relcache flush.	(Since the checkpointer doesn't participate in relcache
 	 * flush, it could have segment chain entries for inactive segments;
-	 * that's OK because the bgwriter never needs to compute relation size.)
+	 * that's OK because the checkpointer never needs to compute relation size.)
 	 */
 	while (v->mdfd_chain != NULL)
 	{
@@ -957,7 +957,7 @@ mdsync(void)
 		elog(ERROR, "cannot sync without a pendingOpsTable");
 
 	/*
-	 * If we are in the bgwriter, the sync had better include all fsync
+	 * If we are in the checkpointer, the sync had better include all fsync
 	 * requests that were queued by backends up to this point.	The tightest
 	 * race condition that could occur is that a buffer that must be written
 	 * and fsync'd for the checkpoint could have been dumped by a backend just
@@ -1033,7 +1033,7 @@ mdsync(void)
 			int			failures;
 
 			/*
-			 * If in bgwriter, we want to absorb pending requests every so
+			 * If in checkpointer, we want to absorb pending requests every so
 			 * often to prevent overflow of the fsync request queue.  It is
 			 * unspecified whether newly-added entries will be visited by
 			 * hash_seq_search, but we don't care since we don't need to
@@ -1070,9 +1070,9 @@ mdsync(void)
 				 * say "but an unreferenced SMgrRelation is still a leak!" Not
 				 * really, because the only case in which a checkpoint is done
 				 * by a process that isn't about to shut down is in the
-				 * bgwriter, and it will periodically do smgrcloseall(). This
+				 * checkpointer, and it will periodically do smgrcloseall(). This
 				 * fact justifies our not closing the reln in the success path
-				 * either, which is a good thing since in non-bgwriter cases
+				 * either, which is a good thing since in non-checkpointer cases
 				 * we couldn't safely do that.)  Furthermore, in many cases
 				 * the relation will have been dirtied through this same smgr
 				 * relation, and so we can save a file open/close cycle.
@@ -1301,7 +1301,7 @@ register_unlink(RelFileNodeBackend rnode)
 	else
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the request
+		 * Notify the checkpointer about it.  If we fail to queue the request
 		 * message, we have to sleep and try again, because we can't simply
 		 * delete the file now.  Ugly, but hopefully won't happen often.
 		 *
@@ -1315,10 +1315,10 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from bgwriter side of fsync request
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
  *
  * We stuff most fsync requests into the local hash table for execution
- * during the bgwriter's next checkpoint.  UNLINK requests go into a
+ * during the checkpointer's next checkpoint.  UNLINK requests go into a
  * separate linked list, however, because they get processed separately.
  *
  * The range of possible segment numbers is way less than the range of
@@ -1460,20 +1460,20 @@ ForgetRelationFsyncRequests(RelFileNodeBackend rnode, ForkNumber forknum)
 	else if (IsUnderPostmaster)
 	{
 		/*
-		 * Notify the bgwriter about it.  If we fail to queue the revoke
+		 * Notify the checkpointer about it.  If we fail to queue the revoke
 		 * message, we have to sleep and try again ... ugly, but hopefully
 		 * won't happen often.
 		 *
 		 * XXX should we CHECK_FOR_INTERRUPTS in this loop?  Escaping with an
 		 * error would leave the no-longer-used file still present on disk,
-		 * which would be bad, so I'm inclined to assume that the bgwriter
+		 * which would be bad, so I'm inclined to assume that the checkpointer
 		 * will always empty the queue soon.
 		 */
 		while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
 			pg_usleep(10000L);	/* 10 msec seems a good number */
 
 		/*
-		 * Note we don't wait for the bgwriter to actually absorb the revoke
+		 * Note we don't wait for the checkpointer to actually absorb the revoke
 		 * message; see mdsync() for the implications.
 		 */
 	}
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 4eaa243..cb43879 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -256,7 +256,7 @@ typedef struct RmgrData
 extern const RmgrData RmgrTable[];
 
 /*
- * Exported to support xlog switching from bgwriter
+ * Exported to support xlog switching from checkpointer
  */
 extern pg_time_t GetLastSegSwitchTime(void);
 extern XLogRecPtr RequestXLogSwitch(void);
diff --git a/src/include/bootstrap/bootstrap.h b/src/include/bootstrap/bootstrap.h
index cee9bd1..6153b7a 100644
--- a/src/include/bootstrap/bootstrap.h
+++ b/src/include/bootstrap/bootstrap.h
@@ -22,6 +22,7 @@ typedef enum
 	BootstrapProcess,
 	StartupProcess,
 	BgWriterProcess,
+	CheckpointerProcess,
 	WalWriterProcess,
 	WalReceiverProcess,
 
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index eaf2206..c05901e 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -23,6 +23,7 @@ extern int	CheckPointWarning;
 extern double CheckPointCompletionTarget;
 
 extern void BackgroundWriterMain(void);
+extern void CheckpointerMain(void);
 
 extern void RequestCheckpoint(int flags);
 extern void CheckpointWriteDelay(int flags, double progress);
diff --git a/src/include/storage/proc.h b/src/include/storage/proc.h
index 46ec625..6e798b1 100644
--- a/src/include/storage/proc.h
+++ b/src/include/storage/proc.h
@@ -190,11 +190,11 @@ extern PROC_HDR *ProcGlobal;
  * We set aside some extra PGPROC structures for auxiliary processes,
  * ie things that aren't full-fledged backends but need shmem access.
  *
- * Background writer and WAL writer run during normal operation. Startup
- * process and WAL receiver also consume 2 slots, but WAL writer is
- * launched only after startup has exited, so we only need 3 slots.
+ * Background writer, checkpointer and WAL writer run during normal operation.
+ * Startup process and WAL receiver also consume 2 slots, but WAL writer is
+ * launched only after startup has exited, so we only need 4 slots.
  */
-#define NUM_AUXILIARY_PROCS		3
+#define NUM_AUXILIARY_PROCS		4
 
 
 /* configurable options */
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index 2a27e0b..d5afe01 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -19,7 +19,7 @@
 
 /*
  * Reasons for signalling a Postgres child process (a backend or an auxiliary
- * process, like bgwriter).  We can cope with concurrent signals for different
+ * process, like checkpointer).  We can cope with concurrent signals for different
  * reasons.  However, if the same reason is signaled multiple times in quick
  * succession, the process is likely to observe only one notification of it.
  * This is okay for the present uses.

#25

Dickson S. Guedes

listas@guedesoft.net

over 14 years ago

In reply to: Simon Riggs (#24)

Re: Separating bgwriter and checkpointer

2011/10/4 Simon Riggs <simon@2ndquadrant.com>:

The problem is the *same* one I fixed in v2, yet now I see I managed
to somehow exclude that fix from the earlier patch. Slap. Anyway,
fixed again now.

Ah, OK! I started reviewing the v4 version of the patch; these are my comments:

Submission review
=================

1. The patch applies cleanly to current master (fa56a0c3e) but isn't
in context diff format;

Feature test
============

1. After applying the patch and running make install, I can see the
expected processes: writer and checkpointer;

2. I did the following tests with the following results:

2.1 Running pgbench for a long time did not trigger any assertion failure
or crash, and checkpoints work as they did before the patch:

LOG: checkpoint starting: xlog
LOG: checkpoint complete: wrote 300 buffers (9.8%); 0 transaction
log file(s) added, 0 removed, 0 recycled; write=26.103 s, sync=6.492
s, total=34.000 s; sync files=13, longest=4.684 s, average=0.499 s
LOG: checkpoint starting: time
LOG: checkpoint complete: wrote 257 buffers (8.4%); 0 transaction
log file(s) added, 0 removed, 3 recycled; write=21.819 s, sync=9.610
s, total=32.076 s; sync files=7, longest=6.452 s, average=1.372 s

2.2 Forcing a checkpoint while pgbench is running works fine when the
filesystem has enough free space:

LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 1605 buffers (52.2%); 0 transaction
log file(s) added, 0 removed, 2 recycled; write=0.344 s, sync=22.750
s, total=23.700 s; sync files=10, longest=15.586 s, average=2.275 s

2.3 Forcing a checkpoint when the filesystem is full works as expected:

LOG: checkpoint starting: immediate force wait time
ERROR: could not write to file "pg_xlog/xlogtemp.4380": No space left
on device
ERROR: checkpoint request failed
HINT: Consult recent messages in the server log for details.
STATEMENT: CHECKPOINT ;
...
ERROR: could not extend file "base/16384/16405": wrote only 4096 of
8192 bytes at block 10
HINT: Check free disk space.
STATEMENT: INSERT INTO pgbench_history (tid, bid, aid, delta,
mtime) VALUES (69, 3, 609672, -3063, CURRENT_TIMESTAMP);
PANIC: could not write to file "pg_xlog/xlogtemp.4528": No space left
on device
STATEMENT: END;
LOG: server process (PID 4528) was terminated by signal 6: Aborted

Then I freed some space and started it again and the server ran properly:

LOG: database system was shut down at 2011-10-05 00:46:33 BRT
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
...
LOG: checkpoint starting: immediate force wait
LOG: checkpoint complete: wrote 0 buffers (0.0%); 0 transaction
log file(s) added, 0 removed, 0 recycled; write=0.000 s, sync=0.000 s,
total=0.181 s; sync files=0, longest=0.000 s, average=0.000 s

2.4 Running pgbench and interrupting the postmaster a few seconds later
seems to work as expected, producing this output:

... cut ...
LOG: statement: SELECT abalance FROM pgbench_accounts WHERE aid = 148253;
^CLOG: statement: UPDATE pgbench_tellers SET tbalance = tbalance +
934 WHERE tid = 85;
DEBUG: postmaster received signal 2
LOG: received fast shutdown request
LOG: aborting any active transactions
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
... cut ...
LOG: disconnection: session time: 0:00:14.917 user=guedes
database=bench host=[local]
LOG: disconnection: session time: 0:00:14.773 user=guedes
database=bench host=[local]
DEBUG: server process (PID 1910) exited with exit code 1
DEBUG: server process (PID 1941) exited with exit code 1
LOG: shutting down
LOG: checkpoint starting: shutdown immediate
DEBUG: SlruScanDirectory invoking callback on pg_multixact/offsets/0000
DEBUG: SlruScanDirectory invoking callback on pg_multixact/members/0000
DEBUG: checkpoint sync: number=1 file=base/16384/16398 time=1075.227 msec
DEBUG: checkpoint sync: number=2 file=base/16384/16406 time=16.832 msec
DEBUG: checkpoint sync: number=3 file=base/16384/16397 time=12306.204 msec
DEBUG: checkpoint sync: number=4 file=base/16384/16397.1 time=2122.141 msec
DEBUG: checkpoint sync: number=5 file=base/16384/16406_fsm time=32.278 msec
DEBUG: checkpoint sync: number=6 file=base/16384/16385_fsm time=11.248 msec
DEBUG: checkpoint sync: number=7 file=base/16384/16388 time=11.083 msec
DEBUG: checkpoint sync: number=8 file=base/16384/11712 time=11.314 msec
DEBUG: checkpoint sync: number=9 file=base/16384/16397_vm time=11.103 msec
DEBUG: checkpoint sync: number=10 file=base/16384/16385 time=19.308 msec
DEBUG: attempting to remove WAL segments older than log file
000000010000000000000000
DEBUG: SlruScanDirectory invoking callback on pg_subtrans/0000
LOG: checkpoint complete: wrote 1659 buffers (54.0%); 0 transaction
log file(s) added, 0 removed, 0 recycled; write=0.025 s, sync=15.617
s, total=15.898 s; sync files=10, longest=12.306 s, average=1.561 s
LOG: database system is shut down

Then I started the server again and it ran properly.

Well, all the tests were run with the default postgresql.conf on my
laptop, but I'll set up a more "real world" environment to test for
performance regressions. So far I haven't noticed any significant
difference in TPS before and after the patch in a small environment. I'll
post something soon.

Best regards,
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

#26

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Dickson S. Guedes (#25)

Re: Separating bgwriter and checkpointer

On Wed, Oct 5, 2011 at 5:10 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:

Ah, OK! I started reviewing the v4 version of the patch; these are my comments:

...

Well, all the tests were run with the default postgresql.conf on my
laptop, but I'll set up a more "real world" environment to test for
performance regressions. So far I haven't noticed any significant
difference in TPS before and after the patch in a small environment. I'll
post something soon.

Great testing, thanks. The patch will likely have no effect in an
environment that isn't I/O-swamped, but no regression is expected either.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#27

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Simon Riggs (#26)

Re: Separating bgwriter and checkpointer

On Wed, Oct 5, 2011 at 8:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Oct 5, 2011 at 5:10 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:

Ah ok! I started reviewing the v4 patch version; these are my comments:

...

Well, all the tests were run with the default postgresql.conf on my
laptop, but I'll set up a more "real world" environment to test for
performance regressions. So far I haven't noticed any significant
difference in TPS before and after the patch in a small environment.
I'll post something soon.

Great testing, thanks. It will likely have no effect in an environment
that isn't swamped by I/O, but no regression is expected either.

Any reason or objection to committing this patch?

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#28

Robert Haas

robertmhaas@gmail.com

about 14 years ago

In reply to: Simon Riggs (#27)

Re: Separating bgwriter and checkpointer

On Tue, Oct 18, 2011 at 9:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Any reason or objection to committing this patch?

Not on my end, though I haven't reviewed it in detail. One minor note
- I was mildly surprised to see that you moved this to the
checkpointer rather than leaving it in the bgwriter:

+	/* Do this once before starting the loop, then just at SIGHUP time. */
+	SyncRepUpdateSyncStandbysDefined();

My preference would probably have been to leave that in the background
writer, on the theory that the checkpointer's work is likely to be
more bursty and therefore it might be less responsive.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#29

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Robert Haas (#28)

Re: Separating bgwriter and checkpointer

On Tue, Oct 18, 2011 at 5:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Tue, Oct 18, 2011 at 9:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Any reason or objection to committing this patch?

Not on my end, though I haven't reviewed it in detail. One minor note
- I was mildly surprised to see that you moved this to the
checkpointer rather than leaving it in the bgwriter:
+       /* Do this once before starting the loop, then just at SIGHUP time. */
+       SyncRepUpdateSyncStandbysDefined();
My preference would probably have been to leave that in the background
writer, on the theory that the checkpointer's work is likely to be
more bursty and therefore it might be less responsive.

That needs to be in the checkpointer because that is the process that
shuts down last.

The bgwriter is now more like the walwriter. It shuts down early in
the shutdown process, while the checkpointer is last out.

So it wasn't preference, it was a requirement of the new role definitions.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#30

Robert Haas

robertmhaas@gmail.com

about 14 years ago

In reply to: Simon Riggs (#29)

Re: Separating bgwriter and checkpointer

On Tue, Oct 18, 2011 at 12:53 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, Oct 18, 2011 at 5:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Oct 18, 2011 at 9:18 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Any reason or objection to committing this patch?

Not on my end, though I haven't reviewed it in detail. One minor note
- I was mildly surprised to see that you moved this to the
checkpointer rather than leaving it in the bgwriter:
+       /* Do this once before starting the loop, then just at SIGHUP time. */
+       SyncRepUpdateSyncStandbysDefined();
My preference would probably have been to leave that in the background
writer, on the theory that the checkpointer's work is likely to be
more bursty and therefore it might be less responsive.
That needs to be in the checkpointer because that is the process that
shuts down last.

The bgwriter is now more like the walwriter. It shuts down early in
the shutdown process, while the checkpointer is last out.

So it wasn't preference, it was a requirement of the new role definitions.

Oh, I see.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Simon Riggs (#27)

Re: Separating bgwriter and checkpointer

On Tue, Oct 18, 2011 at 10:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Any reason or objection to committing this patch?

The checkpointer doesn't call pgstat_send_bgwriter(), but it should.
Otherwise, some entries in pg_stat_bgwriter will never be updated.
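
To illustrate why a missing report leaves those entries stuck, here is a
minimal sketch in C. It assumes a model in which each process counts its
activity locally and the numbers reach the shared totals only when the
process explicitly sends them. WriterStats and send_stats() are
hypothetical names made up for this illustration; they are not the
pgstat API.

```c
#include <assert.h>

/* Hypothetical stand-in for a few of the counters in pg_stat_bgwriter. */
typedef struct WriterStats
{
    long checkpoints_timed;
    long checkpoints_req;
    long buffers_checkpoint;
} WriterStats;

/*
 * Add one process's local counters to the shared totals, then zero the
 * local copy so the next report carries only new activity.
 */
static void
send_stats(WriterStats *local, WriterStats *totals)
{
    totals->checkpoints_timed += local->checkpoints_timed;
    totals->checkpoints_req += local->checkpoints_req;
    totals->buffers_checkpoint += local->buffers_checkpoint;
    *local = (WriterStats) {0};
}
```

In this model, a process that does checkpoints but never calls
send_stats() keeps its counts to itself, so the shared totals stay at
their old values no matter how many checkpoints complete.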

If we adopt the patch, checkpoints are performed by the checkpointer.
So it looks odd that checkpoint-related information lives in
pg_stat_bgwriter. Should we move it to a new catalog even if that
breaks compatibility?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#32

Dickson S. Guedes

listas@guedesoft.net

about 14 years ago

In reply to: Simon Riggs (#27)

Re: Separating bgwriter and checkpointer

2011/10/18 Simon Riggs <simon@2ndquadrant.com>:

On Wed, Oct 5, 2011 at 8:02 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Oct 5, 2011 at 5:10 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:

Ah ok! I started reviewing the v4 patch version; these are my comments:

...

Well, all the tests were run with the default postgresql.conf on my
laptop, but I'll set up a more "real world" environment to test for
performance regressions. So far I haven't noticed any significant
difference in TPS before and after the patch in a small environment.
I'll post something soon.

Great testing, thanks. It will likely have no effect in an environment
that isn't swamped by I/O, but no regression is expected either.

Any reason or objection to committing this patch?

I didn't see any performance regression (as expected) in the
environments that I tested. As for the code, I'd prefer that someone
with more experience review it.

Thanks.
--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

#33

Dickson S. Guedes

listas@guedesoft.net

about 14 years ago

In reply to: Fujii Masao (#31)

Re: Separating bgwriter and checkpointer

2011/10/19 Fujii Masao <masao.fujii@gmail.com>:

On Tue, Oct 18, 2011 at 10:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Any reason or objection to committing this patch?

The checkpointer doesn't call pgstat_send_bgwriter(), but it should.
Otherwise, some entries in pg_stat_bgwriter will never be updated.

Yes, checkpoints_req, checkpoints_timed and buffers_checkpoint are not
being updated with this patch.

If we adopt the patch, checkpoints are performed by the checkpointer.
So it looks odd that checkpoint-related information lives in
pg_stat_bgwriter. Should we move it to a new catalog even if that
breaks compatibility?

Would splitting pg_stat_bgwriter into pg_stat_bgwriter and
pg_stat_checkpointer break something internal?

With this modification we'll see applications like monitoring tools
breaking, but they could use a view to put data back together in a
compatible way, IMHO.

--
Dickson S. Guedes
mail/xmpp: guedes@guedesoft.net - skype: guediz
http://guedesoft.net - http://www.postgresql.org.br

#34

Robert Haas

robertmhaas@gmail.com

about 14 years ago

In reply to: Dickson S. Guedes (#33)

Re: Separating bgwriter and checkpointer

On Wed, Oct 19, 2011 at 8:43 AM, Dickson S. Guedes <listas@guedesoft.net> wrote:

2011/10/19 Fujii Masao <masao.fujii@gmail.com>:

On Tue, Oct 18, 2011 at 10:18 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

Any reason or objection to committing this patch?

The checkpointer doesn't call pgstat_send_bgwriter(), but it should.
Otherwise, some entries in pg_stat_bgwriter will never be updated.

Yes, checkpoints_req, checkpoints_timed and buffers_checkpoint are not
being updated with this patch.

If we adopt the patch, checkpoints are performed by the checkpointer.
So it looks odd that checkpoint-related information lives in
pg_stat_bgwriter. Should we move it to a new catalog even if that
breaks compatibility?

Would splitting pg_stat_bgwriter into pg_stat_bgwriter and
pg_stat_checkpointer break something internal?

With this modification we'll see applications like monitoring tools
breaking, but they could use a view to put data back together in a
compatible way, IMHO.

I don't really see any reason to break the monitoring view just
because we did some internal refactoring. I'd rather have backward
compatibility.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#35

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Robert Haas (#34)

Re: Separating bgwriter and checkpointer

On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't really see any reason to break the monitoring view just
because we did some internal refactoring. I'd rather have backward
compatibility.

Fair enough.

The patch doesn't change any documentation, but at least the
description of pg_stat_bgwriter seems to need changing.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#36

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Fujii Masao (#35)

Re: Separating bgwriter and checkpointer

On Wed, Oct 19, 2011 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't really see any reason to break the monitoring view just
because we did some internal refactoring. I'd rather have backward
compatibility.

Fair enough.

The patch doesn't change any documentation, but at least the
description of pg_stat_bgwriter seems to need changing.

Thanks for the review.

Will follow up on suggestions.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#37

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 14 years ago

In reply to: Simon Riggs (#36)

Re: Separating bgwriter and checkpointer

On 19.10.2011 17:58, Simon Riggs wrote:

On Wed, Oct 19, 2011 at 3:29 PM, Fujii Masao<masao.fujii@gmail.com> wrote:

On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas<robertmhaas@gmail.com> wrote:

I don't really see any reason to break the monitoring view just
because we did some internal refactoring. I'd rather have backward
compatibility.

Fair enough.

The patch doesn't change any documentation, but at least the
description of pg_stat_bgwriter seems to need changing.

Thanks for the review.

Will follow up on suggestions.

The patch looks sane, it's mostly just moving existing code around, but
there's one thing that's been bothering me about this whole idea from
the get-go:

If the bgwriter and checkpointer are two different processes, whenever
bgwriter writes out a page it needs to send an fsync-request to the
checkpointer. We avoided that when both functions were performed by the
same process, but now we have to send and absorb a fsync-request message
for every single write() that happens in the system, except for those
done at checkpoints. Isn't that very expensive? Does it make the
fsync-request queue a bottleneck on some workloads?
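
The mechanism in question can be sketched roughly as follows. This is a
minimal single-process illustration in C, not the code from the patch:
the writer side queues one request per write, and the checkpointer side
drains the queue into a table of files that still need an fsync,
dropping duplicates. FsyncRequest, FsyncQueue, QUEUE_CAPACITY,
forward_fsync_request() and absorb_fsync_requests() are all hypothetical
names, and the sketch leaves out the shared memory and locking that a
real multi-process queue would need.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 8    /* hypothetical; tiny so overflow is easy to see */

/* Hypothetical request: one file segment that needs an fsync. */
typedef struct FsyncRequest
{
    unsigned int relfile;   /* which relation file */
    unsigned int segno;     /* which segment of that file */
} FsyncRequest;

typedef struct FsyncQueue
{
    FsyncRequest requests[QUEUE_CAPACITY];
    size_t       count;
} FsyncQueue;

/*
 * Writer side: called once per write().  Returns false when the queue is
 * full, in which case the caller has to deal with the fsync some other way.
 */
static bool
forward_fsync_request(FsyncQueue *q, unsigned int relfile, unsigned int segno)
{
    if (q->count >= QUEUE_CAPACITY)
        return false;
    q->requests[q->count].relfile = relfile;
    q->requests[q->count].segno = segno;
    q->count++;
    return true;
}

/*
 * Checkpointer side: move queued requests into the pending table, skipping
 * files that are already there, and empty the queue.  Returns the new
 * number of pending entries.
 */
static size_t
absorb_fsync_requests(FsyncQueue *q, FsyncRequest *pending,
                      size_t npending, size_t pending_cap)
{
    for (size_t i = 0; i < q->count; i++)
    {
        bool seen = false;

        for (size_t j = 0; j < npending; j++)
        {
            if (pending[j].relfile == q->requests[i].relfile &&
                pending[j].segno == q->requests[i].segno)
            {
                seen = true;
                break;
            }
        }
        if (!seen && npending < pending_cap)
            pending[npending++] = q->requests[i];
    }
    q->count = 0;
    return npending;
}
```

Two writes to the same segment plus one write to another segment queue
three requests but leave only two files pending after the absorb step.
The cost being asked about is the per-write enqueue and the periodic
drain, and the full-queue case is where a writer would stall.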

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#38

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Heikki Linnakangas (#37)

Re: Separating bgwriter and checkpointer

On Mon, Oct 24, 2011 at 11:40 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

The patch looks sane, it's mostly just moving existing code around, but
there's one thing that's been bothering me about this whole idea from the
get-go:

If the bgwriter and checkpointer are two different processes, whenever
bgwriter writes out a page it needs to send an fsync-request to the
checkpointer. We avoided that when both functions were performed by the same
process, but now we have to send and absorb a fsync-request message for
every single write() that happens in the system, except for those done at
checkpoints. Isn't that very expensive? Does it make the fsync-request queue
a bottleneck on some workloads?

That is a reasonable question and one I considered.

I did some benchmarking earlier to see the overhead of that.
Basically, it's very small, much, much smaller than I thought.

The benefit of allowing the bgwriter to continue working during long
fsyncs easily outweighs the cost of sending more fsync-requests. Both
effects occur at the same time, so the overhead is always covered.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#39

Simon Riggs

simon@2ndQuadrant.com

about 14 years ago

In reply to: Simon Riggs (#36)

Re: Separating bgwriter and checkpointer

On Wed, Oct 19, 2011 at 3:58 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Oct 19, 2011 at 3:29 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Oct 19, 2011 at 9:45 PM, Robert Haas <robertmhaas@gmail.com> wrote:

I don't really see any reason to break the monitoring view just
because we did some internal refactoring. I'd rather have backward
compatibility.

Fair enough.

The patch doesn't change any documentation, but at least the
description of pg_stat_bgwriter seems to need changing.

Thanks for the review.

Will follow up on suggestions.

I'll add this in as a separate item after commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services