Infrastructure changes for recovery (v8)

Started by Simon Riggs over 17 years ago, 25 messages

simon@2ndQuadrant.com

over 17 years ago

1 attachment(s)

Patch now includes all previously agreed changes, plus I've found what
looks to be a workable method of removing the shutdown checkpoint
without loss of robustness.

Patch summary

Tuning
* Bgwriter performs dirty block cleaning during recovery
* Bgwriter performs restartpoints, offloading this task from Startup
process to allow it to continue with recovery actions
* Shutdown checkpoint removed at end of recovery. Bgwriter performs
immediate checkpoint instead, so we have same protection, but
connections and transactions can be started earlier than previously.
* PreAllocXLogs() not performed by startup process, so we do not delay
startup while we write zeroes to next WAL file. bgwriter does that now.
* XLogCtl structure padding for enhanced scalability
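
To illustrate the structure-padding item above, here is a minimal sketch of the idea, not code from the patch. The struct names (InsertState, WriteState, PaddedCtl) are hypothetical stand-ins for the real XLogCtl members; only the 128-byte spacing figure is taken from the patch. Each group of fields that is protected by a different lock is padded out to a fixed spacing so that the groups do not share a cache line.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUFFER_SPACING 128   /* spacing figure used by the patch */

/* Hypothetical stand-ins for two separately locked groups of fields. */
typedef struct InsertState { uint64_t prev_record; uint32_t cur_page; } InsertState;
typedef struct WriteState  { uint64_t written_to;  uint32_t cur_index; } WriteState;

typedef struct PaddedCtl {
    InsertState insert;   /* guarded by one lock */
    char        insert_pad[BUFFER_SPACING - sizeof(InsertState)];
    WriteState  write;    /* guarded by a different lock */
    char        write_pad[BUFFER_SPACING - sizeof(WriteState)];
} PaddedCtl;

/* Compile-time checks: the pad sizes above must stay positive, and the
 * second group must start exactly one spacing unit into the struct. */
_Static_assert(sizeof(InsertState) < BUFFER_SPACING, "InsertState too large");
_Static_assert(sizeof(WriteState) < BUFFER_SPACING, "WriteState too large");
_Static_assert(offsetof(PaddedCtl, write) == BUFFER_SPACING, "bad layout");

/* Runtime view of the same fact, handy in a unit test. */
size_t write_offset(void)
{
    assert(offsetof(PaddedCtl, write) % BUFFER_SPACING == 0);
    return offsetof(PaddedCtl, write);
}
```

Padding of this kind only separates the groups fully if the structure itself starts on a suitably aligned address, which this sketch assumes rather than enforces.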

Recovery State Changes
* If archive recovery proceeds past a safe stopping point we signal the
postmaster that the database is now in a consistent state, PM_RECOVERY. This
state change is also linked to startup of the bgwriter and stats
processes (and will in future be the place where read-only backends
may also connect)
* optional recovery_safe_start_location parameter now provided in
recovery.conf, to allow a consistency point to be manually defined if a
base backup was not taken using standard pg_start/stop backup functions
* New minSafeStartPoint added to controlfile to allow us to determine
consistency if archive recovery crashes/restarts. Value is updated each
time we access a new WAL file.
* stats file removed earlier in recovery, so we may accumulate new stats
during recovery
* End of recovery is now marked by a clear global state change. Change
is global, atomic and fast - tested for using IsRecoveryProcessingMode()

Additional Safeguards
* Locks are placed around all ControlFile operations
* XLogInsert() and AssignTransactionId() now have specific checks to
prevent their use during recovery
* Makes StartupMultiXact() atomic. Adds comments to show that
StartupCLOG() is already atomic, though StartupSUBTRANS() is not (this
will be addressed in a later patch, so not touched here)
* recovery.conf is not removed until slightly later now, to protect
against crash at the end of startup
* New WAL record XLOG_RECOVERY_END is now the only place where timelineid
may change

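The recovery-mode checks listed above rely on a test that is cheap to call once recovery is over. The following is a single-process sketch of that caching pattern, with hypothetical names throughout; the real code reads the flag from shared memory under a spinlock, which is omitted here. Once a process has seen recovery end, it never reads the shared flag again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the shared-memory state. */
typedef struct SharedState {
    bool recovery_mode;   /* true until recovery finishes */
    int  reads;           /* counts shared reads, for illustration only */
} SharedState;

static SharedState shared = { true, 0 };

/* Per-process cache of the shared flag. */
static bool local_recovery_mode = true;
static bool known_mode = false;

bool is_recovery_mode(void)
{
    if (known_mode && !local_recovery_mode)
        return false;                 /* fast path: recovery already over */

    shared.reads++;                   /* the real code locks around this read */
    local_recovery_mode = shared.recovery_mode;
    known_mode = true;
    return local_recovery_mode;
}

/* Recovery ends exactly once; the state never goes back to true. */
void end_recovery(void)
{
    assert(shared.recovery_mode);
    shared.recovery_mode = false;
}

/* Guard in the style of the new checks: during recovery only the
 * end-of-recovery record may be written. */
bool may_insert_wal(bool is_recovery_end_record)
{
    return !is_recovery_mode() || is_recovery_end_record;
}
```

The caching is safe only because the state changes in one direction, from recovery to normal running.
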
Other Changes
* log_restartpoints removed, use log_checkpoints in postgresql.conf
* pg_controldata and pg_resetxlog changed to show safe start point
* designed to work in EXEC_BACKEND mode for Windows
* additional function signature for pg_start_backup('label', true |
false) to allow definition of immediate checkpoint/not
* doc changes for recovery.conf parameters
* fixes bug discovered during other testing: if pg_stop_backup() is run
when an xlogswitch has just occurred then we do not switch log files, yet
we return the current filename even though there is nothing of value in it.
If archive_timeout is not enabled we would wait forever for pg_stop_backup()
to return.
* Substantial comments throughout

Patch is now v8.

Please review, everybody. Many thanks.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachments:

recovery_infrastruc.v8.patch (text/x-patch; charset=UTF-8)

Index: doc/src/sgml/backup.sgml
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/backup.sgml,v
retrieving revision 2.120
diff -c -r2.120 backup.sgml
*** doc/src/sgml/backup.sgml	18 Jul 2008 17:33:17 -0000	2.120
--- doc/src/sgml/backup.sgml	30 Sep 2008 17:15:15 -0000
***************
*** 1200,1205 ****
--- 1200,1229 ----
        </listitem>
       </varlistentry>
  
+      <varlistentry id="recovery-safe-start-location"
+                    xreflabel="recovery_safe_start_location">
+       <term><varname>recovery_safe_start_location</varname>
+         (<type>string</type>)
+       </term>
+       <listitem>
+        <para>
+         Allows user to optionally specify a safe start location for a base
+ 		backup that was not made online using <function>pg_start_backup()</> 
+ 		and <function>pg_stop_backup()</>.  If those functions were used, 
+ 		this parameter need not be set because the server sets this for you
+ 		automatically to avoid error.  You cannot use this parameter to move
+ 		the safe stopping point to an earlier transaction log location. The
+ 		format for this parameter is identical to the output of 
+ 		<function>pg_current_xlog_insert_location()</>, example: 
+ <programlisting>
+ recovery_safe_start_location = '0/D4445B8'
+ </programlisting>
+ 		The location always has a forward slash, even on Windows, since it
+ 		is not a file path.
+        </para>
+       </listitem>
+      </varlistentry>
+ 
       <varlistentry id="log-restartpoints"
                     xreflabel="log_restartpoints">
        <term><varname>log_restartpoints</varname>
***************
*** 1207,1215 ****
        </term>
        <listitem>
         <para>
!         Specifies whether to log each restart point as it occurs. This
!         can be helpful to track the progress of a long recovery.
!         Default is <literal>false</>.
         </para>
        </listitem>
       </varlistentry>
--- 1231,1239 ----
        </term>
        <listitem>
         <para>
!         This parameter has now been deprecated. Instead, please set
! 		<varname>log_checkpoints</varname> in <filename>postgresql.conf</>
! 		if you want similar log entries during recovery.
         </para>
        </listitem>
       </varlistentry>
Index: doc/src/sgml/func.sgml
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/doc/src/sgml/func.sgml,v
retrieving revision 1.447
diff -c -r1.447 func.sgml
*** doc/src/sgml/func.sgml	11 Sep 2008 17:32:33 -0000	1.447
--- doc/src/sgml/func.sgml	30 Sep 2008 17:15:15 -0000
***************
*** 12262,12267 ****
--- 12262,12275 ----
        </row>
        <row>
         <entry>
+         <literal><function>pg_start_backup</function>(<parameter>label</> <type>text</>)</literal>
+         </entry>
+        <entry><type>text</type>, <type>boolean</type></entry>
+        <entry>Set up for performing on-line backup, specifying if
+ 		we want an immediate checkpoint or not.</entry>
+       </row>
+       <row>
+        <entry>
          <literal><function>pg_stop_backup</function>()</literal>
          </entry>
         <entry><type>text</type></entry>
***************
*** 12333,12338 ****
--- 12341,12350 ----
      interest).  After noting the ending location, the current transaction log insertion
      point is automatically advanced to the next transaction log file, so that the
      ending transaction log file can be archived immediately to complete the backup.
+ 	<function>pg_start_backup</> issues a checkpoint while we wait. 
+ 	<function>pg_start_backup</> can also be specified with two parameters,
+ 	the second parameter defining whether the checkpoint is an immediate
+ 	checkpoint or whether we write out buffers smoothly over a short period.
     </para>
  
     <para>
Index: src/backend/access/transam/clog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.47
diff -c -r1.47 clog.c
*** src/backend/access/transam/clog.c	1 Aug 2008 13:16:08 -0000	1.47
--- src/backend/access/transam/clog.c	30 Sep 2008 17:15:15 -0000
***************
*** 260,265 ****
--- 260,268 ----
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+  *
+  * We access just a single clog page, so this action is atomic and safe
+  * for use if other processes are active during recovery.
   */
  void
  StartupCLOG(void)
Index: src/backend/access/transam/multixact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/multixact.c,v
retrieving revision 1.28
diff -c -r1.28 multixact.c
*** src/backend/access/transam/multixact.c	1 Aug 2008 13:16:08 -0000	1.28
--- src/backend/access/transam/multixact.c	30 Sep 2008 17:15:15 -0000
***************
*** 1413,1420 ****
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.	Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We don't need any locks here, really; the SLRU locks are taken
!  * only because slru.c expects to be called with locks held.
   */
  void
  StartupMultiXact(void)
--- 1413,1423 ----
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.	Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We want this operation to be atomic to ensure that other processes can 
!  * use MultiXact while we complete recovery. We access one page only from the
!  * offset and members buffers, so once locks are acquired they will not be
!  * dropped and re-acquired by SLRU code. So we take both locks at start, then
!  * hold them all the way to the end.
   */
  void
  StartupMultiXact(void)
***************
*** 1426,1431 ****
--- 1429,1435 ----
  
  	/* Clean up offsets state */
  	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
+ 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * Initialize our idea of the latest page number.
***************
*** 1452,1461 ****
  		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
  	}
  
- 	LWLockRelease(MultiXactOffsetControlLock);
- 
  	/* And the same for members */
- 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * Initialize our idea of the latest page number.
--- 1456,1462 ----
***************
*** 1483,1488 ****
--- 1484,1490 ----
  	}
  
  	LWLockRelease(MultiXactMemberControlLock);
+ 	LWLockRelease(MultiXactOffsetControlLock);
  
  	/*
  	 * Initialize lastTruncationPoint to invalid, ensuring that the first
***************
*** 1543,1549 ****
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!InRecovery)
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
--- 1545,1551 ----
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!IsRecoveryProcessingMode())
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
Index: src/backend/access/transam/subtrans.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/subtrans.c,v
retrieving revision 1.23
diff -c -r1.23 subtrans.c
*** src/backend/access/transam/subtrans.c	1 Aug 2008 13:16:08 -0000	1.23
--- src/backend/access/transam/subtrans.c	30 Sep 2008 17:15:15 -0000
***************
*** 226,231 ****
--- 226,234 ----
   *
   * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
   * if there are none.
+  *
+  * Note that this is not atomic and is not yet safe to perform while other
+  * processes might access subtrans.
   */
  void
  StartupSUBTRANS(TransactionId oldestActiveXID)
Index: src/backend/access/transam/xact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.265
diff -c -r1.265 xact.c
*** src/backend/access/transam/xact.c	11 Aug 2008 11:05:10 -0000	1.265
--- src/backend/access/transam/xact.c	30 Sep 2008 17:15:15 -0000
***************
*** 393,398 ****
--- 393,401 ----
  	bool		isSubXact = (s->parent != NULL);
  	ResourceOwner currentOwner;
  
+ 	if (IsRecoveryProcessingMode())
+ 		elog(FATAL, "cannot assign TransactionIds during recovery");
+ 
  	/* Assert that caller didn't screw up */
  	Assert(!TransactionIdIsValid(s->transactionId));
  	Assert(s->state == TRANS_INPROGRESS);
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.319
diff -c -r1.319 xlog.c
*** src/backend/access/transam/xlog.c	23 Sep 2008 09:20:35 -0000	1.319
--- src/backend/access/transam/xlog.c	30 Sep 2008 22:32:49 -0000
***************
*** 113,119 ****
  
  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system.
   */
  TimeLineID	ThisTimeLineID = 0;
  
--- 113,120 ----
  
  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system. Zero is always a bug, so we 
!  * start with that to allow us to spot any errors.
   */
  TimeLineID	ThisTimeLineID = 0;
  
***************
*** 123,128 ****
--- 124,133 ----
  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;
  
+ /* Local copy of shared RecoveryProcessingMode state */
+ static bool LocalRecoveryProcessingMode = true;
+ static bool knownProcessingMode = false;
+ 
  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;
  
***************
*** 131,137 ****
  static bool recoveryTarget = false;
  static bool recoveryTargetExact = false;
  static bool recoveryTargetInclusive = true;
- static bool recoveryLogRestartpoints = false;
  static TransactionId recoveryTargetXid;
  static TimestampTz recoveryTargetTime;
  static TimestampTz recoveryLastXTime = 0;
--- 136,141 ----
***************
*** 141,146 ****
--- 145,153 ----
  static TimestampTz recoveryStopTime;
  static bool recoveryStopAfter;
  
+ /* is the database proven consistent yet? */
+ bool	reachedSafeStartPoint = false;
+ 
  /*
   * During normal operation, the only timeline we care about is ThisTimeLineID.
   * During recovery, however, things are more complicated.  To simplify life
***************
*** 240,248 ****
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint (ensures only one
!  * checkpointer at a time; currently, with all checkpoints done by the
!  * bgwriter, this is just pro forma).
   *
   *----------
   */
--- 247,256 ----
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
!  * we get just one of those at any time. In 8.4+ recovery, both startup and
!  * bgwriter processes may take restartpoints, so this locking must be strict 
!  * to ensure there are no mistakes.
   *
   *----------
   */
***************
*** 285,295 ****
--- 293,310 ----
  
  /*
   * Total shared-memory state for XLOG.
+  *
+  * This small structure is accessed by many backends, so we take care to
+  * pad out the parts of the structure so they can be accessed by separate
+  * CPUs without causing false sharing cache flushes. Padding is generous
+  * to allow for a wide variety of CPU architectures.
   */
+ #define	XLOGCTL_BUFFER_SPACING	128
  typedef struct XLogCtlData
  {
  	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
+ 	char	InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)];
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
***************
*** 297,305 ****
--- 312,327 ----
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
+ 	/* add data structure padding for above info_lck declarations */
+ 	char	InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) 
+ 											- sizeof(XLogwrtResult)
+ 											- sizeof(uint32)
+ 											- sizeof(TransactionId)
+ 											- sizeof(XLogRecPtr)];
  
  	/* Protected by WALWriteLock: */
  	XLogCtlWrite Write;
+ 	char	WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)];
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
***************
*** 311,316 ****
--- 333,356 ----
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
  
+ 	/*
+ 	 * IsRecoveryProcessingMode shows whether the postmaster is in a
+ 	 * postmaster state earlier than PM_RUN, or not. This is a globally
+ 	 * accessible state to allow EXEC_BACKEND case.
+ 	 *
+ 	 * We also retain a local state variable InRecovery. InRecovery=true
+ 	 * means the code is being executed by Startup process and therefore
+ 	 * always during Recovery Processing Mode. This allows us to identify
+ 	 * code executed *during* Recovery Processing Mode but not necessarily
+ 	 * by Startup process itself.
+ 	 *
+ 	 * Protected by mode_lck
+ 	 */
+ 	bool		SharedRecoveryProcessingMode;
+ 	slock_t		mode_lck;
+ 
+ 	char		InfoLockPadding[XLOGCTL_BUFFER_SPACING];
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 397,404 ****
--- 437,446 ----
  static void readRecoveryCommandFile(void);
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
+ static void exitRecovery(void);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+ static XLogRecPtr GetRedoLocationForCheckpoint(void);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 480,485 ****
--- 522,532 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		isRecoveryEnd = (rmid == RM_XLOG_ID && info == XLOG_RECOVERY_END);
+ 
+ 	/* cross-check on whether we should be here or not */
+ 	if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+ 		elog(FATAL, "cannot make new WAL entries during recovery");
  
  	/* info's high bits are reserved for use by me */
  	if (info & XLR_INFO_MASK)
***************
*** 1720,1727 ****
  	XLogRecPtr	WriteRqstPtr;
  	XLogwrtRqst WriteRqst;
  
! 	/* Disabled during REDO */
! 	if (InRedo)
  		return;
  
  	/* Quick exit if already known flushed */
--- 1767,1773 ----
  	XLogRecPtr	WriteRqstPtr;
  	XLogwrtRqst WriteRqst;
  
! 	if (IsRecoveryProcessingMode())
  		return;
  
  	/* Quick exit if already known flushed */
***************
*** 1809,1817 ****
  	 * the bad page is encountered again during recovery then we would be
  	 * unable to restart the database at all!  (This scenario has actually
  	 * happened in the field several times with 7.1 releases. Note that we
! 	 * cannot get here while InRedo is true, but if the bad page is brought in
! 	 * and marked dirty during recovery then CreateCheckPoint will try to
! 	 * flush it at the end of recovery.)
  	 *
  	 * The current approach is to ERROR under normal conditions, but only
  	 * WARNING during recovery, so that the system can be brought up even if
--- 1855,1863 ----
  	 * the bad page is encountered again during recovery then we would be
  	 * unable to restart the database at all!  (This scenario has actually
  	 * happened in the field several times with 7.1 releases. Note that we
! 	 * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
! 	 * brought in and marked dirty during recovery then if a checkpoint were
! 	 * performed at the end of recovery it will try to flush it.
  	 *
  	 * The current approach is to ERROR under normal conditions, but only
  	 * WARNING during recovery, so that the system can be brought up even if
***************
*** 1821,1827 ****
  	 * and so we will not force a restart for a bad LSN on a data page.
  	 */
  	if (XLByteLT(LogwrtResult.Flush, record))
! 		elog(InRecovery ? WARNING : ERROR,
  		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
--- 1867,1873 ----
  	 * and so we will not force a restart for a bad LSN on a data page.
  	 */
  	if (XLByteLT(LogwrtResult.Flush, record))
! 		elog(ERROR,
  		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
***************
*** 2094,2100 ****
  		unlink(tmppath);
  	}
  
! 	elog(DEBUG2, "done creating and filling new WAL file");
  
  	/* Set flag to tell caller there was no existent file */
  	*use_existent = false;
--- 2140,2147 ----
  		unlink(tmppath);
  	}
  
! 	XLogFileName(tmppath, ThisTimeLineID, log, seg);
! 	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
  
  	/* Set flag to tell caller there was no existent file */
  	*use_existent = false;
***************
*** 2400,2405 ****
--- 2447,2474 ----
  					 xlogfname);
  			set_ps_display(activitymsg, false);
  
+ 			/* 
+ 			 * Calculate and write out a new safeStartPoint. This defines
+ 			 * the latest LSN that might appear on-disk while we apply
+ 			 * the WAL records in this file. If we crash during recovery
+ 			 * we must reach this point again before we can prove
+ 			 * database consistency. Not a restartpoint! Restart points
+ 			 * define where we should start recovery from, if we crash.
+ 			 */
+ 			if (InArchiveRecovery)
+ 			{
+ 				uint32 nextLog = log;
+ 				uint32 nextSeg = seg;
+ 
+ 				NextLogSeg(nextLog, nextSeg);
+ 
+ 				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 				ControlFile->minSafeStartPoint.xlogid = nextLog;
+ 				ControlFile->minSafeStartPoint.xrecoff = nextSeg * XLogSegSize;
+ 				UpdateControlFile();
+ 				LWLockRelease(ControlFileLock);
+ 			}
+ 
  			return fd;
  		}
  		if (errno != ENOENT)	/* unexpected failure? */
***************
*** 4228,4233 ****
--- 4297,4303 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
+ 	SpinLockInit(&XLogCtl->mode_lck);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
***************
*** 4532,4548 ****
  			ereport(LOG,
  					(errmsg("recovery_target_inclusive = %s", tok2)));
  		}
  		else if (strcmp(tok1, "log_restartpoints") == 0)
  		{
- 			/*
- 			 * does nothing if a recovery_target is not also set
- 			 */
- 			if (!parse_bool(tok2, &recoveryLogRestartpoints))
- 				  ereport(ERROR,
- 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
- 					  errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
  			ereport(LOG,
! 					(errmsg("log_restartpoints = %s", tok2)));
  		}
  		else
  			ereport(FATAL,
--- 4602,4642 ----
  			ereport(LOG,
  					(errmsg("recovery_target_inclusive = %s", tok2)));
  		}
+ 		else if (strcmp(tok1, "recovery_safe_start_location") == 0)
+ 		{
+ 			unsigned int uxlogid;
+ 			unsigned int uxrecoff;
+ 			XLogRecPtr	NewSafeStartPtr;
+ 
+ 			if (sscanf(tok2, "%X/%X", &uxlogid, &uxrecoff) != 2)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 						 errmsg("could not parse transaction log location \"%s\"",
+ 								tok2)));
+ 
+ 			NewSafeStartPtr.xlogid = uxlogid;
+ 			NewSafeStartPtr.xrecoff = uxrecoff;
+ 			if (XLByteLE(ControlFile->minSafeStartPoint, NewSafeStartPtr))
+ 			{
+ 				ControlFile->minSafeStartPoint.xlogid = uxlogid;
+ 				ControlFile->minSafeStartPoint.xrecoff = uxrecoff;
+ 
+ 				ereport(LOG,
+ 					(errmsg("recovery_safe_start_location = '%s'", tok2)));
+ 			}
+ 			else if (ControlFile->state != DB_IN_ARCHIVE_RECOVERY)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 						 errmsg("recovery_safe_start_location = '%s' is earlier than control file %X/%X",
+ 								tok2,
+ 								ControlFile->minSafeStartPoint.xlogid,
+ 								ControlFile->minSafeStartPoint.xrecoff)));
+ 		}
  		else if (strcmp(tok1, "log_restartpoints") == 0)
  		{
  			ereport(LOG,
! 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 					  errmsg("parameter \"log_restartpoints\" has been deprecated")));
  		}
  		else
  			ereport(FATAL,
***************
*** 4678,4692 ****
  	unlink(recoveryPath);		/* ignore any error */
  
  	/*
! 	 * Rename the config file out of the way, so that we don't accidentally
! 	 * re-enter archive recovery mode in a subsequent crash.
  	 */
- 	unlink(RECOVERY_COMMAND_DONE);
- 	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
- 		ereport(FATAL,
- 				(errcode_for_file_access(),
- 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
- 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
  
  	ereport(LOG,
  			(errmsg("archive recovery complete")));
--- 4772,4784 ----
  	unlink(recoveryPath);		/* ignore any error */
  
  	/*
! 	 * As of 8.4 we no longer rename the recovery.conf file out of the
! 	 * way until after we have performed a full checkpoint. This ensures
! 	 * that any crash between now and the end of the checkpoint does not
! 	 * attempt to restart from a WAL file that is no longer available to us.
! 	 * As soon as we remove recovery.conf we lose our recovery_command and
! 	 * cannot reaccess WAL files from the archive.
  	 */
  
  	ereport(LOG,
  			(errmsg("archive recovery complete")));
***************
*** 4813,4818 ****
--- 4905,4911 ----
  	CheckPoint	checkPoint;
  	bool		wasShutdown;
  	bool		reachedStopPoint = false;
+ 	bool		performedRecovery = false;
  	bool		haveBackupLabel = false;
  	XLogRecPtr	RecPtr,
  				LastRec,
***************
*** 4825,4830 ****
--- 4918,4925 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  
+ 	XLogCtl->SharedRecoveryProcessingMode = true;
+ 
  	/*
  	 * Read control file and check XLOG status looks valid.
  	 *
***************
*** 5038,5046 ****
--- 5133,5147 ----
  		if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
  			ControlFile->minRecoveryPoint = minRecoveryLoc;
  		ControlFile->time = (pg_time_t) time(NULL);
+ 		/* No need to hold ControlFileLock yet, we aren't up far enough */
  		UpdateControlFile();
  
  		/*
+ 		 * Reset pgstat data, because it may be invalid after recovery.
+ 		 */
+ 		pgstat_reset_all();
+ 
+ 		/*
  		 * If there was a backup label file, it's done its job and the info
  		 * has now been propagated into pg_control.  We must get rid of the
  		 * label file so that if we crash during recovery, we'll pick up at
***************
*** 5150,5155 ****
--- 5251,5282 ----
  
  				LastRec = ReadRecPtr;
  
+ 				/*
+ 				 * Have we reached our safe starting point? If so, we can
+ 				 * signal Postmaster to enter consistent recovery mode.
+ 				 *
+ 				 * There are two points in the log we must pass. The first is
+ 				 * the minRecoveryPoint, which is the LSN at the time the
+ 				 * base backup was taken that we are about to roll forward from.
+ 				 * If recovery has ever crashed or was stopped there is
+ 				 * another point also: minSafeStartPoint, which is the
+ 				 * latest LSN that recovery could have reached prior to crash.
+ 				 */
+ 				if (!reachedSafeStartPoint && 
+ 					 XLByteLE(ControlFile->minSafeStartPoint, EndRecPtr) && 
+ 					 XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+ 				{
+ 					reachedSafeStartPoint = true;
+ 					if (InArchiveRecovery)
+ 					{
+ 						ereport(LOG,
+ 							(errmsg("consistent recovery state reached at %X/%X",
+ 								EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+ 						if (IsUnderPostmaster)
+ 							SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+ 					}
+ 				}
+ 
  				record = ReadRecord(NULL, LOG);
  			} while (record != NULL && recoveryContinue);
  
***************
*** 5171,5176 ****
--- 5298,5304 ----
  			/* there are no WAL records following the checkpoint */
  			ereport(LOG,
  					(errmsg("redo is not required")));
+ 			reachedSafeStartPoint = true;
  		}
  	}
  
***************
*** 5184,5192 ****
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
! 	 * dump consistent.
  	 */
! 	if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
  	{
  		if (reachedStopPoint)	/* stopped because of stop request */
  			ereport(FATAL,
--- 5312,5320 ----
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
! 	 * dump consistent and start safely.
  	 */
! 	if (InRecovery && !reachedSafeStartPoint)
  	{
  		if (reachedStopPoint)	/* stopped because of stop request */
  			ereport(FATAL,
***************
*** 5308,5346 ****
  		XLogCheckInvalidPages();
  
  		/*
! 		 * Reset pgstat data, because it may be invalid after recovery.
  		 */
! 		pgstat_reset_all();
  
! 		/*
! 		 * Perform a checkpoint to update all our recovery activity to disk.
! 		 *
! 		 * Note that we write a shutdown checkpoint rather than an on-line
! 		 * one. This is not particularly critical, but since we may be
! 		 * assigning a new TLI, using a shutdown checkpoint allows us to have
! 		 * the rule that TLI only changes in shutdown checkpoints, which
! 		 * allows some extra error checking in xlog_redo.
! 		 */
! 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
  	}
  
- 	/*
- 	 * Preallocate additional log files, if wanted.
- 	 */
- 	PreallocXlogFiles(EndOfLog);
- 
- 	/*
- 	 * Okay, we're officially UP.
- 	 */
- 	InRecovery = false;
- 
- 	ControlFile->state = DB_IN_PRODUCTION;
- 	ControlFile->time = (pg_time_t) time(NULL);
- 	UpdateControlFile();
- 
- 	/* start the archive_timeout timer running */
- 	XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
- 
  	/* initialize shared-memory copy of latest checkpoint XID/epoch */
  	XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
--- 5436,5449 ----
  		XLogCheckInvalidPages();
  
  		/*
! 		 * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
! 		 * a shutdown checkpoint here, but we ask bgwriter to do that now.
  		 */
! 		exitRecovery();
  
! 		performedRecovery = true;
  	}
  
  	/* initialize shared-memory copy of latest checkpoint XID/epoch */
  	XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
***************
*** 5374,5379 ****
--- 5477,5565 ----
  		readRecordBuf = NULL;
  		readRecordBufSize = 0;
  	}
+ 
+ 	/*
+ 	 * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
+ 	 * This could add minutes to the startup time, so we want bgwriter
+ 	 * to perform it. This then frees the Startup process to complete so we can
+ 	 * allow transactions and WAL inserts. We still write a checkpoint, but
+ 	 * it will be an online checkpoint. Online checkpoints have a redo
+ 	 * location that can be prior to the actual checkpoint record. So we want
+ 	 * to derive that redo location *before* we let anybody else write WAL,
+ 	 * otherwise we might miss some WAL records if we crash.
+ 	 */
+ 	if (performedRecovery)
+ 	{
+ 		XLogRecPtr	redo;
+ 
+ 		/* 
+ 		 * We must grab the pointer before anybody writes WAL 
+ 		 */
+ 		redo = GetRedoLocationForCheckpoint();
+ 
+ 		/* 
+ 		 * Tell the bgwriter
+ 		 */
+ 		SetRedoLocationForArchiveCheckpoint(redo);
+ 
+ 		/*
+ 		 * Okay, we can come up now. Allow others to write WAL.
+ 		 */
+ 		XLogCtl->SharedRecoveryProcessingMode = false;
+ 
+ 		/*
+ 		 * Now request checkpoint
+ 		 */
+ 		RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * No recovery, so let's just get on with it.
+ 		 */
+ 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 		ControlFile->state = DB_IN_PRODUCTION;
+ 		ControlFile->time = (pg_time_t) time(NULL);
+ 		UpdateControlFile();
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		/*
+ 		 * Okay, we're officially UP.
+ 		 */
+ 		XLogCtl->SharedRecoveryProcessingMode = false;
+ 	}
+ 
+ 	/* start the archive_timeout timer running */
+ 	XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);
+ 
+ }
+ 
+ /*
+  * IsRecoveryProcessingMode()
+  *
+  * Fast test for whether we're still in recovery or not. We test the shared
+  * state each time only until we leave recovery mode. After that we never
+  * look again, relying upon the settings of our local state variables. This
+  * is designed to avoid the need for a separate initialisation step.
+  */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+ 	if (knownProcessingMode && !LocalRecoveryProcessingMode)
+ 		return false;
+ 
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->mode_lck);
+ 		LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+ 		SpinLockRelease(&xlogctl->mode_lck);
+ 	}
+ 
+ 	knownProcessingMode = true;
+ 
+ 	return LocalRecoveryProcessingMode;
  }
  
  /*
***************
*** 5631,5650 ****
  static void
  LogCheckpointStart(int flags)
  {
! 	elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
! 		 (flags & CHECKPOINT_WAIT) ? " wait" : "",
! 		 (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! 		 (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }
  
  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(void)
  {
  	long		write_secs,
  				sync_secs,
--- 5817,5840 ----
  static void
  LogCheckpointStart(int flags)
  {
! 	if (flags & CHECKPOINT_RESTARTPOINT)
! 		elog(LOG, "restartpoint starting:%s",
! 			 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "");
! 	else
! 		elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! 			 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! 			 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! 			 (flags & CHECKPOINT_FORCE) ? " force" : "",
! 			 (flags & CHECKPOINT_WAIT) ? " wait" : "",
! 			 (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! 			 (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }
  
  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(int flags)
  {
  	long		write_secs,
  				sync_secs,
***************
*** 5667,5683 ****
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
! 	elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! 		 "%d transaction log file(s) added, %d removed, %d recycled; "
! 		 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 		 CheckpointStats.ckpt_bufs_written,
! 		 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 		 CheckpointStats.ckpt_segs_added,
! 		 CheckpointStats.ckpt_segs_removed,
! 		 CheckpointStats.ckpt_segs_recycled,
! 		 write_secs, write_usecs / 1000,
! 		 sync_secs, sync_usecs / 1000,
! 		 total_secs, total_usecs / 1000);
  }
  
  /*
--- 5857,5882 ----
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
! 	if (flags & CHECKPOINT_RESTARTPOINT)
! 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 			 CheckpointStats.ckpt_bufs_written,
! 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 			 write_secs, write_usecs / 1000,
! 			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
! 	else
! 		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! 			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 			 CheckpointStats.ckpt_bufs_written,
! 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 			 CheckpointStats.ckpt_segs_added,
! 			 CheckpointStats.ckpt_segs_removed,
! 			 CheckpointStats.ckpt_segs_recycled,
! 			 write_secs, write_usecs / 1000,
! 			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  }
  
  /*
***************
*** 5702,5718 ****
  	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecData rdata;
- 	uint32		freespace;
  	uint32		_logId;
  	uint32		_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
  
  	/*
  	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! 	 * (This is just pro forma, since in the present system structure there is
! 	 * only one process that is allowed to issue checkpoints at any given
! 	 * time.)
  	 */
  	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
--- 5901,5916 ----
  	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecData rdata;
  	uint32		_logId;
  	uint32		_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	bool		leavingArchiveRecovery = false;
  
  	/*
  	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! 	 * That shouldn't be happening, but checkpoints are an important aspect
! 	 * of our resilience, so we take no chances.
  	 */
  	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
***************
*** 5727,5741 ****
--- 5925,5948 ----
  	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
  
  	/*
+ 	 * Find out if this is the first checkpoint after archive recovery.
+ 	 */
+ 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 	leavingArchiveRecovery = (ControlFile->state == DB_IN_ARCHIVE_RECOVERY);
+ 	LWLockRelease(ControlFileLock);
+ 
+ 	/*
  	 * Use a critical section to force system panic if we have trouble.
  	 */
  	START_CRIT_SECTION();
  
  	if (shutdown)
  	{
+ 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  		ControlFile->state = DB_SHUTDOWNING;
  		ControlFile->time = (pg_time_t) time(NULL);
  		UpdateControlFile();
+ 		LWLockRelease(ControlFileLock);
  	}
  
  	/*
***************
*** 5750,5840 ****
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
  	checkPoint.time = (pg_time_t) time(NULL);
  
! 	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
! 	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
! 	 * any XLOG records since the start of the last checkpoint, skip the
! 	 * checkpoint.	The idea here is to avoid inserting duplicate checkpoints
! 	 * when the system is idle. That wastes log space, and more importantly it
! 	 * exposes us to possible loss of both current and previous checkpoint
! 	 * records if the machine crashes just as we're writing the update.
! 	 * (Perhaps it'd make even more sense to checkpoint only when the previous
! 	 * checkpoint record is in a different xlog page?)
! 	 *
! 	 * We have to make two tests to determine that nothing has happened since
! 	 * the start of the last checkpoint: current insertion point must match
! 	 * the end of the last checkpoint record, and its redo pointer must point
! 	 * to itself.
! 	 */
! 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
  	{
! 		XLogRecPtr	curInsert;
  
! 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! 		if (curInsert.xlogid == ControlFile->checkPoint.xlogid &&
! 			curInsert.xrecoff == ControlFile->checkPoint.xrecoff +
! 			MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
! 			ControlFile->checkPoint.xlogid ==
! 			ControlFile->checkPointCopy.redo.xlogid &&
! 			ControlFile->checkPoint.xrecoff ==
! 			ControlFile->checkPointCopy.redo.xrecoff)
  		{
! 			LWLockRelease(WALInsertLock);
! 			LWLockRelease(CheckpointLock);
! 			END_CRIT_SECTION();
! 			return;
! 		}
! 	}
  
! 	/*
! 	 * Compute new REDO record ptr = location of next XLOG record.
! 	 *
! 	 * NB: this is NOT necessarily where the checkpoint record itself will be,
! 	 * since other backends may insert more XLOG records while we're off doing
! 	 * the buffer flush work.  Those XLOG records are logically after the
! 	 * checkpoint, even though physically before it.  Got that?
! 	 */
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
! 	/*
! 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock AND the info_lck.
! 	 *
! 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
! 	 * pointing past where it really needs to point.  This is okay; the only
! 	 * consequence is that XLogInsert might back up whole buffers that it
! 	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
! 	 * XLogInserts that happen while we are dumping buffers must assume that
! 	 * their buffer changes are not included in the checkpoint.
! 	 */
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
  	/*
- 	 * Now we can release WAL insert lock, allowing other xacts to proceed
- 	 * while we are flushing disk buffers.
- 	 */
- 	LWLockRelease(WALInsertLock);
- 
- 	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
  	 * to log anything if we decided to skip the checkpoint.
  	 */
--- 5957,6025 ----
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
  	checkPoint.time = (pg_time_t) time(NULL);
  
! 	if (leavingArchiveRecovery)
! 		checkPoint.redo = GetRedoLocationForArchiveCheckpoint();
! 	else
  	{
! 		/*
! 		 * We must hold WALInsertLock while examining insert state to determine
! 		 * the checkpoint REDO pointer.
! 		 */
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
! 		/*
! 		 * If this isn't a shutdown or forced checkpoint, and we have not inserted
! 		 * any XLOG records since the start of the last checkpoint, skip the
! 		 * checkpoint.	The idea here is to avoid inserting duplicate checkpoints
! 		 * when the system is idle. That wastes log space, and more importantly it
! 		 * exposes us to possible loss of both current and previous checkpoint
! 		 * records if the machine crashes just as we're writing the update.
! 		 * (Perhaps it'd make even more sense to checkpoint only when the previous
! 		 * checkpoint record is in a different xlog page?)
! 		 *
! 		 * We have to make two tests to determine that nothing has happened since
! 		 * the start of the last checkpoint: current insertion point must match
! 		 * the end of the last checkpoint record, and its redo pointer must point
! 		 * to itself.
! 		 */
! 		if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
  		{
! 			XLogRecPtr	curInsert;
  
! 			INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! 			if (curInsert.xlogid == ControlFile->checkPoint.xlogid &&
! 				curInsert.xrecoff == ControlFile->checkPoint.xrecoff +
! 				MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
! 				ControlFile->checkPoint.xlogid ==
! 				ControlFile->checkPointCopy.redo.xlogid &&
! 				ControlFile->checkPoint.xrecoff ==
! 				ControlFile->checkPointCopy.redo.xrecoff)
! 			{
! 				LWLockRelease(WALInsertLock);
! 				LWLockRelease(CheckpointLock);
! 				END_CRIT_SECTION();
! 				return;
! 			}
! 		}
  
! 		/*
! 		 * Compute new REDO record ptr = location of next XLOG record.
! 		 *
! 		 * NB: this is NOT necessarily where the checkpoint record itself will be,
! 		 * since other backends may insert more XLOG records while we're off doing
! 		 * the buffer flush work.  Those XLOG records are logically after the
! 		 * checkpoint, even though physically before it.  Got that?
! 		 */
! 		checkPoint.redo = GetRedoLocationForCheckpoint();
  
! 		/*
! 		 * Now we can release WAL insert lock, allowing other xacts to proceed
! 		 * while we are flushing disk buffers.
! 		 */
! 		LWLockRelease(WALInsertLock);
  	}
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
  	 * to log anything if we decided to skip the checkpoint.
  	 */
***************
*** 5941,5958 ****
  	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
  
  	/*
! 	 * Update the control file.
  	 */
  	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  	if (shutdown)
  		ControlFile->state = DB_SHUTDOWNED;
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
  	ControlFile->checkPoint = ProcLastRecPtr;
  	ControlFile->checkPointCopy = checkPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
  	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
--- 6126,6168 ----
  	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
  
  	/*
! 	 * Update the control file. In 8.4, this routine becomes the primary
! 	 * point for recording changes of state in the control file at the
! 	 * end of recovery. Postmaster state already shows us being in
! 	 * normal running mode, but it is only after this point that we
! 	 * no longer need to re-perform recovery if we crash.  Note that
! 	 * this is executed by bgwriter after the Startup process has exited.
  	 */
  	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
  	if (shutdown)
  		ControlFile->state = DB_SHUTDOWNED;
+ 	else
+ 		ControlFile->state = DB_IN_PRODUCTION;
+ 
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
  	ControlFile->checkPoint = ProcLastRecPtr;
  	ControlFile->checkPointCopy = checkPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
+ 
  	LWLockRelease(ControlFileLock);
  
+ 	if (leavingArchiveRecovery)
+ 	{
+ 		/*
+ 		 * Rename the config file out of the way, so that we don't accidentally
+ 		 * re-enter archive recovery mode in a subsequent crash. Prior to
+ 		 * 8.4 this step was performed at end of exitArchiveRecovery().
+ 		 */
+ 		unlink(RECOVERY_COMMAND_DONE);
+ 		if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+ 			ereport(ERROR,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 							RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+ 	}
+ 
  	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
***************
*** 5999,6014 ****
  	 * in subtrans.c).	During recovery, though, we mustn't do this because
  	 * StartupSUBTRANS hasn't been called yet.
  	 */
! 	if (!InRecovery)
! 		TruncateSUBTRANS(GetOldestXmin(true, false));
  
  	/* All real work is done, but log before releasing lock. */
  	if (log_checkpoints)
! 		LogCheckpointEnd();
  
  	LWLockRelease(CheckpointLock);
  }
  
  /*
   * Flush all data in shared memory to disk, and fsync
   *
--- 6209,6268 ----
  	 * in subtrans.c).	During recovery, though, we mustn't do this because
  	 * StartupSUBTRANS hasn't been called yet.
  	 */
! 	TruncateSUBTRANS(GetOldestXmin(true, false));
  
  	/* All real work is done, but log before releasing lock. */
  	if (log_checkpoints)
! 		LogCheckpointEnd(flags);
  
  	LWLockRelease(CheckpointLock);
  }
  
+ /* 
+  * GetRedoLocationForCheckpoint()
+  *
+  * When !IsRecoveryProcessingMode() this must be called while holding
+  * WALInsertLock.
+  */
+ static XLogRecPtr
+ GetRedoLocationForCheckpoint(void)
+ {
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	uint32			freespace;
+ 	XLogRecPtr		redo;
+ 
+ 	freespace = INSERT_FREESPACE(Insert);
+ 	if (freespace < SizeOfXLogRecord)
+ 	{
+ 		(void) AdvanceXLInsertBuffer(false);
+ 		/* OK to ignore update return flag, since we will do flush anyway */
+ 		freespace = INSERT_FREESPACE(Insert);
+ 	}
+ 	INSERT_RECPTR(redo, Insert, Insert->curridx);
+ 
+ 	/*
+ 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+ 	 * must be done while holding the insert lock AND the info_lck.
+ 	 *
+ 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+ 	 * pointing past where it really needs to point.  This is okay; the only
+ 	 * consequence is that XLogInsert might back up whole buffers that it
+ 	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
+ 	 * XLogInserts that happen while we are dumping buffers must assume that
+ 	 * their buffer changes are not included in the checkpoint.
+ 	 */
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = redo;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 	}
+ 
+ 	return redo;
+ }
+ 
  /*
   * Flush all data in shared memory to disk, and fsync
   *
***************
*** 6073,6101 ****
  			}
  	}
  
  	/*
! 	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);
  
  	/*
! 	 * Update pg_control so that any subsequent crash will restart from this
! 	 * checkpoint.	Note: ReadRecPtr gives the XLOG address of the checkpoint
! 	 * record itself.
  	 */
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
! 	ControlFile->checkPoint = ReadRecPtr;
! 	ControlFile->checkPointCopy = *checkPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
  
! 	ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
  			(errmsg("recovery restart point at %X/%X",
! 					checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
  	if (recoveryLastXTime)
! 		ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! 				(errmsg("last completed transaction was at log time %s",
! 						timestamptz_to_str(recoveryLastXTime))));
  }
  
  /*
--- 6327,6395 ----
  			}
  	}
  
+ 	RequestRestartPoint(ReadRecPtr, checkPoint, reachedSafeStartPoint);
+ }
+ 
+ /*
+  * As of 8.4, RestartPoints are always created by the bgwriter
+  * once we have reachedSafeStartPoint. We use bgwriter's shared memory
+  * area wherever we call it from, to keep better code structure.
+  */
+ void
+ CreateRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, int flags)
+ {
+ 	if (log_checkpoints)
+ 	{
+ 		/*
+ 		 * Prepare to accumulate statistics.
+ 		 */
+ 
+ 		MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+ 		CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ 
+ 		LogCheckpointStart(CHECKPOINT_RESTARTPOINT | flags);
+ 	}
+ 
  	/*
! 	 * Acquire CheckpointLock to ensure only one restartpoint happens at a time.
! 	 * We rely on this lock to ensure that the startup process doesn't exit
! 	 * Recovery while we are half way through a restartpoint.
  	 */
! 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
! 
! 	CheckPointGuts(restartPoint->redo, CHECKPOINT_RESTARTPOINT | flags);
  
  	/*
! 	 * Update pg_control, using current time
  	 */
+ 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
! 	ControlFile->checkPoint = ReadPtr;
! 	ControlFile->checkPointCopy = *restartPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
+ 	LWLockRelease(ControlFileLock);
  
! 	/*
! 	 * Currently, there is no need to truncate pg_subtrans during recovery.
! 	 * If we did do that, we would need to have called StartupSUBTRANS()
! 	 * already, and then TruncateSUBTRANS() would go here.
! 	 */
! 
! 	/* All real work is done, but log before releasing lock. */
! 	if (log_checkpoints)
! 		LogCheckpointEnd(CHECKPOINT_RESTARTPOINT);
! 
! 	ereport((log_checkpoints ? LOG : DEBUG2),
  			(errmsg("recovery restart point at %X/%X",
! 					restartPoint->redo.xlogid, restartPoint->redo.xrecoff)));
! 
  	if (recoveryLastXTime)
! 		ereport((log_checkpoints ? LOG : DEBUG2),
! 			(errmsg("last completed transaction was at log time %s",
! 					timestamptz_to_str(recoveryLastXTime))));
! 
! 	LWLockRelease(CheckpointLock);
  }
  
  /*
***************
*** 6160,6166 ****
  }
  
  /*
!  * XLOG resource manager's routines
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 6454,6516 ----
  }
  
  /*
!  * exitRecovery()
!  *
!  * Exit recovery state and write an XLOG_RECOVERY_END record. This is the
!  * only record type that can record a change of timelineID. We assume
!  * caller has already set ThisTimeLineID, if appropriate.
!  */
! static void
! exitRecovery(void)
! {
! 	XLogRecData rdata;
! 
! 	rdata.buffer = InvalidBuffer;
! 	rdata.data = (char *) (&ThisTimeLineID);
! 	rdata.len = sizeof(TimeLineID);
! 	rdata.next = NULL;
! 
! 	/*
! 	 * If a restartpoint is in progress, we will not be able to acquire
! 	 * CheckpointLock. If bgwriter is still working on it then send
! 	 * a second signal to nudge bgwriter to go faster so we can avoid delay.
! 	 * Then wait for the lock, so we know the restartpoint has completed. We do
! 	 * this because we don't want to interrupt the restartpoint halfway
! 	 * through, which might leave us in a mess and we want to be robust. We're
! 	 * going to checkpoint soon anyway, so it's not wasted effort.
! 	 */
! 	if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
! 		LWLockRelease(CheckpointLock);
! 	else
! 	{
! 		RequestRestartPointCompletion();
! 		ereport(LOG,
! 			(errmsg("startup process waiting for restartpoint to complete")));
! 		LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
! 		LWLockRelease(CheckpointLock);
! 	}	
! 
! 	/*
! 	 * This is the only type of WAL message that can be inserted during
! 	 * recovery. This ensures that we don't allow others to get access
! 	 * until after we have changed state.
! 	 */
! 	(void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
! 
! 	/*
! 	 * We don't XLogFlush() here otherwise we'll end up zeroing the WAL
! 	 * file ourselves. So just let bgwriter's forthcoming checkpoint do
! 	 * that for us.
! 	 */
! 
! 	InRecovery = false;
! }
! 
! /*
!  * XLOG resource manager's routines.
!  *
!  * Definitions of message info are in include/catalog/pg_control.h,
!  * though not all messages relate to control file processing.
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6195,6215 ****
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
  		/*
! 		 * TLI may change in a shutdown checkpoint, but it shouldn't decrease
  		 */
! 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  		{
! 			if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
  				!list_member_int(expectedTLIs,
! 								 (int) checkPoint.ThisTimeLineID))
  				ereport(PANIC,
! 						(errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
! 								checkPoint.ThisTimeLineID, ThisTimeLineID)));
  			/* Following WAL records should be run with new TLI */
! 			ThisTimeLineID = checkPoint.ThisTimeLineID;
  		}
- 
- 		RecoveryRestartPoint(&checkPoint);
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
--- 6545,6582 ----
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
  		/*
! 		 * TLI no longer changes at shutdown checkpoint, since as of 8.4,
! 		 * shutdown checkpoints only occur at shutdown. Much less confusing.
  		 */
! 
! 		RecoveryRestartPoint(&checkPoint);
! 	}
! 	else if (info == XLOG_RECOVERY_END)
! 	{
! 		TimeLineID	tli;
! 
! 		memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
! 
! 		/*
! 		 * TLI may change when recovery ends, but it shouldn't decrease.
! 		 *
! 		 * This is the only WAL record that can tell us to change timelineID
! 		 * while we process WAL records. 
! 		 *
! 		 * We can *choose* to stop recovery at any point, generating a
! 		 * new timelineID which is recorded using this record type.
! 		 */
! 		if (tli != ThisTimeLineID)
  		{
! 			if (tli < ThisTimeLineID ||
  				!list_member_int(expectedTLIs,
! 								 (int) tli))
  				ereport(PANIC,
! 						(errmsg("unexpected timeline ID %u (after %u) at recovery end record",
! 								tli, ThisTimeLineID)));
  			/* Following WAL records should be run with new TLI */
! 			ThisTimeLineID = tli;
  		}
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
***************
*** 6232,6238 ****
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/* TLI should not change in an on-line checkpoint */
  		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  			ereport(PANIC,
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
--- 6599,6605 ----
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/* TLI must not change at a checkpoint */
  		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  			ereport(PANIC,
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
***************
*** 6290,6296 ****
  }
  
  #ifdef WAL_DEBUG
- 
  static void
  xlog_outrec(StringInfo buf, XLogRecord *record)
  {
--- 6657,6662 ----
***************
*** 6310,6316 ****
  }
  #endif   /* WAL_DEBUG */
  
- 
  /*
   * Return the (possible) sync flag used for opening a file, depending on the
   * value of the GUC wal_sync_method.
--- 6676,6681 ----
***************
*** 6449,6454 ****
--- 6814,6820 ----
  	uint32		_logSeg;
  	struct stat stat_buf;
  	FILE	   *fp;
+ 	bool		immediate_checkpoint = false;
  
  	if (!superuser())
  		ereport(ERROR,
***************
*** 6502,6516 ****
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
  	{
  		/*
  		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  		 * page problems, this guarantees that two successive backup runs will
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
- 		 *
- 		 * We don't use CHECKPOINT_IMMEDIATE, hence this can take awhile.
  		 */
! 		RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
--- 6868,6905 ----
  	/* Ensure we release forcePageWrites if fail below */
  	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
  	{
+ 		int			flags = CHECKPOINT_FORCE | CHECKPOINT_WAIT;
+ 
+ 		/* 
+ 		 * We support both variants of the pg_start_backup() SQL function
+ 		 * with a single C function. If the two-parameter variant was called,
+ 		 * then get the value of the second parameter.
+ 		 */
+ 		if (PG_NARGS() == 2)
+ 		{
+ 			immediate_checkpoint = PG_GETARG_BOOL(1);
+ 
+ 			/* By default, this can take some time */
+ 			if (immediate_checkpoint)
+ 			{
+ 				flags |= CHECKPOINT_IMMEDIATE;
+ 				ereport(NOTICE,
+ 					(errmsg("pg_start_backup() signalling for immediate checkpoint")));
+ 			}
+ 			else
+ 				ereport(NOTICE,
+ 					(errmsg("pg_start_backup() signalling for smooth checkpoint"
+ 							", may last up to %d s",
+ 							(int) (CheckPointTimeout * CheckPointCompletionTarget))));
+ 		}
+ 
  		/*
  		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  		 * page problems, this guarantees that two successive backup runs will
  		 * have different checkpoint positions and hence different history
  		 * file names, even if nothing happened in between.
  		 */
! 		RequestCheckpoint(flags);
  
  		/*
  		 * Now we need to fetch the checkpoint record location, and also its
***************
*** 6639,6651 ****
  	LWLockRelease(WALInsertLock);
  
  	/*
! 	 * Force a switch to a new xlog segment file, so that the backup is valid
  	 * as soon as archiver moves out the current segment file. We'll report
  	 * the end address of the XLOG SWITCH record as the backup stopping point.
  	 */
  	stoppoint = RequestXLogSwitch();
  
  	XLByteToSeg(stoppoint, _logId, _logSeg);
  	XLogFileName(stopxlogfilename, ThisTimeLineID, _logId, _logSeg);
  
  	/* Use the log timezone here, not the session timezone */
--- 7028,7049 ----
  	LWLockRelease(WALInsertLock);
  
  	/*
! 	 * Request switch to a new xlog segment file, so that the backup is valid
  	 * as soon as archiver moves out the current segment file. We'll report
  	 * the end address of the XLOG SWITCH record as the backup stopping point.
  	 */
  	stoppoint = RequestXLogSwitch();
  
  	XLByteToSeg(stoppoint, _logId, _logSeg);
+ 
+ 	/*
+ 	 * If we didn't actually switch xlog files then there is nothing in
+ 	 * this file for us to wait for, so set stopxlogfilename to be the
+ 	 * previous file instead. We still report the same ending location.
+ 	 */
+ 	if ((stoppoint.xrecoff % XLogSegSize) == 0)
+ 		PrevLogSeg(_logId, _logSeg);
+ 
  	XLogFileName(stopxlogfilename, ThisTimeLineID, _logId, _logSeg);
  
  	/* Use the log timezone here, not the session timezone */
***************
*** 6741,6747 ****
  	BackupHistoryFileName(histfilepath, ThisTimeLineID, _logId, _logSeg,
  						  startpoint.xrecoff % XLogSegSize);
  
! 	seconds_before_warning = 60;
  	waits = 0;
  
  	while (XLogArchiveIsBusy(stopxlogfilename) ||
--- 7139,7145 ----
  	BackupHistoryFileName(histfilepath, ThisTimeLineID, _logId, _logSeg,
  						  startpoint.xrecoff % XLogSegSize);
  
! 	seconds_before_warning = 10;
  	waits = 0;
  
  	while (XLogArchiveIsBusy(stopxlogfilename) ||
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.51
diff -c -r1.51 bgwriter.c
*** src/backend/postmaster/bgwriter.c	11 Aug 2008 11:05:11 -0000	1.51
--- src/backend/postmaster/bgwriter.c	30 Sep 2008 18:33:55 -0000
***************
*** 49,54 ****
--- 49,55 ----
  #include <unistd.h>
  
  #include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
***************
*** 130,135 ****
--- 131,143 ----
  
  	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
  
+ 	/* 
+ 	 * When the Startup process wants bgwriter to perform a restartpoint, it 
+ 	 * sets these fields so that we can update the control file afterwards.
+ 	 */
+ 	XLogRecPtr	ReadPtr;		/* Requested log pointer */
+ 	CheckPoint  restartPoint;	/* restartPoint data for ControlFile */
+ 
  	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
  
  	int			num_requests;	/* current # of requests */
***************
*** 166,172 ****
  
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;
  static double ckpt_cached_elapsed;
  
  static pg_time_t last_checkpoint_time;
--- 174,180 ----
  
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;	/* not used if IsRecoveryProcessingMode */
  static double ckpt_cached_elapsed;
  
  static pg_time_t last_checkpoint_time;
***************
*** 198,203 ****
--- 206,212 ----
  {
  	sigjmp_buf	local_sigjmp_buf;
  	MemoryContext bgwriter_context;
+ 	bool		BgWriterRecoveryMode;
  
  	BgWriterShmem->bgwriter_pid = MyProcPid;
  	am_bg_writer = true;
***************
*** 356,371 ****
  	 */
  	PG_SETMASK(&UnBlockSig);
  
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
- 		bool		do_checkpoint = false;
- 		int			flags = 0;
- 		pg_time_t	now;
- 		int			elapsed_secs;
- 
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
--- 365,381 ----
  	 */
  	PG_SETMASK(&UnBlockSig);
  
+ 	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+ 
+ 	if (BgWriterRecoveryMode)
+ 		elog(DEBUG1, "bgwriter starting during recovery, pid = %d",
+ 			(int) BgWriterShmem->bgwriter_pid);
+ 
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
***************
*** 383,501 ****
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
- 		if (checkpoint_requested)
- 		{
- 			checkpoint_requested = false;
- 			do_checkpoint = true;
- 			BgWriterStats.m_requested_checkpoints++;
- 		}
- 		if (shutdown_requested)
- 		{
- 			/*
- 			 * From here on, elog(ERROR) should end with exit(1), not send
- 			 * control back to the sigsetjmp block above
- 			 */
- 			ExitOnAnyError = true;
- 			/* Close down the database */
- 			ShutdownXLOG(0, 0);
- 			DumpFreeSpaceMap(0, 0);
- 			/* Normal exit from the bgwriter is here */
- 			proc_exit(0);		/* done */
- 		}
  
! 		/*
! 		 * Force a checkpoint if too much time has elapsed since the last one.
! 		 * Note that we count a timed checkpoint in stats only when this
! 		 * occurs without an external request, but we set the CAUSE_TIME flag
! 		 * bit even if there is also an external request.
! 		 */
! 		now = (pg_time_t) time(NULL);
! 		elapsed_secs = now - last_checkpoint_time;
! 		if (elapsed_secs >= CheckPointTimeout)
  		{
! 			if (!do_checkpoint)
! 				BgWriterStats.m_timed_checkpoints++;
! 			do_checkpoint = true;
! 			flags |= CHECKPOINT_CAUSE_TIME;
  		}
! 
! 		/*
! 		 * Do a checkpoint if requested, otherwise do one cycle of
! 		 * dirty-buffer writing.
! 		 */
! 		if (do_checkpoint)
  		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
  
  			/*
! 			 * Atomically fetch the request flags to figure out what kind of a
! 			 * checkpoint we should perform, and increase the started-counter
! 			 * to acknowledge that we've started a new checkpoint.
  			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			flags |= bgs->ckpt_flags;
! 			bgs->ckpt_flags = 0;
! 			bgs->ckpt_started++;
! 			SpinLockRelease(&bgs->ckpt_lck);
  
  			/*
! 			 * We will warn if (a) too soon since last checkpoint (whatever
! 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 			 * since the last checkpoint start.  Note in particular that this
! 			 * implementation will not generate warnings caused by
! 			 * CheckPointTimeout < CheckPointWarning.
  			 */
! 			if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! 				elapsed_secs < CheckPointWarning)
! 				ereport(LOG,
! 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 								elapsed_secs),
! 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  
! 			/*
! 			 * Initialize bgwriter-private variables used during checkpoint.
! 			 */
! 			ckpt_active = true;
! 			ckpt_start_recptr = GetInsertRecPtr();
! 			ckpt_start_time = now;
! 			ckpt_cached_elapsed = 0;
  
! 			/*
! 			 * Do the checkpoint.
! 			 */
! 			CreateCheckPoint(flags);
! 
! 			/*
! 			 * After any checkpoint, close all smgr files.	This is so we
! 			 * won't hang onto smgr references to deleted files indefinitely.
! 			 */
! 			smgrcloseall();
! 
! 			/*
! 			 * Indicate checkpoint completion to any waiting backends.
! 			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			bgs->ckpt_done = bgs->ckpt_started;
! 			SpinLockRelease(&bgs->ckpt_lck);
! 
! 			ckpt_active = false;
! 
! 			/*
! 			 * Note we record the checkpoint start time not end time as
! 			 * last_checkpoint_time.  This is so that time-driven checkpoints
! 			 * happen at a predictable spacing.
! 			 */
! 			last_checkpoint_time = now;
  		}
- 		else
- 			BgBufferSync();
- 
- 		/* Check for archive_timeout and switch xlog files if necessary. */
- 		CheckArchiveTimeout();
- 
- 		/* Nap for the configured time. */
- 		BgWriterNap();
  	}
  }
  
--- 393,599 ----
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
  
! 		if (BgWriterRecoveryMode)
  		{
! 			if (shutdown_requested)
! 			{
! 				/*
! 				 * From here on, elog(ERROR) should end with exit(1), not send
! 				 * control back to the sigsetjmp block above
! 				 */
! 				ExitOnAnyError = true;
! 				/* Normal exit from the bgwriter is here */
! 				proc_exit(0);		/* done */
! 			}
! 
! 			if (!IsRecoveryProcessingMode())
! 			{
! 				elog(DEBUG2, "bgwriter changing from recovery to normal mode");
! 
! 				InitXLOGAccess();
! 				BgWriterRecoveryMode = false;
! 
! 				/*
! 				 * Start time-driven events from now
! 				 */
! 				last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
! 
! 				/* 
! 				 * Notice that we do *not* act on a checkpoint_requested
! 				 * state at this point. We have changed mode, so we wish to
! 				 * perform a checkpoint not a restartpoint.
! 				 */
! 				continue;
! 			}
! 
! 			if (checkpoint_requested) 
! 			{
! 				XLogRecPtr		ReadPtr;
! 				CheckPoint		restartPoint;
! 
! 				checkpoint_requested = false;
! 
! 				/*
! 				 * Initialize bgwriter-private variables used during checkpoint.
! 				 */
! 				ckpt_active = true;
! 				ckpt_start_time = (pg_time_t) time(NULL);
! 				ckpt_cached_elapsed = 0;
! 
! 				/*
! 				 * Get the requested values from shared memory that the 
! 				 * Startup process has put there for us.
! 				 */
! 				SpinLockAcquire(&BgWriterShmem->ckpt_lck);
! 				ReadPtr = BgWriterShmem->ReadPtr;
! 				memcpy(&restartPoint, &BgWriterShmem->restartPoint, sizeof(CheckPoint));
! 				SpinLockRelease(&BgWriterShmem->ckpt_lck);
! 
! 				/* Use smoothed writes, until interrupted if ever */
! 				CreateRestartPoint(ReadPtr, &restartPoint, 0);
! 
! 				/*
! 				 * After any checkpoint, close all smgr files.	This is so we
! 				 * won't hang onto smgr references to deleted files indefinitely.
! 				 */
! 				smgrcloseall();
! 
! 				ckpt_active = false;
! 				checkpoint_requested = false;
! 			}
! 			else
! 			{
! 				/* Clean buffers dirtied by recovery */
! 				BgBufferSync();
! 
! 				/* Nap for the configured time. */
! 				BgWriterNap();
! 			}
  		}
! 		else	/* Normal processing */
  		{
! 			bool		do_checkpoint = false;
! 			int			flags = 0;
! 			pg_time_t	now;
! 			int			elapsed_secs;
! 
! 			Assert(!IsRecoveryProcessingMode());
! 
! 			if (checkpoint_requested) 
! 			{
! 				checkpoint_requested = false;
! 				do_checkpoint = true;
! 				BgWriterStats.m_requested_checkpoints++;
! 			}
! 			if (shutdown_requested)
! 			{
! 				/*
! 				 * From here on, elog(ERROR) should end with exit(1), not send
! 				 * control back to the sigsetjmp block above
! 				 */
! 				ExitOnAnyError = true;
! 				/* Close down the database */
! 				ShutdownXLOG(0, 0);
! 				DumpFreeSpaceMap(0, 0);
! 				/* Normal exit from the bgwriter is here */
! 				proc_exit(0);		/* done */
! 			}
  
  			/*
! 			 * Force a checkpoint if too much time has elapsed since the last one.
! 			 * Note that we count a timed checkpoint in stats only when this
! 			 * occurs without an external request, but we set the CAUSE_TIME flag
! 			 * bit even if there is also an external request.
  			 */
! 			now = (pg_time_t) time(NULL);
! 			elapsed_secs = now - last_checkpoint_time;
! 			if (elapsed_secs >= CheckPointTimeout)
! 			{
! 				if (!do_checkpoint)
! 					BgWriterStats.m_timed_checkpoints++;
! 				do_checkpoint = true;
! 				flags |= CHECKPOINT_CAUSE_TIME;
! 			}
  
  			/*
! 			 * Do a checkpoint if requested, otherwise do one cycle of
! 			 * dirty-buffer writing.
  			 */
! 			if (do_checkpoint)
! 			{
! 				/* use volatile pointer to prevent code rearrangement */
! 				volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 
! 				/*
! 				 * Atomically fetch the request flags to figure out what kind of a
! 				 * checkpoint we should perform, and increase the started-counter
! 				 * to acknowledge that we've started a new checkpoint.
! 				 */
! 				SpinLockAcquire(&bgs->ckpt_lck);
! 				flags |= bgs->ckpt_flags;
! 				bgs->ckpt_flags = 0;
! 				bgs->ckpt_started++;
! 				SpinLockRelease(&bgs->ckpt_lck);
! 
! 				/*
! 				 * We will warn if (a) too soon since last checkpoint (whatever
! 				 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 				 * since the last checkpoint start.  Note in particular that this
! 				 * implementation will not generate warnings caused by
! 				 * CheckPointTimeout < CheckPointWarning.
! 				 */
! 				if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! 					elapsed_secs < CheckPointWarning)
! 					ereport(LOG,
! 							(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 									elapsed_secs),
! 							 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! 
! 				/*
! 				 * Initialize bgwriter-private variables used during checkpoint.
! 				 */
! 				ckpt_active = true;
! 				ckpt_start_recptr = GetInsertRecPtr();
! 				ckpt_start_time = now;
! 				ckpt_cached_elapsed = 0;
! 
! 				/*
! 				 * Do the checkpoint.
! 				 */
! 				CreateCheckPoint(flags);
! 
! 				/*
! 				 * After any checkpoint, close all smgr files.	This is so we
! 				 * won't hang onto smgr references to deleted files indefinitely.
! 				 */
! 				smgrcloseall();
! 
! 				/*
! 				 * Indicate checkpoint completion to any waiting backends.
! 				 */
! 				SpinLockAcquire(&bgs->ckpt_lck);
! 				bgs->ckpt_done = bgs->ckpt_started;
! 				SpinLockRelease(&bgs->ckpt_lck);
! 
! 				ckpt_active = false;
! 
! 				/*
! 				 * Note we record the checkpoint start time not end time as
! 				 * last_checkpoint_time.  This is so that time-driven checkpoints
! 				 * happen at a predictable spacing.
! 				 */
! 				last_checkpoint_time = now;
! 			}
! 			else
! 				BgBufferSync();
  
! 			/* Check for archive_timeout and switch xlog files if necessary. */
! 			CheckArchiveTimeout();
  
! 			/* Nap for the configured time. */
! 			BgWriterNap();
  		}
  	}
  }
  
***************
*** 588,594 ****
  		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
! 		AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
--- 686,693 ----
  		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
! 		if (!IsRecoveryProcessingMode())
! 			AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
***************
*** 642,647 ****
--- 741,759 ----
  	if (!am_bg_writer)
  		return;
  
+ 	/* Perform minimal duties during recovery and skip wait if requested */
+ 	if (IsRecoveryProcessingMode())
+ 	{
+ 		BgBufferSync();
+ 
+ 		if (!shutdown_requested &&
+ 			!checkpoint_requested &&
+ 			IsCheckpointOnSchedule(progress))
+ 			BgWriterNap();
+ 
+ 		return;
+ 	}
+ 
  	/*
  	 * Perform the usual bgwriter duties and take a nap, unless we're behind
  	 * schedule, in which case we just try to catch up as quickly as possible.
***************
*** 716,731 ****
  	 * However, it's good enough for our purposes, we're only calculating an
  	 * estimate anyway.
  	 */
! 	recptr = GetInsertRecPtr();
! 	elapsed_xlogs =
! 		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! 		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! 		CheckPointSegments;
! 
! 	if (progress < elapsed_xlogs)
  	{
! 		ckpt_cached_elapsed = elapsed_xlogs;
! 		return false;
  	}
  
  	/*
--- 828,846 ----
  	 * However, it's good enough for our purposes, we're only calculating an
  	 * estimate anyway.
  	 */
! 	if (!IsRecoveryProcessingMode())
  	{
! 		recptr = GetInsertRecPtr();
! 		elapsed_xlogs =
! 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! 			CheckPointSegments;
! 
! 		if (progress < elapsed_xlogs)
! 		{
! 			ckpt_cached_elapsed = elapsed_xlogs;
! 			return false;
! 		}
  	}
  
  	/*
***************
*** 967,972 ****
--- 1082,1158 ----
  }
  
  /*
+  * Always runs in Startup process (see xlog.c)
+  */
+ void
+ RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter)
+ {
+ 	/*
+ 	 * Should we just do it ourselves?
+ 	 */
+ 	if (!IsPostmasterEnvironment || !sendToBGWriter)
+ 	{
+ 		CreateRestartPoint(ReadPtr, restartPoint, CHECKPOINT_IMMEDIATE);
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * Push requested values into shared memory, then signal to request restartpoint.
+ 	 */
+ 	if (BgWriterShmem->bgwriter_pid == 0)
+ 		elog(LOG, "could not request restartpoint because bgwriter not running");
+ 
+ #ifdef NOT_USED
+ 	elog(LOG, "tli = %u nextXidEpoch = %u nextXid = %u nextOid = %u",
+ 		restartPoint->ThisTimeLineID,
+ 		restartPoint->nextXidEpoch,
+ 		restartPoint->nextXid,
+ 		restartPoint->nextOid);
+ #endif
+ 
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	BgWriterShmem->ReadPtr = ReadPtr;
+ 	memcpy(&BgWriterShmem->restartPoint, restartPoint, sizeof(CheckPoint));
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ 		elog(LOG, "could not signal for restartpoint: %m");	
+ }
+ 
+ /* 
+  * Sends another checkpoint request signal to bgwriter, which causes it
+  * to avoid smoothed writes and continue processing as if it had been
+  * called with CHECKPOINT_IMMEDIATE. This is used at the end of recovery.
+  */
+ void
+ RequestRestartPointCompletion(void)
+ {
+ 	if (BgWriterShmem->bgwriter_pid != 0 &&
+ 		kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ 		elog(LOG, "could not signal for restartpoint immediate: %m");
+ }
+ 
+ XLogRecPtr
+ GetRedoLocationForArchiveCheckpoint(void)
+ {
+ 	XLogRecPtr	redo;
+ 
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	redo = BgWriterShmem->ReadPtr;
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	return redo;
+ }
+ 
+ void
+ SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo)
+ {
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	BgWriterShmem->ReadPtr = redo;
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ }
+ 
+ /*
   * ForwardFsyncRequest
   *		Forward a file-fsync request from a backend to the bgwriter
   *
Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/postmaster.c,v
retrieving revision 1.565
diff -c -r1.565 postmaster.c
*** src/backend/postmaster/postmaster.c	23 Sep 2008 20:35:38 -0000	1.565
--- src/backend/postmaster/postmaster.c	30 Sep 2008 17:15:15 -0000
***************
*** 254,259 ****
--- 254,264 ----
  {
  	PM_INIT,					/* postmaster starting */
  	PM_STARTUP,					/* waiting for startup subprocess */
+ 	PM_RECOVERY,				/* consistent recovery mode; state only
+ 								 * entered for archive and streaming recovery,
+ 								 * and only after the point where
+ 								 * all data is in a consistent state.
+ 								 */
  	PM_RUN,						/* normal "database is alive" state */
  	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
  	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
***************
*** 1302,1308 ****
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 && pmState == PM_RUN)
  			BgWriterPID = StartBackgroundWriter();
  
  		/*
--- 1307,1313 ----
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
  			BgWriterPID = StartBackgroundWriter();
  
  		/*
***************
*** 2116,2122 ****
  		if (pid == StartupPID)
  		{
  			StartupPID = 0;
! 			Assert(pmState == PM_STARTUP);
  
  			/* FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
--- 2121,2127 ----
  		if (pid == StartupPID)
  		{
  			StartupPID = 0;
! 			Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY);
  
  			/* FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
***************
*** 2157,2167 ****
  			load_role();
  
  			/*
! 			 * Crank up the background writer.	It doesn't matter if this
! 			 * fails, we'll just try again later.
  			 */
! 			Assert(BgWriterPID == 0);
! 			BgWriterPID = StartBackgroundWriter();
  
  			/*
  			 * Likewise, start other special children as needed.  In a restart
--- 2162,2172 ----
  			load_role();
  
  			/*
! 			 * Check whether we need to start background writer, if not
! 			 * already running.
  			 */
! 			if (BgWriterPID == 0)
! 				BgWriterPID = StartBackgroundWriter();
  
  			/*
  			 * Likewise, start other special children as needed.  In a restart
***************
*** 3845,3850 ****
--- 3850,3900 ----
  
  	PG_SETMASK(&BlockSig);
  
+ 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+ 	{
+ 		Assert(pmState == PM_STARTUP);
+ 
+ 		/*
+ 		 * Go to shutdown mode if a shutdown request was pending.
+ 		 */
+ 		if (Shutdown > NoShutdown)
+ 		{
+ 			pmState = PM_WAIT_BACKENDS;
+ 			/* PostmasterStateMachine logic does the rest */
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * Startup process has entered recovery
+ 			 */
+ 			pmState = PM_RECOVERY;
+ 
+ 			/*
+ 			 * Load the flat authorization file into postmaster's cache. The
+ 			 * startup process won't have recomputed this from the database yet,
+ 			 * so it may change following recovery.
+ 			 */
+ 			load_role();
+ 
+ 			/*
+ 			 * Crank up the background writer.	It doesn't matter if this
+ 			 * fails, we'll just try again later.
+ 			 */
+ 			Assert(BgWriterPID == 0);
+ 			BgWriterPID = StartBackgroundWriter();
+ 
+ 			/*
+ 			 * Likewise, start other special children as needed.
+ 			 */
+ 			Assert(PgStatPID == 0);
+ 			PgStatPID = pgstat_start();
+ 
+ 			/* XXX at this point we could accept read-only connections */
+ 			ereport(DEBUG1,
+ 				 (errmsg("database system is in consistent recovery mode")));
+ 		}
+ 	}
+ 
  	if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
  	{
  		/*
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.14
diff -c -r1.14 README
*** src/backend/storage/buffer/README	21 Mar 2008 13:23:28 -0000	1.14
--- src/backend/storage/buffer/README	30 Sep 2008 17:15:15 -0000
***************
*** 264,266 ****
--- 264,275 ----
  This ensures that the page image transferred to disk is reasonably consistent.
  We might miss a hint-bit update or two but that isn't a problem, for the same
  reasons mentioned under buffer access rules.
+ 
+ As of 8.4, background writer starts during recovery mode when there is
+ some form of potentially extended recovery to perform. It performs an
+ identical service to normal processing, except that checkpoints it
+ writes are technically restartpoints. Flushing outstanding WAL for dirty
+ buffers is also skipped, though there shouldn't ever be new WAL entries
+ at that time in any case. We could choose to start background writer
+ immediately but we hold off until we can prove the database is in a 
+ consistent state so that postmaster has a single, clean state change.
Index: src/bin/pg_controldata/pg_controldata.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_controldata/pg_controldata.c,v
retrieving revision 1.41
diff -c -r1.41 pg_controldata.c
*** src/bin/pg_controldata/pg_controldata.c	24 Sep 2008 08:59:42 -0000	1.41
--- src/bin/pg_controldata/pg_controldata.c	30 Sep 2008 17:15:15 -0000
***************
*** 197,202 ****
--- 197,205 ----
  	printf(_("Minimum recovery ending location:     %X/%X\n"),
  		   ControlFile.minRecoveryPoint.xlogid,
  		   ControlFile.minRecoveryPoint.xrecoff);
+ 	printf(_("Minimum safe starting location:       %X/%X\n"),
+ 		   ControlFile.minSafeStartPoint.xlogid,
+ 		   ControlFile.minSafeStartPoint.xrecoff);
  	printf(_("Maximum data alignment:               %u\n"),
  		   ControlFile.maxAlign);
  	/* we don't print floatFormat since can't say much useful about it */
Index: src/bin/pg_resetxlog/pg_resetxlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_resetxlog/pg_resetxlog.c,v
retrieving revision 1.68
diff -c -r1.68 pg_resetxlog.c
*** src/bin/pg_resetxlog/pg_resetxlog.c	24 Sep 2008 09:00:44 -0000	1.68
--- src/bin/pg_resetxlog/pg_resetxlog.c	30 Sep 2008 17:15:15 -0000
***************
*** 595,600 ****
--- 595,602 ----
  	ControlFile.prevCheckPoint.xrecoff = 0;
  	ControlFile.minRecoveryPoint.xlogid = 0;
  	ControlFile.minRecoveryPoint.xrecoff = 0;
+ 	ControlFile.minSafeStartPoint.xlogid = 0;
+ 	ControlFile.minSafeStartPoint.xrecoff = 0;
  
  	/* Now we can force the recorded xlog seg size to the right thing. */
  	ControlFile.xlog_seg_size = XLogSegSize;
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog.h,v
retrieving revision 1.88
diff -c -r1.88 xlog.h
*** src/include/access/xlog.h	12 May 2008 08:35:05 -0000	1.88
--- src/include/access/xlog.h	30 Sep 2008 17:15:15 -0000
***************
*** 133,139 ****
  } XLogRecData;
  
  extern TimeLineID ThisTimeLineID;		/* current TLI */
! extern bool InRecovery;
  extern XLogRecPtr XactLastRecEnd;
  
  /* these variables are GUC parameters related to XLOG */
--- 133,148 ----
  } XLogRecData;
  
  extern TimeLineID ThisTimeLineID;		/* current TLI */
! 
! /* 
!  * Prior to 8.4, all activity during recovery was carried out by the Startup
!  * process. This local variable continues to be used in many parts of the
!  * code to indicate actions taken by RecoveryManagers. Other processes that
!  * potentially perform work during recovery should check
!  * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
!  */
! extern bool InRecovery;	
! 										
  extern XLogRecPtr XactLastRecEnd;
  
  /* these variables are GUC parameters related to XLOG */
***************
*** 166,171 ****
--- 175,181 ----
  /* These indicate the cause of a checkpoint request */
  #define CHECKPOINT_CAUSE_XLOG	0x0010	/* XLOG consumption */
  #define CHECKPOINT_CAUSE_TIME	0x0020	/* Elapsed time */
+ #define CHECKPOINT_RESTARTPOINT	0x0040	/* Restartpoint during recovery */
  
  /* Checkpoint statistics */
  typedef struct CheckpointStatsData
***************
*** 197,202 ****
--- 207,214 ----
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
  
+ extern bool IsRecoveryProcessingMode(void);
+ 
  extern void UpdateControlFile(void);
  extern Size XLOGShmemSize(void);
  extern void XLOGShmemInit(void);
Index: src/include/access/xlog_internal.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog_internal.h,v
retrieving revision 1.24
diff -c -r1.24 xlog_internal.h
*** src/include/access/xlog_internal.h	11 Aug 2008 11:05:11 -0000	1.24
--- src/include/access/xlog_internal.h	30 Sep 2008 17:15:15 -0000
***************
*** 17,22 ****
--- 17,23 ----
  #define XLOG_INTERNAL_H
  
  #include "access/xlog.h"
+ #include "catalog/pg_control.h"
  #include "fmgr.h"
  #include "pgtime.h"
  #include "storage/block.h"
***************
*** 245,250 ****
--- 246,254 ----
  extern pg_time_t GetLastSegSwitchTime(void);
  extern XLogRecPtr RequestXLogSwitch(void);
  
+ extern void CreateRestartPoint(const XLogRecPtr ReadPtr, 
+ 				const CheckPoint *restartPoint, int flags);
+ 
  /*
   * These aren't in xlog.h because I'd rather not include fmgr.h there.
   */
Index: src/include/catalog/pg_control.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/catalog/pg_control.h,v
retrieving revision 1.42
diff -c -r1.42 pg_control.h
*** src/include/catalog/pg_control.h	23 Sep 2008 09:20:39 -0000	1.42
--- src/include/catalog/pg_control.h	30 Sep 2008 17:15:15 -0000
***************
*** 46,52 ****
  #define XLOG_NOOP						0x20
  #define XLOG_NEXTOID					0x30
  #define XLOG_SWITCH						0x40
! 
  
  /* System status indicator */
  typedef enum DBState
--- 46,52 ----
  #define XLOG_NOOP						0x20
  #define XLOG_NEXTOID					0x30
  #define XLOG_SWITCH						0x40
! #define XLOG_RECOVERY_END			0x50
  
  /* System status indicator */
  typedef enum DBState
***************
*** 102,107 ****
--- 102,108 ----
  	CheckPoint	checkPointCopy; /* copy of last check point record */
  
  	XLogRecPtr	minRecoveryPoint;		/* must replay xlog to here */
+ 	XLogRecPtr	minSafeStartPoint;		/* safe point after recovery crashes */
  
  	/*
  	 * This data is used to check for hardware-architecture compatibility of
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.12
diff -c -r1.12 bgwriter.h
*** src/include/postmaster/bgwriter.h	11 Aug 2008 11:05:11 -0000	1.12
--- src/include/postmaster/bgwriter.h	30 Sep 2008 17:15:15 -0000
***************
*** 12,17 ****
--- 12,18 ----
  #ifndef _BGWRITER_H
  #define _BGWRITER_H
  
+ #include "catalog/pg_control.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
  
***************
*** 25,30 ****
--- 26,36 ----
  extern void BackgroundWriterMain(void);
  
  extern void RequestCheckpoint(int flags);
+ extern void RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter);
+ extern void RequestRestartPointCompletion(void);
+ extern XLogRecPtr GetRedoLocationForArchiveCheckpoint(void);
+ extern void SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo);
+ 
  extern void CheckpointWriteDelay(int flags, double progress);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
Index: src/include/storage/pmsignal.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/pmsignal.h,v
retrieving revision 1.20
diff -c -r1.20 pmsignal.h
*** src/include/storage/pmsignal.h	19 Jun 2008 21:32:56 -0000	1.20
--- src/include/storage/pmsignal.h	30 Sep 2008 17:15:15 -0000
***************
*** 22,27 ****
--- 22,28 ----
   */
  typedef enum
  {
+ 	PMSIGNAL_RECOVERY_START,	/* move to PM_RECOVERY state */
  	PMSIGNAL_PASSWORD_CHANGE,	/* pg_auth file has changed */
  	PMSIGNAL_WAKEN_ARCHIVER,	/* send a NOTIFY signal to xlog archiver */
  	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
Index: src/test/regress/expected/opr_sanity.out
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/test/regress/expected/opr_sanity.out,v
retrieving revision 1.84
diff -c -r1.84 opr_sanity.out
*** src/test/regress/expected/opr_sanity.out	16 Aug 2008 00:01:38 -0000	1.84
--- src/test/regress/expected/opr_sanity.out	30 Sep 2008 17:15:15 -0000
***************
*** 109,117 ****
       p1.proretset != p2.proretset OR
       p1.provolatile != p2.provolatile OR
       p1.pronargs != p2.pronargs);
!  oid | proname | oid | proname 
! -----+---------+-----+---------
! (0 rows)
  
  -- Look for uses of different type OIDs in the argument/result type fields
  -- for different aliases of the same built-in function.
--- 109,118 ----
       p1.proretset != p2.proretset OR
       p1.provolatile != p2.provolatile OR
       p1.pronargs != p2.pronargs);
!  oid  |     proname     | oid  |     proname
! ------+-----------------+------+-----------------
!  2172 | pg_start_backup | 2176 | pg_start_backup
! (1 row)
  
  -- Look for uses of different type OIDs in the argument/result type fields
  -- for different aliases of the same built-in function.

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

over 17 years ago

In reply to: Simon Riggs (#1)

Re: [PATCHES] Infrastructure changes for recovery (v8)

Simon Riggs wrote:

* optional recovery_safe_start_location parameter now provided in
recovery.conf, to allow a consistency point to be manually defined if a
base backup was not taken using standard pg_start/stop backup functions

Do we want to cater for that? It only seems safe if you have
full_page_writes turned on, and you perform a checkpoint first. But if
you do that, why don't you just use pg_start_backup()?

Other Changes
* log_restartpoints removed, use log_checkpoints in postgresql.conf

Is this something that would make sense regardless of the rest of the
patch? If so, we could apply that separately, which would make this
patch a little less overwhelming to review.

* additional function signature for pg_start_backup('label', true |
false) to allow definition of immediate checkpoint/not

Wouldn't this need a new entry in pg_proc.h? Again, perhaps we should do
this as a separate patch.

* fixes bug discovered while other testing: if pg_stop_backup() is run
when xlogswitch has just occurred then we do not switch log files, yet
we return current filename even though nothing of value in it. If
archive_timeout not enabled we would wait forever for pg_stop_backup()
to return.
* Substantial comments throughout

These comments on CheckpointLock seem contradictory:

--- 247,256 ----
* ControlFileLock: must be held to read/update control file or create
* new log file.
*
!  * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
!  * we get just one of those at any time. In 8.4+ recovery, both startup and
!  * bgwriter processes may take restartpoints, so this locking must be strict 
!  * to ensure there are no mistakes.
*
*----------
*/

and

--- 5901,5916 ----
XLogRecPtr	recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
uint32		_logId;
uint32		_logSeg;
TransactionId *inCommitXids;
int			nInCommit;
+ 	bool		leavingArchiveRecovery = false;
/*
* Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! * That shouldn't be happening, but checkpoints are an important aspect
! * of our resilience, so we take no chances.
*/
LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);

If I've understood the patch correctly, only bgwriter does checkpoints
and restart points now?

There's a trivial merge conflict in bgwriter.c, due to the FSM patch.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndQuadrant.com

over 17 years ago

In reply to: Heikki Linnakangas (#2)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Wed, 2008-10-08 at 14:43 +0300, Heikki Linnakangas wrote:

Simon Riggs wrote:

* optional recovery_safe_start_location parameter now provided in
recovery.conf, to allow a consistency point to be manually defined if a
base backup was not taken using standard pg_start/stop backup functions

Do we want to cater for that? It only seems safe if you have
full_page_writes turned on, and you perform a checkpoint first. But if
you do that, why don't you just use pg_start_backup()?

I'm easy on that one. It is a supported backup method, so without this
it would not be possible to utilise Hot Standby in conjunction with this
backup technique. Not many people use it, but I guess some do.

Other Changes
* log_restartpoints removed, use log_checkpoints in postgresql.conf

Is this something that would make sense regardless of the rest of the
patch? If so, we could apply that separately, which would make this
patch a little less overwhelming to review.

Maybe, it's fairly minor.

* additional function signature for pg_start_backup('label', true |
false) to allow definition of immediate checkpoint/not

Wouldn't this need a new entry in pg_proc.h? Again, perhaps we should do
this as a separate patch.

That's concerning. I remember adding the entry and assigning a new oid,
but it isn't in the patch. The multi-argument version was definitely
tested; that's how I discovered the bug that is also fixed in the patch.

* fixes bug discovered while other testing: if pg_stop_backup() is run
when xlogswitch has just occurred then we do not switch log files, yet
we return current filename even though nothing of value in it. If
archive_timeout not enabled we would wait forever for pg_stop_backup()
to return.

OK, I'll strip all of the above out, for separate consideration.

* Substantial comments throughout

These comments on CheckpointLock seem contradictory:

--- 247,256 ----
* ControlFileLock: must be held to read/update control file or create
* new log file.
*
!  * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
!  * we get just one of those at any time. In 8.4+ recovery, both startup and
!  * bgwriter processes may take restartpoints, so this locking must be strict 
!  * to ensure there are no mistakes.
*
*----------
*/

and

--- 5901,5916 ----
XLogRecPtr	recptr;
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogRecData rdata;
uint32		_logId;
uint32		_logSeg;
TransactionId *inCommitXids;
int			nInCommit;
+ 	bool		leavingArchiveRecovery = false;
/*
* Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! * That shouldn't be happening, but checkpoints are an important aspect
! * of our resilience, so we take no chances.
*/
LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
If I've understood the patch correctly, only bgwriter does checkpoints
and restart points now?

Tom requested that we retain the ability for Startup process to perform
restartpoints up until the point that bgwriter spawns, then after that
bgwriter performs them.

The form is this

PM_STARTUP   startup process performs restartpoints
    transition when database is in a consistent state
PM_RECOVERY  bgwriter process performs restartpoints
    delicate transition between two states
PM_RUN       bgwriter process performs checkpoints
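
As a rough illustration of that form, here is a small standalone sketch of which process owns restartpoints/checkpoints in each state. The enum and function names are hypothetical and are not taken from postmaster.c or from the patch; the first state is spelled PM_STARTUP as in the comment the patch adds to xlog.c.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the three postmaster states discussed above. */
typedef enum
{
	SKETCH_PM_STARTUP,		/* before the database is proven consistent */
	SKETCH_PM_RECOVERY,		/* consistent, still replaying WAL */
	SKETCH_PM_RUN			/* recovery finished, normal running */
} SketchPMState;

/* Which process is responsible for restartpoints/checkpoints in a state. */
const char *
sketch_checkpoint_owner(SketchPMState state)
{
	switch (state)
	{
		case SKETCH_PM_STARTUP:
			return "startup";
		case SKETCH_PM_RECOVERY:
		case SKETCH_PM_RUN:
			return "bgwriter";
	}
	assert(!"unknown state");
	return NULL;
}

/* Nonzero if the work done in this state is a restartpoint, not a checkpoint. */
int
sketch_is_restartpoint(SketchPMState state)
{
	return state != SKETCH_PM_RUN;
}
```

The only hand-over of ownership is at the first transition (startup process to bgwriter); the second transition changes what the bgwriter does, not who does it.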

There's a trivial merge conflict in bgwriter.c, due to the FSM patch.

OK, will look.

Thanks for looking.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Simon Riggs

simon@2ndQuadrant.com

over 17 years ago

In reply to: Simon Riggs (#3)

1 attachment(s)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Wed, 2008-10-08 at 13:24 +0100, Simon Riggs wrote:

On Wed, 2008-10-08 at 14:43 +0300, Heikki Linnakangas wrote:

Again, perhaps we should do
this as a separate patch.

New patch enclosed with other stuff stripped out. 15% lighter...

New thread spawned for bug fix patch. Will resubmit others when we can
be sure they'll not cause patch conflicts.

OK, I'll strip all of the above out, for separate consideration.

The form is this

PM_STARTUP   startup process performs restartpoints
    transition when database is in a consistent state
PM_RECOVERY  bgwriter process performs restartpoints
    delicate transition between two states
PM_RUN       bgwriter process performs checkpoints

Above added as comments in patch.

The patch agonises over the two state transitions above. The first
transition needs to be exactly correct, otherwise we might be using an
inconsistent database during recovery. The second transition is harder
because it isn't just the startup process working alone any more.
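
For the first transition, the test in the attached patch is that replay has passed both minRecoveryPoint and minSafeStartPoint. A minimal standalone sketch of that comparison follows. The struct and function names here are hypothetical stand-ins (the patch itself uses XLogRecPtr and XLByteLE), so treat this as an illustration rather than the patch's code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for XLogRecPtr as used in the patch: log file id plus byte offset. */
typedef struct
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} SketchRecPtr;

/* a <= b, comparing xlogid first and then xrecoff */
bool
sketch_lsn_le(SketchRecPtr a, SketchRecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * Replay is past a safe point only when it has passed BOTH the backup's
 * minimum recovery point and the furthest point that a previous,
 * interrupted recovery could have reached.
 */
bool
sketch_reached_safe_start(SketchRecPtr end_rec_ptr,
						  SketchRecPtr min_recovery_point,
						  SketchRecPtr min_safe_start_point)
{
	return sketch_lsn_le(min_recovery_point, end_rec_ptr) &&
		sketch_lsn_le(min_safe_start_point, end_rec_ptr);
}
```

Passing only one of the two points is not enough, which is why a restarted recovery may have to replay further than the base backup alone would require.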

The key to understanding it is all in concurrent behaviour. Startup and
bgwriter chat together through bgwriter shared memory and call functions
back and forth between xlog.c and bgwriter.c.
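
One small piece of that concurrent behaviour is the cheap test each process uses to see whether recovery has ended (IsRecoveryProcessingMode() in the patch). Below is a rough standalone sketch of the pattern: a shared flag that only ever goes from true to false, plus a per-process cache so the shared state is no longer consulted once the end of recovery has been seen. It uses a C11 atomic as a stand-in for the spinlock-protected variable in shared memory, and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the flag kept in shared memory; starts out "in recovery". */
atomic_bool sketch_shared_in_recovery = true;

/* Per-process cache: once we have seen "not in recovery" we stop looking. */
static bool sketch_local_in_recovery = true;

bool
sketch_is_recovery_mode(void)
{
	/* Fast path: recovery never restarts, so a cached "false" is final. */
	if (!sketch_local_in_recovery)
		return false;

	/* Otherwise re-read the shared state and remember what we saw. */
	sketch_local_in_recovery = atomic_load(&sketch_shared_in_recovery);
	return sketch_local_in_recovery;
}

/* Called once, by whichever process ends recovery. */
void
sketch_end_recovery(void)
{
	atomic_store(&sketch_shared_in_recovery, false);
}
```

The design relies on the state change being one-way. If recovery could be re-entered, the cached value would go stale and the fast path would be wrong.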

I haven't retested the patch yet, but it passes make check. I'll be
rechecking it later today, starting in about 2 hours' time.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Attachments:

recovery_infrastruc.v9.patch (text/x-patch; charset=utf-8)

Index: src/backend/access/transam/clog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/clog.c,v
retrieving revision 1.47
diff -c -r1.47 clog.c
*** src/backend/access/transam/clog.c	1 Aug 2008 13:16:08 -0000	1.47
--- src/backend/access/transam/clog.c	8 Oct 2008 12:27:33 -0000
***************
*** 260,265 ****
--- 260,268 ----
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+  *
+  * We access just a single clog page, so this action is atomic and safe
+  * for use if other processes are active during recovery.
   */
  void
  StartupCLOG(void)
Index: src/backend/access/transam/multixact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/multixact.c,v
retrieving revision 1.28
diff -c -r1.28 multixact.c
*** src/backend/access/transam/multixact.c	1 Aug 2008 13:16:08 -0000	1.28
--- src/backend/access/transam/multixact.c	8 Oct 2008 12:27:33 -0000
***************
*** 1413,1420 ****
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.	Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We don't need any locks here, really; the SLRU locks are taken
!  * only because slru.c expects to be called with locks held.
   */
  void
  StartupMultiXact(void)
--- 1413,1423 ----
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.	Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We want this operation to be atomic to ensure that other processes can 
!  * use MultiXact while we complete recovery. We access one page only from the
!  * offset and members buffers, so once locks are acquired they will not be
!  * dropped and re-acquired by SLRU code. So we take both locks at start, then
!  * hold them all the way to the end.
   */
  void
  StartupMultiXact(void)
***************
*** 1426,1431 ****
--- 1429,1435 ----
  
  	/* Clean up offsets state */
  	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
+ 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * Initialize our idea of the latest page number.
***************
*** 1452,1461 ****
  		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
  	}
  
- 	LWLockRelease(MultiXactOffsetControlLock);
- 
  	/* And the same for members */
- 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * Initialize our idea of the latest page number.
--- 1456,1462 ----
***************
*** 1483,1488 ****
--- 1484,1490 ----
  	}
  
  	LWLockRelease(MultiXactMemberControlLock);
+ 	LWLockRelease(MultiXactOffsetControlLock);
  
  	/*
  	 * Initialize lastTruncationPoint to invalid, ensuring that the first
***************
*** 1543,1549 ****
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!InRecovery)
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
--- 1545,1551 ----
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!IsRecoveryProcessingMode())
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
Index: src/backend/access/transam/subtrans.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/subtrans.c,v
retrieving revision 1.23
diff -c -r1.23 subtrans.c
*** src/backend/access/transam/subtrans.c	1 Aug 2008 13:16:08 -0000	1.23
--- src/backend/access/transam/subtrans.c	8 Oct 2008 12:27:33 -0000
***************
*** 226,231 ****
--- 226,234 ----
   *
   * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
   * if there are none.
+  *
+  * Note that this is not atomic and is not yet safe to perform while other
+  * processes might access subtrans.
   */
  void
  StartupSUBTRANS(TransactionId oldestActiveXID)
Index: src/backend/access/transam/xact.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xact.c,v
retrieving revision 1.265
diff -c -r1.265 xact.c
*** src/backend/access/transam/xact.c	11 Aug 2008 11:05:10 -0000	1.265
--- src/backend/access/transam/xact.c	8 Oct 2008 12:27:33 -0000
***************
*** 393,398 ****
--- 393,401 ----
  	bool		isSubXact = (s->parent != NULL);
  	ResourceOwner currentOwner;
  
+ 	if (IsRecoveryProcessingMode())
+ 		elog(FATAL, "cannot assign TransactionIds during recovery");
+ 
  	/* Assert that caller didn't screw up */
  	Assert(!TransactionIdIsValid(s->transactionId));
  	Assert(s->state == TRANS_INPROGRESS);
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.319
diff -c -r1.319 xlog.c
*** src/backend/access/transam/xlog.c	23 Sep 2008 09:20:35 -0000	1.319
--- src/backend/access/transam/xlog.c	8 Oct 2008 13:23:16 -0000
***************
*** 113,119 ****
  
  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system.
   */
  TimeLineID	ThisTimeLineID = 0;
  
--- 113,120 ----
  
  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system. Zero is always a bug, so we 
!  * start with that to allow us to spot any errors.
   */
  TimeLineID	ThisTimeLineID = 0;
  
***************
*** 123,128 ****
--- 124,133 ----
  /* Are we recovering using offline XLOG archives? */
  static bool InArchiveRecovery = false;
  
+ /* Local copy of shared RecoveryProcessingMode state */
+ static bool LocalRecoveryProcessingMode = true;
+ static bool knownProcessingMode = false;
+ 
  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;
  
***************
*** 141,146 ****
--- 146,154 ----
  static TimestampTz recoveryStopTime;
  static bool recoveryStopAfter;
  
+ /* is the database proven consistent yet? */
+ bool	reachedSafeStartPoint = false;
+ 
  /*
   * During normal operation, the only timeline we care about is ThisTimeLineID.
   * During recovery, however, things are more complicated.  To simplify life
***************
*** 240,249 ****
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint (ensures only one
!  * checkpointer at a time; currently, with all checkpoints done by the
!  * bgwriter, this is just pro forma).
!  *
   *----------
   */
  
--- 248,277 ----
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
!  * we get just one of those at any time. In 8.4+ recovery, both startup and
!  * bgwriter processes may take restartpoints, so this locking must be strict 
!  * to ensure there are no mistakes.
!  *
!  * In 8.4 we progress through a number of states at startup. Initially, the
!  * postmaster is in PM_STARTUP state and spawns the Startup process. We then
!  * progress until the database is in a consistent state, then if we are in
!  * InArchiveRecovery we go into PM_RECOVERY state. The bgwriter then starts
!  * up and takes over responsibility for performing restartpoints. We then
!  * progress until the end of recovery when we enter PM_RUN state upon
!  * termination of the Startup process. In summary:
!  * 
!  * PM_STARTUP state:	Startup process performs restartpoints
!  * PM_RECOVERY state:	bgwriter process performs restartpoints
!  * PM_RUN state: 		bgwriter process performs checkpoints
!  *
!  * These transitions are fairly delicate, with many things that need to
!  * happen at the same time in order to change state successfully throughout
!  * the system. Changing PM_STARTUP to PM_RECOVERY only occurs when we can
!  * prove the databases are in a consistent state. Changing from PM_RECOVERY
!  * to PM_RUN happens whenever recovery ends, which could be forced upon us
!  * externally or it can occur because of damage or termination of the WAL
!  * sequence.
   *----------
   */
  
***************
*** 285,295 ****
--- 313,330 ----
  
  /*
   * Total shared-memory state for XLOG.
+  *
+  * This small structure is accessed by many backends, so we take care to
+  * pad out the parts of the structure so they can be accessed by separate
+  * CPUs without causing false sharing cache flushes. Padding is generous
+  * to allow for a wide variety of CPU architectures.
   */
+ #define	XLOGCTL_BUFFER_SPACING	128
  typedef struct XLogCtlData
  {
  	/* Protected by WALInsertLock: */
  	XLogCtlInsert Insert;
+ 	char	InsertPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlInsert)];
  
  	/* Protected by info_lck: */
  	XLogwrtRqst LogwrtRqst;
***************
*** 297,305 ****
--- 332,347 ----
  	uint32		ckptXidEpoch;	/* nextXID & epoch of latest checkpoint */
  	TransactionId ckptXid;
  	XLogRecPtr	asyncCommitLSN; /* LSN of newest async commit */
+ 	/* add data structure padding for above info_lck declarations */
+ 	char	InfoPadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogwrtRqst) 
+ 											- sizeof(XLogwrtResult)
+ 											- sizeof(uint32)
+ 											- sizeof(TransactionId)
+ 											- sizeof(XLogRecPtr)];
  
  	/* Protected by WALWriteLock: */
  	XLogCtlWrite Write;
+ 	char	WritePadding[XLOGCTL_BUFFER_SPACING - sizeof(XLogCtlWrite)];
  
  	/*
  	 * These values do not change after startup, although the pointed-to pages
***************
*** 311,316 ****
--- 353,376 ----
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
  
+ 	/*
+ 	 * IsRecoveryProcessingMode shows whether the postmaster is in a
+ 	 * postmaster state earlier than PM_RUN, or not. This is a globally
+ 	 * accessible state to allow EXEC_BACKEND case.
+ 	 *
+ 	 * We also retain a local state variable InRecovery. InRecovery=true
+ 	 * means the code is being executed by Startup process and therefore
+ 	 * always during Recovery Processing Mode. This allows us to identify
+ 	 * code executed *during* Recovery Processing Mode but not necessarily
+ 	 * by Startup process itself.
+ 	 *
+ 	 * Protected by mode_lck
+ 	 */
+ 	bool		SharedRecoveryProcessingMode;
+ 	slock_t		mode_lck;
+ 
+ 	char		InfoLockPadding[XLOGCTL_BUFFER_SPACING];
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 397,404 ****
--- 457,466 ----
  static void readRecoveryCommandFile(void);
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
+ static void exitRecovery(void);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+ static XLogRecPtr GetRedoLocationForCheckpoint(void);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 480,485 ****
--- 542,552 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		isRecoveryEnd = (rmid == RM_XLOG_ID && info == XLOG_RECOVERY_END);
+ 
+ 	/* cross-check on whether we should be here or not */
+ 	if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+ 		elog(FATAL, "cannot make new WAL entries during recovery");
  
  	/* info's high bits are reserved for use by me */
  	if (info & XLR_INFO_MASK)
***************
*** 1720,1727 ****
  	XLogRecPtr	WriteRqstPtr;
  	XLogwrtRqst WriteRqst;
  
! 	/* Disabled during REDO */
! 	if (InRedo)
  		return;
  
  	/* Quick exit if already known flushed */
--- 1787,1793 ----
  	XLogRecPtr	WriteRqstPtr;
  	XLogwrtRqst WriteRqst;
  
! 	if (IsRecoveryProcessingMode())
  		return;
  
  	/* Quick exit if already known flushed */
***************
*** 1809,1817 ****
  	 * the bad page is encountered again during recovery then we would be
  	 * unable to restart the database at all!  (This scenario has actually
  	 * happened in the field several times with 7.1 releases. Note that we
! 	 * cannot get here while InRedo is true, but if the bad page is brought in
! 	 * and marked dirty during recovery then CreateCheckPoint will try to
! 	 * flush it at the end of recovery.)
  	 *
  	 * The current approach is to ERROR under normal conditions, but only
  	 * WARNING during recovery, so that the system can be brought up even if
--- 1875,1883 ----
  	 * the bad page is encountered again during recovery then we would be
  	 * unable to restart the database at all!  (This scenario has actually
  	 * happened in the field several times with 7.1 releases. Note that we
! 	 * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
! 	 * brought in and marked dirty during recovery then if a checkpoint were
! 	 * performed at the end of recovery it will try to flush it.
  	 *
  	 * The current approach is to ERROR under normal conditions, but only
  	 * WARNING during recovery, so that the system can be brought up even if
***************
*** 1821,1827 ****
  	 * and so we will not force a restart for a bad LSN on a data page.
  	 */
  	if (XLByteLT(LogwrtResult.Flush, record))
! 		elog(InRecovery ? WARNING : ERROR,
  		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
--- 1887,1893 ----
  	 * and so we will not force a restart for a bad LSN on a data page.
  	 */
  	if (XLByteLT(LogwrtResult.Flush, record))
! 		elog(ERROR,
  		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
***************
*** 2094,2100 ****
  		unlink(tmppath);
  	}
  
! 	elog(DEBUG2, "done creating and filling new WAL file");
  
  	/* Set flag to tell caller there was no existent file */
  	*use_existent = false;
--- 2160,2167 ----
  		unlink(tmppath);
  	}
  
! 	XLogFileName(tmppath, ThisTimeLineID, log, seg);
! 	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
  
  	/* Set flag to tell caller there was no existent file */
  	*use_existent = false;
***************
*** 2400,2405 ****
--- 2467,2494 ----
  					 xlogfname);
  			set_ps_display(activitymsg, false);
  
+ 			/* 
+ 			 * Calculate and write out a new safeStartPoint. This defines
+ 			 * the latest LSN that might appear on-disk while we apply
+ 			 * the WAL records in this file. If we crash during recovery
+ 			 * we must reach this point again before we can prove
+ 			 * database consistency. Not a restartpoint! Restart points
+ 			 * define where we should start recovery from, if we crash.
+ 			 */
+ 			if (InArchiveRecovery)
+ 			{
+ 				uint32 nextLog = log;
+ 				uint32 nextSeg = seg;
+ 
+ 				NextLogSeg(nextLog, nextSeg);
+ 
+ 				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 				ControlFile->minSafeStartPoint.xlogid = nextLog;
+ 				ControlFile->minSafeStartPoint.xrecoff = nextSeg * XLogSegSize;
+ 				UpdateControlFile();
+ 				LWLockRelease(ControlFileLock);
+ 			}
+ 
  			return fd;
  		}
  		if (errno != ENOENT)	/* unexpected failure? */
***************
*** 4228,4233 ****
--- 4317,4323 ----
  	XLogCtl->XLogCacheBlck = XLOGbuffers - 1;
  	XLogCtl->Insert.currpage = (XLogPageHeader) (XLogCtl->pages);
  	SpinLockInit(&XLogCtl->info_lck);
+ 	SpinLockInit(&XLogCtl->mode_lck);
  
  	/*
  	 * If we are not in bootstrap mode, pg_control should already exist. Read
***************
*** 4538,4549 ****
  			 * does nothing if a recovery_target is not also set
  			 */
  			if (!parse_bool(tok2, &recoveryLogRestartpoints))
! 				  ereport(ERROR,
! 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 					  errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
  			ereport(LOG,
! 					(errmsg("log_restartpoints = %s", tok2)));
! 		}
  		else
  			ereport(FATAL,
  					(errmsg("unrecognized recovery parameter \"%s\"",
--- 4628,4639 ----
  			 * does nothing if a recovery_target is not also set
  			 */
  			if (!parse_bool(tok2, &recoveryLogRestartpoints))
! 				ereport(ERROR,
! 						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
! 							errmsg("parameter \"log_restartpoints\" requires a Boolean value")));
  			ereport(LOG,
! 				(errmsg("log_restartpoints = %s", tok2)));
!  		}
  		else
  			ereport(FATAL,
  					(errmsg("unrecognized recovery parameter \"%s\"",
***************
*** 4678,4692 ****
  	unlink(recoveryPath);		/* ignore any error */
  
  	/*
! 	 * Rename the config file out of the way, so that we don't accidentally
! 	 * re-enter archive recovery mode in a subsequent crash.
  	 */
- 	unlink(RECOVERY_COMMAND_DONE);
- 	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
- 		ereport(FATAL,
- 				(errcode_for_file_access(),
- 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
- 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
  
  	ereport(LOG,
  			(errmsg("archive recovery complete")));
--- 4768,4780 ----
  	unlink(recoveryPath);		/* ignore any error */
  
  	/*
! 	 * As of 8.4 we no longer rename the recovery.conf file out of the
! 	 * way until after we have performed a full checkpoint. This ensures
! 	 * that any crash between now and the end of the checkpoint does not
! 	 * attempt to restart from a WAL file that is no longer available to us.
! 	 * As soon as we remove recovery.conf we lose our recovery_command and
! 	 * cannot reaccess WAL files from the archive.
  	 */
  
  	ereport(LOG,
  			(errmsg("archive recovery complete")));
***************
*** 4813,4818 ****
--- 4901,4907 ----
  	CheckPoint	checkPoint;
  	bool		wasShutdown;
  	bool		reachedStopPoint = false;
+ 	bool		performedRecovery = false;
  	bool		haveBackupLabel = false;
  	XLogRecPtr	RecPtr,
  				LastRec,
***************
*** 4825,4830 ****
--- 4914,4921 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  
+ 	XLogCtl->SharedRecoveryProcessingMode = true;
+ 
  	/*
  	 * Read control file and check XLOG status looks valid.
  	 *
***************
*** 5038,5046 ****
--- 5129,5143 ----
  		if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
  			ControlFile->minRecoveryPoint = minRecoveryLoc;
  		ControlFile->time = (pg_time_t) time(NULL);
+ 		/* No need to hold ControlFileLock yet, we aren't up far enough */
  		UpdateControlFile();
  
  		/*
+ 		 * Reset pgstat data, because it may be invalid after recovery.
+ 		 */
+ 		pgstat_reset_all();
+ 
+ 		/*
  		 * If there was a backup label file, it's done its job and the info
  		 * has now been propagated into pg_control.  We must get rid of the
  		 * label file so that if we crash during recovery, we'll pick up at
***************
*** 5150,5155 ****
--- 5247,5278 ----
  
  				LastRec = ReadRecPtr;
  
+ 				/*
+ 				 * Have we reached our safe starting point? If so, we can
+ 				 * signal Postmaster to enter consistent recovery mode.
+ 				 *
+ 				 * There are two points in the log we must pass. The first is
+ 				 * the minRecoveryPoint, which is the LSN at the time the
+ 				 * base backup was taken that we are about to roll forward from.
+ 				 * If recovery has ever crashed or was stopped there is
+ 				 * another point also: minSafeStartPoint, which is the
+ 				 * latest LSN that recovery could have reached prior to crash.
+ 				 */
+ 				if (!reachedSafeStartPoint && 
+ 					 XLByteLE(ControlFile->minSafeStartPoint, EndRecPtr) && 
+ 					 XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+ 				{
+ 					reachedSafeStartPoint = true;
+ 					if (InArchiveRecovery)
+ 					{
+ 						ereport(LOG,
+ 							(errmsg("consistent recovery state reached at %X/%X",
+ 								EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+ 						if (IsUnderPostmaster)
+ 							SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+ 					}
+ 				}
+ 
  				record = ReadRecord(NULL, LOG);
  			} while (record != NULL && recoveryContinue);
  
***************
*** 5171,5176 ****
--- 5294,5300 ----
  			/* there are no WAL records following the checkpoint */
  			ereport(LOG,
  					(errmsg("redo is not required")));
+ 			reachedSafeStartPoint = true;
  		}
  	}
  
***************
*** 5184,5192 ****
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
! 	 * dump consistent.
  	 */
! 	if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
  	{
  		if (reachedStopPoint)	/* stopped because of stop request */
  			ereport(FATAL,
--- 5308,5316 ----
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
! 	 * dump consistent and start safely.
  	 */
! 	if (InRecovery && !reachedSafeStartPoint)
  	{
  		if (reachedStopPoint)	/* stopped because of stop request */
  			ereport(FATAL,
***************
*** 5308,5346 ****
  		XLogCheckInvalidPages();
  
  		/*
! 		 * Reset pgstat data, because it may be invalid after recovery.
  		 */
! 		pgstat_reset_all();
  
! 		/*
! 		 * Perform a checkpoint to update all our recovery activity to disk.
! 		 *
! 		 * Note that we write a shutdown checkpoint rather than an on-line
! 		 * one. This is not particularly critical, but since we may be
! 		 * assigning a new TLI, using a shutdown checkpoint allows us to have
! 		 * the rule that TLI only changes in shutdown checkpoints, which
! 		 * allows some extra error checking in xlog_redo.
! 		 */
! 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
  	}
  
- 	/*
- 	 * Preallocate additional log files, if wanted.
- 	 */
- 	PreallocXlogFiles(EndOfLog);
- 
- 	/*
- 	 * Okay, we're officially UP.
- 	 */
- 	InRecovery = false;
- 
- 	ControlFile->state = DB_IN_PRODUCTION;
- 	ControlFile->time = (pg_time_t) time(NULL);
- 	UpdateControlFile();
- 
- 	/* start the archive_timeout timer running */
- 	XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
- 
  	/* initialize shared-memory copy of latest checkpoint XID/epoch */
  	XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
--- 5432,5445 ----
  		XLogCheckInvalidPages();
  
  		/*
! 		 * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
! 		 * a shutdown checkpoint here, but we ask bgwriter to do that now.
  		 */
! 		exitRecovery();
  
! 		performedRecovery = true;
  	}
  
  	/* initialize shared-memory copy of latest checkpoint XID/epoch */
  	XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
***************
*** 5374,5379 ****
--- 5473,5561 ----
  		readRecordBuf = NULL;
  		readRecordBufSize = 0;
  	}
+ 
+ 	/*
+ 	 * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
+ 	 * This could add minutes to the startup time, so we want bgwriter
+ 	 * to perform it. This then frees the Startup process to complete so we can
+ 	 * allow transactions and WAL inserts. We still write a checkpoint, but
+ 	 * it will be an online checkpoint. Online checkpoints have a redo
+ 	 * location that can be prior to the actual checkpoint record. So we want
+ 	 * to derive that redo location *before* we let anybody else write WAL,
+ 	 * otherwise we might miss some WAL records if we crash.
+ 	 */
+ 	if (performedRecovery)
+ 	{
+ 		XLogRecPtr	redo;
+ 
+ 		/* 
+ 		 * We must grab the pointer before anybody writes WAL 
+ 		 */
+ 		redo = GetRedoLocationForCheckpoint();
+ 
+ 		/* 
+ 		 * Tell the bgwriter
+ 		 */
+ 		SetRedoLocationForArchiveCheckpoint(redo);
+ 
+ 		/*
+ 		 * Okay, we can come up now. Allow others to write WAL.
+ 		 */
+ 		XLogCtl->SharedRecoveryProcessingMode = false;
+ 
+ 		/*
+ 		 * Now request checkpoint
+ 		 */
+ 		RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * No recovery, so let's just get on with it.
+ 		 */
+ 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 		ControlFile->state = DB_IN_PRODUCTION;
+ 		ControlFile->time = (pg_time_t) time(NULL);
+ 		UpdateControlFile();
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		/*
+ 		 * Okay, we're officially UP.
+ 		 */
+ 		XLogCtl->SharedRecoveryProcessingMode = false;
+ 	}
+ 
+ 	/* start the archive_timeout timer running */
+ 	XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);
+ 
+ }
+ 
+ /*
+  * IsRecoveryProcessingMode()
+  *
+  * Fast test for whether we're still in recovery or not. We test the shared
+  * state each time only until we leave recovery mode. After that we never
+  * look again, relying upon the settings of our local state variables. This
+  * is designed to avoid the need for a separate initialisation step.
+  */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+ 	if (knownProcessingMode && !LocalRecoveryProcessingMode)
+ 		return false;
+ 
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->mode_lck);
+ 		LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+ 		SpinLockRelease(&xlogctl->mode_lck);
+ 	}
+ 
+ 	knownProcessingMode = true;
+ 
+ 	return LocalRecoveryProcessingMode;
  }
  
  /*
***************
*** 5631,5650 ****
  static void
  LogCheckpointStart(int flags)
  {
! 	elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
! 		 (flags & CHECKPOINT_WAIT) ? " wait" : "",
! 		 (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! 		 (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }
  
  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(void)
  {
  	long		write_secs,
  				sync_secs,
--- 5813,5836 ----
  static void
  LogCheckpointStart(int flags)
  {
! 	if (flags & CHECKPOINT_RESTARTPOINT)
! 		elog(LOG, "restartpoint starting:%s",
! 			 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "");
! 	else
! 		elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! 			 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! 			 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! 			 (flags & CHECKPOINT_FORCE) ? " force" : "",
! 			 (flags & CHECKPOINT_WAIT) ? " wait" : "",
! 			 (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! 			 (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }
  
  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(int flags)
  {
  	long		write_secs,
  				sync_secs,
***************
*** 5667,5683 ****
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
! 	elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! 		 "%d transaction log file(s) added, %d removed, %d recycled; "
! 		 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 		 CheckpointStats.ckpt_bufs_written,
! 		 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 		 CheckpointStats.ckpt_segs_added,
! 		 CheckpointStats.ckpt_segs_removed,
! 		 CheckpointStats.ckpt_segs_recycled,
! 		 write_secs, write_usecs / 1000,
! 		 sync_secs, sync_usecs / 1000,
! 		 total_secs, total_usecs / 1000);
  }
  
  /*
--- 5853,5878 ----
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
! 	if (flags & CHECKPOINT_RESTARTPOINT)
! 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 			 CheckpointStats.ckpt_bufs_written,
! 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 			 write_secs, write_usecs / 1000,
! 			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
! 	else
! 		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! 			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 			 CheckpointStats.ckpt_bufs_written,
! 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 			 CheckpointStats.ckpt_segs_added,
! 			 CheckpointStats.ckpt_segs_removed,
! 			 CheckpointStats.ckpt_segs_recycled,
! 			 write_secs, write_usecs / 1000,
! 			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  }
  
  /*
***************
*** 5702,5718 ****
  	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecData rdata;
- 	uint32		freespace;
  	uint32		_logId;
  	uint32		_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
  
  	/*
  	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! 	 * (This is just pro forma, since in the present system structure there is
! 	 * only one process that is allowed to issue checkpoints at any given
! 	 * time.)
  	 */
  	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
--- 5897,5912 ----
  	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecData rdata;
  	uint32		_logId;
  	uint32		_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	bool		leavingArchiveRecovery = false;
  
  	/*
  	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! 	 * That shouldn't be happening, but checkpoints are an important aspect
! 	 * of our resilience, so we take no chances.
  	 */
  	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
***************
*** 5727,5741 ****
--- 5921,5944 ----
  	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
  
  	/*
+ 	 * Find out if this is the first checkpoint after archive recovery.
+ 	 */
+ 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 	leavingArchiveRecovery = (ControlFile->state == DB_IN_ARCHIVE_RECOVERY);
+ 	LWLockRelease(ControlFileLock);
+ 
+ 	/*
  	 * Use a critical section to force system panic if we have trouble.
  	 */
  	START_CRIT_SECTION();
  
  	if (shutdown)
  	{
+ 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  		ControlFile->state = DB_SHUTDOWNING;
  		ControlFile->time = (pg_time_t) time(NULL);
  		UpdateControlFile();
+ 		LWLockRelease(ControlFileLock);
  	}
  
  	/*
***************
*** 5750,5840 ****
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
  	checkPoint.time = (pg_time_t) time(NULL);
  
! 	/*
! 	 * We must hold WALInsertLock while examining insert state to determine
! 	 * the checkpoint REDO pointer.
! 	 */
! 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 
! 	/*
! 	 * If this isn't a shutdown or forced checkpoint, and we have not inserted
! 	 * any XLOG records since the start of the last checkpoint, skip the
! 	 * checkpoint.	The idea here is to avoid inserting duplicate checkpoints
! 	 * when the system is idle. That wastes log space, and more importantly it
! 	 * exposes us to possible loss of both current and previous checkpoint
! 	 * records if the machine crashes just as we're writing the update.
! 	 * (Perhaps it'd make even more sense to checkpoint only when the previous
! 	 * checkpoint record is in a different xlog page?)
! 	 *
! 	 * We have to make two tests to determine that nothing has happened since
! 	 * the start of the last checkpoint: current insertion point must match
! 	 * the end of the last checkpoint record, and its redo pointer must point
! 	 * to itself.
! 	 */
! 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
  	{
! 		XLogRecPtr	curInsert;
  
! 		INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! 		if (curInsert.xlogid == ControlFile->checkPoint.xlogid &&
! 			curInsert.xrecoff == ControlFile->checkPoint.xrecoff +
! 			MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
! 			ControlFile->checkPoint.xlogid ==
! 			ControlFile->checkPointCopy.redo.xlogid &&
! 			ControlFile->checkPoint.xrecoff ==
! 			ControlFile->checkPointCopy.redo.xrecoff)
  		{
! 			LWLockRelease(WALInsertLock);
! 			LWLockRelease(CheckpointLock);
! 			END_CRIT_SECTION();
! 			return;
! 		}
! 	}
  
! 	/*
! 	 * Compute new REDO record ptr = location of next XLOG record.
! 	 *
! 	 * NB: this is NOT necessarily where the checkpoint record itself will be,
! 	 * since other backends may insert more XLOG records while we're off doing
! 	 * the buffer flush work.  Those XLOG records are logically after the
! 	 * checkpoint, even though physically before it.  Got that?
! 	 */
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
  
! 	/*
! 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock AND the info_lck.
! 	 *
! 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
! 	 * pointing past where it really needs to point.  This is okay; the only
! 	 * consequence is that XLogInsert might back up whole buffers that it
! 	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
! 	 * XLogInserts that happen while we are dumping buffers must assume that
! 	 * their buffer changes are not included in the checkpoint.
! 	 */
! 	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
  	/*
- 	 * Now we can release WAL insert lock, allowing other xacts to proceed
- 	 * while we are flushing disk buffers.
- 	 */
- 	LWLockRelease(WALInsertLock);
- 
- 	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
  	 * to log anything if we decided to skip the checkpoint.
  	 */
--- 5953,6021 ----
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
  	checkPoint.time = (pg_time_t) time(NULL);
  
! 	if (leavingArchiveRecovery)
! 		checkPoint.redo = GetRedoLocationForArchiveCheckpoint();
! 	else
  	{
! 		/*
! 		 * We must hold WALInsertLock while examining insert state to determine
! 		 * the checkpoint REDO pointer.
! 		 */
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
  
! 		/*
! 		 * If this isn't a shutdown or forced checkpoint, and we have not inserted
! 		 * any XLOG records since the start of the last checkpoint, skip the
! 		 * checkpoint.	The idea here is to avoid inserting duplicate checkpoints
! 		 * when the system is idle. That wastes log space, and more importantly it
! 		 * exposes us to possible loss of both current and previous checkpoint
! 		 * records if the machine crashes just as we're writing the update.
! 		 * (Perhaps it'd make even more sense to checkpoint only when the previous
! 		 * checkpoint record is in a different xlog page?)
! 		 *
! 		 * We have to make two tests to determine that nothing has happened since
! 		 * the start of the last checkpoint: current insertion point must match
! 		 * the end of the last checkpoint record, and its redo pointer must point
! 		 * to itself.
! 		 */
! 		if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_FORCE)) == 0)
  		{
! 			XLogRecPtr	curInsert;
  
! 			INSERT_RECPTR(curInsert, Insert, Insert->curridx);
! 			if (curInsert.xlogid == ControlFile->checkPoint.xlogid &&
! 				curInsert.xrecoff == ControlFile->checkPoint.xrecoff +
! 				MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
! 				ControlFile->checkPoint.xlogid ==
! 				ControlFile->checkPointCopy.redo.xlogid &&
! 				ControlFile->checkPoint.xrecoff ==
! 				ControlFile->checkPointCopy.redo.xrecoff)
! 			{
! 				LWLockRelease(WALInsertLock);
! 				LWLockRelease(CheckpointLock);
! 				END_CRIT_SECTION();
! 				return;
! 			}
! 		}
  
! 		/*
! 		 * Compute new REDO record ptr = location of next XLOG record.
! 		 *
! 		 * NB: this is NOT necessarily where the checkpoint record itself will be,
! 		 * since other backends may insert more XLOG records while we're off doing
! 		 * the buffer flush work.  Those XLOG records are logically after the
! 		 * checkpoint, even though physically before it.  Got that?
! 		 */
! 		checkPoint.redo = GetRedoLocationForCheckpoint();
  
! 		/*
! 		 * Now we can release WAL insert lock, allowing other xacts to proceed
! 		 * while we are flushing disk buffers.
! 		 */
! 		LWLockRelease(WALInsertLock);
  	}
  
  	/*
  	 * If enabled, log checkpoint start.  We postpone this until now so as not
  	 * to log anything if we decided to skip the checkpoint.
  	 */
***************
*** 5941,5958 ****
  	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
  
  	/*
! 	 * Update the control file.
  	 */
  	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  	if (shutdown)
  		ControlFile->state = DB_SHUTDOWNED;
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
  	ControlFile->checkPoint = ProcLastRecPtr;
  	ControlFile->checkPointCopy = checkPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
  	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
--- 6122,6164 ----
  	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
  
  	/*
! 	 * Update the control file. In 8.4, this routine becomes the primary
! 	 * point for recording changes of state in the control file at the
! 	 * end of recovery. Postmaster state already shows us being in
! 	 * normal running mode, but it is only after this point that we
! 	 * no longer need to repeat recovery if we crash.  Note that this
! 	 * is executed by bgwriter after the death of Startup process.
  	 */
  	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
  	if (shutdown)
  		ControlFile->state = DB_SHUTDOWNED;
+ 	else
+ 		ControlFile->state = DB_IN_PRODUCTION;
+ 
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
  	ControlFile->checkPoint = ProcLastRecPtr;
  	ControlFile->checkPointCopy = checkPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
+ 
  	LWLockRelease(ControlFileLock);
  
+ 	if (leavingArchiveRecovery)
+ 	{
+ 		/*
+ 		 * Rename the config file out of the way, so that we don't accidentally
+ 		 * re-enter archive recovery mode in a subsequent crash. Prior to
+ 		 * 8.4 this step was performed at end of exitArchiveRecovery().
+ 		 */
+ 		unlink(RECOVERY_COMMAND_DONE);
+ 		if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+ 			ereport(ERROR,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 							RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+ 	}
+ 
  	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
***************
*** 5999,6014 ****
  	 * in subtrans.c).	During recovery, though, we mustn't do this because
  	 * StartupSUBTRANS hasn't been called yet.
  	 */
! 	if (!InRecovery)
! 		TruncateSUBTRANS(GetOldestXmin(true, false));
  
  	/* All real work is done, but log before releasing lock. */
  	if (log_checkpoints)
! 		LogCheckpointEnd();
  
  	LWLockRelease(CheckpointLock);
  }
  
  /*
   * Flush all data in shared memory to disk, and fsync
   *
--- 6205,6264 ----
  	 * in subtrans.c).	During recovery, though, we mustn't do this because
  	 * StartupSUBTRANS hasn't been called yet.
  	 */
! 	TruncateSUBTRANS(GetOldestXmin(true, false));
  
  	/* All real work is done, but log before releasing lock. */
  	if (log_checkpoints)
! 		LogCheckpointEnd(flags);
  
  	LWLockRelease(CheckpointLock);
  }
  
+ /* 
+  * GetRedoLocationForCheckpoint()
+  *
+  * When !IsRecoveryProcessingMode() this must be called while holding
+  * WALInsertLock.
+  */
+ static XLogRecPtr
+ GetRedoLocationForCheckpoint()
+ {
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	uint32			freespace;
+ 	XLogRecPtr		redo;
+ 
+ 	freespace = INSERT_FREESPACE(Insert);
+ 	if (freespace < SizeOfXLogRecord)
+ 	{
+ 		(void) AdvanceXLInsertBuffer(false);
+ 		/* OK to ignore update return flag, since we will do flush anyway */
+ 		freespace = INSERT_FREESPACE(Insert);
+ 	}
+ 	INSERT_RECPTR(redo, Insert, Insert->curridx);
+ 
+ 	/*
+ 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+ 	 * must be done while holding the insert lock AND the info_lck.
+ 	 *
+ 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+ 	 * pointing past where it really needs to point.  This is okay; the only
+ 	 * consequence is that XLogInsert might back up whole buffers that it
+ 	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
+ 	 * XLogInserts that happen while we are dumping buffers must assume that
+ 	 * their buffer changes are not included in the checkpoint.
+ 	 */
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = redo;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 	}
+ 
+ 	return redo;
+ }
+ 
  /*
   * Flush all data in shared memory to disk, and fsync
   *
***************
*** 6073,6101 ****
  			}
  	}
  
  	/*
! 	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);
  
  	/*
! 	 * Update pg_control so that any subsequent crash will restart from this
! 	 * checkpoint.	Note: ReadRecPtr gives the XLOG address of the checkpoint
! 	 * record itself.
  	 */
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
! 	ControlFile->checkPoint = ReadRecPtr;
! 	ControlFile->checkPointCopy = *checkPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
  
  	ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
  			(errmsg("recovery restart point at %X/%X",
! 					checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
  	if (recoveryLastXTime)
  		ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! 				(errmsg("last completed transaction was at log time %s",
! 						timestamptz_to_str(recoveryLastXTime))));
  }
  
  /*
--- 6323,6391 ----
  			}
  	}
  
+ 	RequestRestartPoint(ReadRecPtr, checkPoint, reachedSafeStartPoint);
+ }
+ 
+ /*
+  * As of 8.4, RestartPoints are always created by the bgwriter
+  * once we have reachedSafeStartPoint. We use bgwriter's shared memory
+  * area wherever we call it from, to keep better code structure.
+  */
+ void
+ CreateRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, int flags)
+ {
+ 	if (recoveryLogRestartpoints)
+ 	{
+ 		/*
+ 		 * Prepare to accumulate statistics.
+ 		 */
+ 
+ 		MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+ 		CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ 
+ 		LogCheckpointStart(CHECKPOINT_RESTARTPOINT | flags);
+ 	}
+ 
  	/*
! 	 * Acquire CheckpointLock to ensure only one restartpoint happens at a time.
! 	 * We rely on this lock to ensure that the startup process doesn't exit
! 	 * Recovery while we are half way through a restartpoint.
  	 */
! 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
! 
! 	CheckPointGuts(restartPoint->redo, CHECKPOINT_RESTARTPOINT | flags);
  
  	/*
! 	 * Update pg_control, using current time
  	 */
+ 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
! 	ControlFile->checkPoint = ReadPtr;
! 	ControlFile->checkPointCopy = *restartPoint;
  	ControlFile->time = (pg_time_t) time(NULL);
  	UpdateControlFile();
+ 	LWLockRelease(ControlFileLock);
+ 
+ 	/*
+ 	 * Currently, there is no need to truncate pg_subtrans during recovery.
+ 	 * If we did that, we would need to have called StartupSUBTRANS()
+ 	 * already and then TruncateSUBTRANS() would go here.
+ 	 */
+ 
+ 	/* All real work is done, but log before releasing lock. */
+ 	if (recoveryLogRestartpoints)
+ 		LogCheckpointEnd(CHECKPOINT_RESTARTPOINT);
  
  	ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
  			(errmsg("recovery restart point at %X/%X",
! 					restartPoint->redo.xlogid, restartPoint->redo.xrecoff)));
! 
  	if (recoveryLastXTime)
  		ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! 			(errmsg("last completed transaction was at log time %s",
! 					timestamptz_to_str(recoveryLastXTime))));
! 
! 	LWLockRelease(CheckpointLock);
  }
  
  /*
***************
*** 6160,6166 ****
  }
  
  /*
!  * XLOG resource manager's routines
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 6450,6512 ----
  }
  
  /*
!  * exitRecovery()
!  *
!  * Exit recovery state and write an XLOG_RECOVERY_END record. This is the
!  * only record type that can record a change of timelineID. We assume
!  * caller has already set ThisTimeLineID, if appropriate.
!  */
! static void
! exitRecovery(void)
! {
! 	XLogRecData rdata;
! 
! 	rdata.buffer = InvalidBuffer;
! 	rdata.data = (char *) (&ThisTimeLineID);
! 	rdata.len = sizeof(TimeLineID);
! 	rdata.next = NULL;
! 
! 	/*
! 	 * If a restartpoint is in progress, we will not be able to successfully
! 	 * acquire CheckpointLock. If bgwriter is still in progress then send
! 	 * a second signal to nudge bgwriter to go faster so we can avoid delay.
! 	 * Then wait for lock, so we know the restartpoint has completed. We do
! 	 * this because we don't want to interrupt the restartpoint half way
! 	 * through, which might leave us in a mess and we want to be robust. We're
! 	 * going to checkpoint soon anyway, so it's not wasted effort.
! 	 */
! 	if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
! 		LWLockRelease(CheckpointLock);
! 	else
! 	{
! 		RequestRestartPointCompletion();
! 		ereport(LOG,
! 			(errmsg("startup process waiting for restartpoint to complete")));
! 		LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
! 		LWLockRelease(CheckpointLock);
! 	}	
! 
! 	/*
! 	 * This is the only type of WAL message that can be inserted during
! 	 * recovery. This ensures that we don't allow others to get access
! 	 * until after we have changed state.
! 	 */
! 	(void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
! 
! 	/*
! 	 * We don't XLogFlush() here otherwise we'll end up zeroing the WAL
! 	 * file ourselves. So just let bgwriter's forthcoming checkpoint do
! 	 * that for us.
! 	 */
! 
! 	InRecovery = false;
! }
! 
! /*
!  * XLOG resource manager's routines.
!  *
!  * Definitions of message info are in include/catalog/pg_control.h,
!  * though not all messages relate to control file processing.
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6195,6215 ****
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
  		/*
! 		 * TLI may change in a shutdown checkpoint, but it shouldn't decrease
  		 */
! 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  		{
! 			if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
  				!list_member_int(expectedTLIs,
! 								 (int) checkPoint.ThisTimeLineID))
  				ereport(PANIC,
! 						(errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
! 								checkPoint.ThisTimeLineID, ThisTimeLineID)));
  			/* Following WAL records should be run with new TLI */
! 			ThisTimeLineID = checkPoint.ThisTimeLineID;
  		}
- 
- 		RecoveryRestartPoint(&checkPoint);
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
--- 6541,6578 ----
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
  		/*
! 		 * TLI no longer changes at shutdown checkpoint, since as of 8.4,
! 		 * shutdown checkpoints only occur at shutdown. Much less confusing.
  		 */
! 
! 		RecoveryRestartPoint(&checkPoint);
! 	}
! 	else if (info == XLOG_RECOVERY_END)
! 	{
! 		TimeLineID	tli;
! 
! 		memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
! 
! 		/*
! 		 * TLI may change when recovery ends, but it shouldn't decrease.
! 		 *
! 		 * This is the only WAL record that can tell us to change timelineID
! 		 * while we process WAL records. 
! 		 *
! 		 * We can *choose* to stop recovery at any point, generating a
! 		 * new timelineID which is recorded using this record type.
! 		 */
! 		if (tli != ThisTimeLineID)
  		{
! 			if (tli < ThisTimeLineID ||
  				!list_member_int(expectedTLIs,
! 								 (int) tli))
  				ereport(PANIC,
! 						(errmsg("unexpected timeline ID %u (after %u) at recovery end record",
! 								tli, ThisTimeLineID)));
  			/* Following WAL records should be run with new TLI */
! 			ThisTimeLineID = tli;
  		}
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
***************
*** 6232,6238 ****
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/* TLI should not change in an on-line checkpoint */
  		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  			ereport(PANIC,
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
--- 6595,6601 ----
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/* TLI must not change at a checkpoint */
  		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  			ereport(PANIC,
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
Index: src/backend/postmaster/bgwriter.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/bgwriter.c,v
retrieving revision 1.52
diff -c -r1.52 bgwriter.c
*** src/backend/postmaster/bgwriter.c	30 Sep 2008 10:52:13 -0000	1.52
--- src/backend/postmaster/bgwriter.c	8 Oct 2008 13:05:35 -0000
***************
*** 49,54 ****
--- 49,55 ----
  #include <unistd.h>
  
  #include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
***************
*** 129,134 ****
--- 130,142 ----
  
  	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
  
+ 	/* 
+ 	 * When the Startup process wants bgwriter to perform a restartpoint, it 
+ 	 * sets these fields so that we can update the control file afterwards.
+ 	 */
+ 	XLogRecPtr	ReadPtr;		/* Requested log pointer */
+ 	CheckPoint  restartPoint;	/* restartPoint data for ControlFile */
+ 
  	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
  
  	int			num_requests;	/* current # of requests */
***************
*** 165,171 ****
  
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;
  static double ckpt_cached_elapsed;
  
  static pg_time_t last_checkpoint_time;
--- 173,179 ----
  
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;	/* not used if IsRecoveryProcessingMode */
  static double ckpt_cached_elapsed;
  
  static pg_time_t last_checkpoint_time;
***************
*** 197,202 ****
--- 205,211 ----
  {
  	sigjmp_buf	local_sigjmp_buf;
  	MemoryContext bgwriter_context;
+ 	bool		BgWriterRecoveryMode;
  
  	BgWriterShmem->bgwriter_pid = MyProcPid;
  	am_bg_writer = true;
***************
*** 355,370 ****
  	 */
  	PG_SETMASK(&UnBlockSig);
  
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
- 		bool		do_checkpoint = false;
- 		int			flags = 0;
- 		pg_time_t	now;
- 		int			elapsed_secs;
- 
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
--- 364,380 ----
  	 */
  	PG_SETMASK(&UnBlockSig);
  
+ 	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+ 
+ 	if (BgWriterRecoveryMode)
+ 		elog(DEBUG1, "bgwriter starting during recovery, pid = %d",
+ 			BgWriterShmem->bgwriter_pid);
+ 
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
***************
*** 382,499 ****
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
- 		if (checkpoint_requested)
- 		{
- 			checkpoint_requested = false;
- 			do_checkpoint = true;
- 			BgWriterStats.m_requested_checkpoints++;
- 		}
- 		if (shutdown_requested)
- 		{
- 			/*
- 			 * From here on, elog(ERROR) should end with exit(1), not send
- 			 * control back to the sigsetjmp block above
- 			 */
- 			ExitOnAnyError = true;
- 			/* Close down the database */
- 			ShutdownXLOG(0, 0);
- 			/* Normal exit from the bgwriter is here */
- 			proc_exit(0);		/* done */
- 		}
- 
- 		/*
- 		 * Force a checkpoint if too much time has elapsed since the last one.
- 		 * Note that we count a timed checkpoint in stats only when this
- 		 * occurs without an external request, but we set the CAUSE_TIME flag
- 		 * bit even if there is also an external request.
- 		 */
- 		now = (pg_time_t) time(NULL);
- 		elapsed_secs = now - last_checkpoint_time;
- 		if (elapsed_secs >= CheckPointTimeout)
- 		{
- 			if (!do_checkpoint)
- 				BgWriterStats.m_timed_checkpoints++;
- 			do_checkpoint = true;
- 			flags |= CHECKPOINT_CAUSE_TIME;
- 		}
- 
- 		/*
- 		 * Do a checkpoint if requested, otherwise do one cycle of
- 		 * dirty-buffer writing.
- 		 */
- 		if (do_checkpoint)
- 		{
- 			/* use volatile pointer to prevent code rearrangement */
- 			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
- 
- 			/*
- 			 * Atomically fetch the request flags to figure out what kind of a
- 			 * checkpoint we should perform, and increase the started-counter
- 			 * to acknowledge that we've started a new checkpoint.
- 			 */
- 			SpinLockAcquire(&bgs->ckpt_lck);
- 			flags |= bgs->ckpt_flags;
- 			bgs->ckpt_flags = 0;
- 			bgs->ckpt_started++;
- 			SpinLockRelease(&bgs->ckpt_lck);
  
! 			/*
! 			 * We will warn if (a) too soon since last checkpoint (whatever
! 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 			 * since the last checkpoint start.  Note in particular that this
! 			 * implementation will not generate warnings caused by
! 			 * CheckPointTimeout < CheckPointWarning.
! 			 */
! 			if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! 				elapsed_secs < CheckPointWarning)
! 				ereport(LOG,
! 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 								elapsed_secs),
! 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  
  			/*
! 			 * Initialize bgwriter-private variables used during checkpoint.
  			 */
! 			ckpt_active = true;
! 			ckpt_start_recptr = GetInsertRecPtr();
! 			ckpt_start_time = now;
! 			ckpt_cached_elapsed = 0;
  
  			/*
! 			 * Do the checkpoint.
  			 */
! 			CreateCheckPoint(flags);
  
! 			/*
! 			 * After any checkpoint, close all smgr files.	This is so we
! 			 * won't hang onto smgr references to deleted files indefinitely.
! 			 */
! 			smgrcloseall();
  
! 			/*
! 			 * Indicate checkpoint completion to any waiting backends.
! 			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			bgs->ckpt_done = bgs->ckpt_started;
! 			SpinLockRelease(&bgs->ckpt_lck);
! 
! 			ckpt_active = false;
! 
! 			/*
! 			 * Note we record the checkpoint start time not end time as
! 			 * last_checkpoint_time.  This is so that time-driven checkpoints
! 			 * happen at a predictable spacing.
! 			 */
! 			last_checkpoint_time = now;
  		}
- 		else
- 			BgBufferSync();
- 
- 		/* Check for archive_timeout and switch xlog files if necessary. */
- 		CheckArchiveTimeout();
- 
- 		/* Nap for the configured time. */
- 		BgWriterNap();
  	}
  }
  
--- 392,595 ----
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
  
!  		if (BgWriterRecoveryMode)
!   		{
!  			if (shutdown_requested)
!  			{
!  				/*
!  				 * From here on, elog(ERROR) should end with exit(1), not send
!  				 * control back to the sigsetjmp block above
!  				 */
!  				ExitOnAnyError = true;
!  				/* Normal exit from the bgwriter is here */
!  				proc_exit(0);		/* done */
!  			}
!  
!  			if (!IsRecoveryProcessingMode())
!  			{
!  				elog(DEBUG2, "bgwriter changing from recovery to normal mode");
!  
!  				InitXLOGAccess();
!  				BgWriterRecoveryMode = false;
!  
!  				/*
!  				 * Start time-driven events from now
!  				 */
!  				last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
!  
!  				/* 
!  				 * Notice that we do *not* act on a checkpoint_requested
!  				 * state at this point. We have changed mode, so we wish to
!  				 * perform a checkpoint not a restartpoint.
!  				 */
!  				continue;
!  			}
!  
!  			if (checkpoint_requested)
!  			{
!  				XLogRecPtr		ReadPtr;
!  				CheckPoint		restartPoint;
!  
!  				checkpoint_requested = false;
!  
!  				/*
!  				 * Initialize bgwriter-private variables used during checkpoint.
!  				 */
!  				ckpt_active = true;
!  				ckpt_start_time = (pg_time_t) time(NULL);
!  				ckpt_cached_elapsed = 0;
!  
!  				/*
!  				 * Get the requested values from shared memory that the 
!  				 * Startup process has put there for us.
!  				 */
!  				SpinLockAcquire(&BgWriterShmem->ckpt_lck);
!  				ReadPtr = BgWriterShmem->ReadPtr;
!  				memcpy(&restartPoint, &BgWriterShmem->restartPoint, sizeof(CheckPoint));
!  				SpinLockRelease(&BgWriterShmem->ckpt_lck);
!  
!  				/* Use smoothed writes, until interrupted if ever */
!  				CreateRestartPoint(ReadPtr, &restartPoint, 0);
!  
!  				/*
!  				 * After any checkpoint, close all smgr files.	This is so we
!  				 * won't hang onto smgr references to deleted files indefinitely.
!  				 */
!  				smgrcloseall();
!  
!  				ckpt_active = false;
!  				checkpoint_requested = false;
!  			}
!  			else
!  			{
!  				/* Clean buffers dirtied by recovery */
!  				BgBufferSync();
!  
!  				/* Nap for the configured time. */
!  				BgWriterNap();
!  			}
!   		}
! 		else	/* Normal processing */
!   		{
! 			bool		do_checkpoint = false;
! 			int			flags = 0;
! 			pg_time_t	now;
! 			int			elapsed_secs;
! 
! 			if (checkpoint_requested)
! 			{
! 				checkpoint_requested = false;
! 				do_checkpoint = true;
! 				BgWriterStats.m_requested_checkpoints++;
! 			}
! 			if (shutdown_requested)
! 			{
! 				/*
! 				 * From here on, elog(ERROR) should end with exit(1), not send
! 				 * control back to the sigsetjmp block above
! 				 */
! 				ExitOnAnyError = true;
! 				/* Close down the database */
! 				ShutdownXLOG(0, 0);
! 				/* Normal exit from the bgwriter is here */
! 				proc_exit(0);		/* done */
! 			}
  
  			/*
! 			 * Force a checkpoint if too much time has elapsed since the last one.
! 			 * Note that we count a timed checkpoint in stats only when this
! 			 * occurs without an external request, but we set the CAUSE_TIME flag
! 			 * bit even if there is also an external request.
  			 */
! 			now = (pg_time_t) time(NULL);
! 			elapsed_secs = now - last_checkpoint_time;
! 			if (elapsed_secs >= CheckPointTimeout)
! 			{
! 				if (!do_checkpoint)
! 					BgWriterStats.m_timed_checkpoints++;
! 				do_checkpoint = true;
! 				flags |= CHECKPOINT_CAUSE_TIME;
! 			}
  
  			/*
! 			 * Do a checkpoint if requested, otherwise do one cycle of
! 			 * dirty-buffer writing.
  			 */
! 			if (do_checkpoint)
! 			{
! 				/* use volatile pointer to prevent code rearrangement */
! 				volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 
! 				/*
! 				 * Atomically fetch the request flags to figure out what kind of a
! 				 * checkpoint we should perform, and increase the started-counter
! 				 * to acknowledge that we've started a new checkpoint.
! 				 */
! 				SpinLockAcquire(&bgs->ckpt_lck);
! 				flags |= bgs->ckpt_flags;
! 				bgs->ckpt_flags = 0;
! 				bgs->ckpt_started++;
! 				SpinLockRelease(&bgs->ckpt_lck);
! 
! 				/*
! 				 * We will warn if (a) too soon since last checkpoint (whatever
! 				 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 				 * since the last checkpoint start.  Note in particular that this
! 				 * implementation will not generate warnings caused by
! 				 * CheckPointTimeout < CheckPointWarning.
! 				 */
! 				if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! 					elapsed_secs < CheckPointWarning)
! 					ereport(LOG,
! 							(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 									elapsed_secs),
! 							 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! 
! 				/*
! 				 * Initialize bgwriter-private variables used during checkpoint.
! 				 */
! 				ckpt_active = true;
! 				ckpt_start_recptr = GetInsertRecPtr();
! 				ckpt_start_time = now;
! 				ckpt_cached_elapsed = 0;
! 
! 				/*
! 				 * Do the checkpoint.
! 				 */
! 				CreateCheckPoint(flags);
! 
! 				/*
! 				 * After any checkpoint, close all smgr files.	This is so we
! 				 * won't hang onto smgr references to deleted files indefinitely.
! 				 */
! 				smgrcloseall();
! 
! 				/*
! 				 * Indicate checkpoint completion to any waiting backends.
! 				 */
! 				SpinLockAcquire(&bgs->ckpt_lck);
! 				bgs->ckpt_done = bgs->ckpt_started;
! 				SpinLockRelease(&bgs->ckpt_lck);
! 
! 				ckpt_active = false;
! 
! 				/*
! 				 * Note we record the checkpoint start time not end time as
! 				 * last_checkpoint_time.  This is so that time-driven checkpoints
! 				 * happen at a predictable spacing.
! 				 */
! 				last_checkpoint_time = now;
! 			}
! 			else
! 				BgBufferSync();
  
! 			/* Check for archive_timeout and switch xlog files if necessary. */
! 			CheckArchiveTimeout();
  
! 			/* Nap for the configured time. */
! 			BgWriterNap();
  		}
  	}
  }
  
***************
*** 586,592 ****
  		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
! 		AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
--- 682,689 ----
  		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
! 		if (!IsRecoveryProcessingMode())
! 			AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
***************
*** 640,645 ****
--- 737,755 ----
  	if (!am_bg_writer)
  		return;
  
+ 	/* Perform minimal duties during recovery and skip wait if requested */
+ 	if (IsRecoveryProcessingMode())
+ 	{
+ 		BgBufferSync();
+ 
+ 		if (!shutdown_requested &&
+ 			!checkpoint_requested &&
+ 			IsCheckpointOnSchedule(progress))
+ 			BgWriterNap();
+ 
+ 		return;
+ 	}
+ 
  	/*
  	 * Perform the usual bgwriter duties and take a nap, unless we're behind
  	 * schedule, in which case we just try to catch up as quickly as possible.
***************
*** 714,729 ****
  	 * However, it's good enough for our purposes, we're only calculating an
  	 * estimate anyway.
  	 */
! 	recptr = GetInsertRecPtr();
! 	elapsed_xlogs =
! 		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! 		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! 		CheckPointSegments;
! 
! 	if (progress < elapsed_xlogs)
  	{
! 		ckpt_cached_elapsed = elapsed_xlogs;
! 		return false;
  	}
  
  	/*
--- 824,842 ----
  	 * However, it's good enough for our purposes, we're only calculating an
  	 * estimate anyway.
  	 */
! 	if (!IsRecoveryProcessingMode())
  	{
! 		recptr = GetInsertRecPtr();
! 		elapsed_xlogs =
! 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! 			CheckPointSegments;
! 
! 		if (progress < elapsed_xlogs)
! 		{
! 			ckpt_cached_elapsed = elapsed_xlogs;
! 			return false;
! 		}
  	}
  
  	/*
***************
*** 965,970 ****
--- 1078,1154 ----
  }
  
  /*
+  * Always runs in Startup process (see xlog.c)
+  */
+ void
+ RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter)
+ {
+ 	/*
+ 	 * Should we just do it ourselves?
+ 	 */
+ 	if (!IsPostmasterEnvironment || !sendToBGWriter)
+ 	{
+ 		CreateRestartPoint(ReadPtr, restartPoint, CHECKPOINT_IMMEDIATE);
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * Push requested values into shared memory, then signal to request restartpoint.
+ 	 */
+ 	if (BgWriterShmem->bgwriter_pid == 0)
+ 		elog(LOG, "could not request restartpoint because bgwriter not running");
+ 
+ #ifdef NOT_USED
+ 	elog(LOG, "tli = %u nextXidEpoch = %u nextXid = %u nextOid = %u",
+ 		restartPoint->ThisTimeLineID,
+ 		restartPoint->nextXidEpoch,
+ 		restartPoint->nextXid,
+ 		restartPoint->nextOid);
+ #endif
+ 
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	BgWriterShmem->ReadPtr = ReadPtr;
+ 	memcpy(&BgWriterShmem->restartPoint, restartPoint, sizeof(CheckPoint));
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ 		elog(LOG, "could not signal for restartpoint: %m");	
+ }
+ 
+ /* 
+  * Sends another checkpoint request signal to bgwriter, which causes it
+  * to avoid smoothed writes and continue processing as if it had been
+  * called with CHECKPOINT_IMMEDIATE. This is used at the end of recovery.
+  */
+ void
+ RequestRestartPointCompletion(void)
+ {
+ 	if (BgWriterShmem->bgwriter_pid != 0 &&
+ 		kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ 		elog(LOG, "could not signal for restartpoint immediate: %m");
+ }
+ 
+ XLogRecPtr
+ GetRedoLocationForArchiveCheckpoint(void)
+ {
+ 	XLogRecPtr	redo;
+ 
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	redo = BgWriterShmem->ReadPtr;
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	return redo;
+ }
+ 
+ void
+ SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo)
+ {
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	BgWriterShmem->ReadPtr = redo;
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ }
+ 
+ /*
   * ForwardFsyncRequest
   *		Forward a file-fsync request from a backend to the bgwriter
   *
Index: src/backend/postmaster/postmaster.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/postmaster/postmaster.c,v
retrieving revision 1.565
diff -c -r1.565 postmaster.c
*** src/backend/postmaster/postmaster.c	23 Sep 2008 20:35:38 -0000	1.565
--- src/backend/postmaster/postmaster.c	8 Oct 2008 12:27:33 -0000
***************
*** 254,259 ****
--- 254,264 ----
  {
  	PM_INIT,					/* postmaster starting */
  	PM_STARTUP,					/* waiting for startup subprocess */
+ 	PM_RECOVERY,				/* consistent recovery mode; state only
+ 								 * entered for archive and streaming recovery,
+ 								 * and only after the point where
+ 								 * all data is in a consistent state.
+ 								 */
  	PM_RUN,						/* normal "database is alive" state */
  	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
  	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
***************
*** 1302,1308 ****
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 && pmState == PM_RUN)
  			BgWriterPID = StartBackgroundWriter();
  
  		/*
--- 1307,1313 ----
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
  			BgWriterPID = StartBackgroundWriter();
  
  		/*
***************
*** 2116,2122 ****
  		if (pid == StartupPID)
  		{
  			StartupPID = 0;
! 			Assert(pmState == PM_STARTUP);
  
  			/* FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
--- 2121,2127 ----
  		if (pid == StartupPID)
  		{
  			StartupPID = 0;
! 			Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY);
  
  			/* FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
***************
*** 2157,2167 ****
  			load_role();
  
  			/*
! 			 * Crank up the background writer.	It doesn't matter if this
! 			 * fails, we'll just try again later.
  			 */
! 			Assert(BgWriterPID == 0);
! 			BgWriterPID = StartBackgroundWriter();
  
  			/*
  			 * Likewise, start other special children as needed.  In a restart
--- 2162,2172 ----
  			load_role();
  
  			/*
! 			 * Check whether we need to start background writer, if not
! 			 * already running.
  			 */
! 			if (BgWriterPID == 0)
! 				BgWriterPID = StartBackgroundWriter();
  
  			/*
  			 * Likewise, start other special children as needed.  In a restart
***************
*** 3845,3850 ****
--- 3850,3900 ----
  
  	PG_SETMASK(&BlockSig);
  
+ 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+ 	{
+ 		Assert(pmState == PM_STARTUP);
+ 
+ 		/*
+ 		 * Go to shutdown mode if a shutdown request was pending.
+ 		 */
+ 		if (Shutdown > NoShutdown)
+ 		{
+ 			pmState = PM_WAIT_BACKENDS;
+ 			/* PostmasterStateMachine logic does the rest */
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * Startup process has entered recovery
+ 			 */
+ 			pmState = PM_RECOVERY;
+ 
+ 			/*
+ 			 * Load the flat authorization file into postmaster's cache. The
+ 			 * startup process won't have recomputed this from the database yet,
+ 			 * so it may change following recovery.
+ 			 */
+ 			load_role();
+ 
+ 			/*
+ 			 * Crank up the background writer.	It doesn't matter if this
+ 			 * fails, we'll just try again later.
+ 			 */
+ 			Assert(BgWriterPID == 0);
+ 			BgWriterPID = StartBackgroundWriter();
+ 
+ 			/*
+ 			 * Likewise, start other special children as needed.
+ 			 */
+ 			Assert(PgStatPID == 0);
+ 			PgStatPID = pgstat_start();
+ 
+ 			/* XXX at this point we could accept read-only connections */
+ 			ereport(DEBUG1,
+ 				 (errmsg("database system is in consistent recovery mode")));
+ 		}
+ 	}
+ 
  	if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
  	{
  		/*
Index: src/backend/storage/buffer/README
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/backend/storage/buffer/README,v
retrieving revision 1.14
diff -c -r1.14 README
*** src/backend/storage/buffer/README	21 Mar 2008 13:23:28 -0000	1.14
--- src/backend/storage/buffer/README	8 Oct 2008 12:27:33 -0000
***************
*** 264,266 ****
--- 264,275 ----
  This ensures that the page image transferred to disk is reasonably consistent.
  We might miss a hint-bit update or two but that isn't a problem, for the same
  reasons mentioned under buffer access rules.
+ 
+ As of 8.4, background writer starts during recovery mode when there is
+ some form of potentially extended recovery to perform. It performs an
+ identical service to normal processing, except that checkpoints it
+ writes are technically restartpoints. Flushing outstanding WAL for dirty
+ buffers is also skipped, though there shouldn't ever be new WAL entries
+ at that time in any case. We could choose to start background writer
+ immediately but we hold off until we can prove the database is in a 
+ consistent state so that postmaster has a single, clean state change.
Index: src/bin/pg_controldata/pg_controldata.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_controldata/pg_controldata.c,v
retrieving revision 1.41
diff -c -r1.41 pg_controldata.c
*** src/bin/pg_controldata/pg_controldata.c	24 Sep 2008 08:59:42 -0000	1.41
--- src/bin/pg_controldata/pg_controldata.c	8 Oct 2008 12:27:33 -0000
***************
*** 197,202 ****
--- 197,205 ----
  	printf(_("Minimum recovery ending location:     %X/%X\n"),
  		   ControlFile.minRecoveryPoint.xlogid,
  		   ControlFile.minRecoveryPoint.xrecoff);
+ 	printf(_("Minimum safe starting location:       %X/%X\n"),
+ 		   ControlFile.minSafeStartPoint.xlogid,
+ 		   ControlFile.minSafeStartPoint.xrecoff);
  	printf(_("Maximum data alignment:               %u\n"),
  		   ControlFile.maxAlign);
  	/* we don't print floatFormat since can't say much useful about it */
Index: src/bin/pg_resetxlog/pg_resetxlog.c
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/bin/pg_resetxlog/pg_resetxlog.c,v
retrieving revision 1.68
diff -c -r1.68 pg_resetxlog.c
*** src/bin/pg_resetxlog/pg_resetxlog.c	24 Sep 2008 09:00:44 -0000	1.68
--- src/bin/pg_resetxlog/pg_resetxlog.c	8 Oct 2008 12:27:33 -0000
***************
*** 595,600 ****
--- 595,602 ----
  	ControlFile.prevCheckPoint.xrecoff = 0;
  	ControlFile.minRecoveryPoint.xlogid = 0;
  	ControlFile.minRecoveryPoint.xrecoff = 0;
+ 	ControlFile.minSafeStartPoint.xlogid = 0;
+ 	ControlFile.minSafeStartPoint.xrecoff = 0;
  
  	/* Now we can force the recorded xlog seg size to the right thing. */
  	ControlFile.xlog_seg_size = XLogSegSize;
Index: src/include/access/xlog.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog.h,v
retrieving revision 1.88
diff -c -r1.88 xlog.h
*** src/include/access/xlog.h	12 May 2008 08:35:05 -0000	1.88
--- src/include/access/xlog.h	8 Oct 2008 12:27:33 -0000
***************
*** 133,139 ****
  } XLogRecData;
  
  extern TimeLineID ThisTimeLineID;		/* current TLI */
! extern bool InRecovery;
  extern XLogRecPtr XactLastRecEnd;
  
  /* these variables are GUC parameters related to XLOG */
--- 133,148 ----
  } XLogRecData;
  
  extern TimeLineID ThisTimeLineID;		/* current TLI */
! 
! /* 
!  * Prior to 8.4, all activity during recovery was carried out by Startup
!  * process. This local variable continues to be used in many parts of the
!  * code to indicate actions taken by RecoveryManagers. Other processes that
!  * potentially perform work during recovery should check
!  * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
!  */
! extern bool InRecovery;	
! 										
  extern XLogRecPtr XactLastRecEnd;
  
  /* these variables are GUC parameters related to XLOG */
***************
*** 166,171 ****
--- 175,181 ----
  /* These indicate the cause of a checkpoint request */
  #define CHECKPOINT_CAUSE_XLOG	0x0010	/* XLOG consumption */
  #define CHECKPOINT_CAUSE_TIME	0x0020	/* Elapsed time */
+ #define CHECKPOINT_RESTARTPOINT	0x0040	/* Restartpoint during recovery */
  
  /* Checkpoint statistics */
  typedef struct CheckpointStatsData
***************
*** 197,202 ****
--- 207,214 ----
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
  
+ extern bool IsRecoveryProcessingMode(void);
+ 
  extern void UpdateControlFile(void);
  extern Size XLOGShmemSize(void);
  extern void XLOGShmemInit(void);
Index: src/include/access/xlog_internal.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/access/xlog_internal.h,v
retrieving revision 1.24
diff -c -r1.24 xlog_internal.h
*** src/include/access/xlog_internal.h	11 Aug 2008 11:05:11 -0000	1.24
--- src/include/access/xlog_internal.h	8 Oct 2008 12:27:33 -0000
***************
*** 17,22 ****
--- 17,23 ----
  #define XLOG_INTERNAL_H
  
  #include "access/xlog.h"
+ #include "catalog/pg_control.h"
  #include "fmgr.h"
  #include "pgtime.h"
  #include "storage/block.h"
***************
*** 245,250 ****
--- 246,254 ----
  extern pg_time_t GetLastSegSwitchTime(void);
  extern XLogRecPtr RequestXLogSwitch(void);
  
+ extern void CreateRestartPoint(const XLogRecPtr ReadPtr, 
+ 				const CheckPoint *restartPoint, int flags);
+ 
  /*
   * These aren't in xlog.h because I'd rather not include fmgr.h there.
   */
Index: src/include/catalog/pg_control.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/catalog/pg_control.h,v
retrieving revision 1.42
diff -c -r1.42 pg_control.h
*** src/include/catalog/pg_control.h	23 Sep 2008 09:20:39 -0000	1.42
--- src/include/catalog/pg_control.h	8 Oct 2008 12:27:33 -0000
***************
*** 46,52 ****
  #define XLOG_NOOP						0x20
  #define XLOG_NEXTOID					0x30
  #define XLOG_SWITCH						0x40
! 
  
  /* System status indicator */
  typedef enum DBState
--- 46,52 ----
  #define XLOG_NOOP						0x20
  #define XLOG_NEXTOID					0x30
  #define XLOG_SWITCH						0x40
! #define XLOG_RECOVERY_END			0x50
  
  /* System status indicator */
  typedef enum DBState
***************
*** 102,107 ****
--- 102,108 ----
  	CheckPoint	checkPointCopy; /* copy of last check point record */
  
  	XLogRecPtr	minRecoveryPoint;		/* must replay xlog to here */
+ 	XLogRecPtr	minSafeStartPoint;		/* safe point after recovery crashes */
  
  	/*
  	 * This data is used to check for hardware-architecture compatibility of
Index: src/include/postmaster/bgwriter.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/postmaster/bgwriter.h,v
retrieving revision 1.12
diff -c -r1.12 bgwriter.h
*** src/include/postmaster/bgwriter.h	11 Aug 2008 11:05:11 -0000	1.12
--- src/include/postmaster/bgwriter.h	8 Oct 2008 12:27:33 -0000
***************
*** 12,17 ****
--- 12,18 ----
  #ifndef _BGWRITER_H
  #define _BGWRITER_H
  
+ #include "catalog/pg_control.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
  
***************
*** 25,30 ****
--- 26,36 ----
  extern void BackgroundWriterMain(void);
  
  extern void RequestCheckpoint(int flags);
+ extern void RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter);
+ extern void RequestRestartPointCompletion(void);
+ extern XLogRecPtr GetRedoLocationForArchiveCheckpoint(void);
+ extern void SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo);
+ 
  extern void CheckpointWriteDelay(int flags, double progress);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
Index: src/include/storage/pmsignal.h
===================================================================
RCS file: /home/sriggs/pg/REPOSITORY/pgsql/src/include/storage/pmsignal.h,v
retrieving revision 1.20
diff -c -r1.20 pmsignal.h
*** src/include/storage/pmsignal.h	19 Jun 2008 21:32:56 -0000	1.20
--- src/include/storage/pmsignal.h	8 Oct 2008 12:27:33 -0000
***************
*** 22,27 ****
--- 22,28 ----
   */
  typedef enum
  {
+ 	PMSIGNAL_RECOVERY_START,	/* move to PM_RECOVERY state */
  	PMSIGNAL_PASSWORD_CHANGE,	/* pg_auth file has changed */
  	PMSIGNAL_WAKEN_ARCHIVER,	/* send a NOTIFY signal to xlog archiver */
  	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */

Jeff Davis

pgsql@j-davis.com

about 17 years ago

In reply to: Simon Riggs (#1)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Tue, 2008-09-30 at 23:52 +0100, Simon Riggs wrote:

* optional recovery_safe_start_location parameter now provided in
recovery.conf, to allow a consistency point to be manually defined if a
base backup was not taken using standard pg_start/stop backup functions

If using synchronous replication, it seems like this may be useful. For
instance, if the primary server A fails (let's assume power off
failure), then you make the secondary server B the new primary and start
committing transactions, and then you want to bring A back up as a
secondary to B.

Will server A know where to start recovering from, even if many
checkpoints have happened on server B in the meantime? Is there a way to
avoid wiping A and making a new base backup?

Are the safety issues that Heikki brought up potentially solvable, or am
I asking for the impossible?

And also, what if server A is shut down cleanly? Is there any way at all
to get it into recovery mode to catch up with B, or would it require a
new base backup?

I haven't read through the entire thread, so I apologize if this
question has been answered elsewhere.

Regards,
Jeff Davis

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Jeff Davis (#5)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Fri, 2008-11-07 at 15:44 -0800, Jeff Davis wrote:

Is there a way to avoid wiping A and making a new base backup?

rsync

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 17 years ago

In reply to: Simon Riggs (#4)

Re: [PATCHES] Infrastructure changes for recovery (v8)

Simon Riggs wrote:

diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/
index 063b366..5e64cb4 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -226,6 +226,9 @@ ZeroSUBTRANSPage(int pageno)
*
* oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
* if there are none.
+ *
+ * Note that this is not atomic and is not yet safe to perform while other
+ * processes might access subtrans.
*/
void
StartupSUBTRANS(TransactionId oldestActiveXID)

I'm a bit confused by that comment. Does that need to be fixed? It
sounds like it does, because other processes might access subtrans when
StartupSUBTRANS is called, with the patch to allow read-only queries
during recovery. Or is that done in the hot standby patch?

However, I don't see why that isn't safe. StartupSUBTRANS takes the
SubtransControlLock in exclusive mode while it zeroes out subtrans.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 17 years ago

In reply to: Simon Riggs (#4)

Re: [PATCHES] Infrastructure changes for recovery (v8)

Simon Riggs wrote:

@@ -3845,6 +3850,52 @@ sigusr1_handler(SIGNAL_ARGS)

PG_SETMASK(&BlockSig);

+       if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+       {
+               Assert(pmState == PM_STARTUP);
+
+               /*
+                * Go to shutdown mode if a shutdown request was pending.
+                */
+               if (Shutdown > NoShutdown)
+               {
+                       pmState = PM_WAIT_BACKENDS;
+                       /* PostmasterStateMachine logic does the rest */
+               }
+               else
+               {
+                       /*
+                        * Startup process has entered recovery
+                        */
+                       pmState = PM_RECOVERY;

Hmm, I smell a race condition here:

1. Startup process goes into consistent state, and signals postmaster
2. Startup process finishes WAL replay and dies
3. Postmaster wakes up in reaper(), noting that the startup process
dies, and goes into PM_RUN mode.
4. The above signal handler for postmaster is run, causing an assertion
failure, or putting postmaster back into PM_RECOVERY mode if assertions
are disabled.

Highly unlikely in practice, given how much code needs to run in the
startup process between signaling the postmaster and exiting, but it
seems theoretically possible. Do we care, and if we do, how can we fix it?

+
+                       /*
+                        * Load the flat authorization file into postmaster's ca
+                        * startup process won't have recomputed this from the d
+                        * yet, so it may change following recovery.
+                        */
+                       load_role();

Is there a race condition here too, if the startup process is writing
the auth file at the same time? I guess we'd have the same problem with
flat files in general, so maybe there's something else that mitigates
the problem?

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 17 years ago

In reply to: Simon Riggs (#4)

Re: [PATCHES] Infrastructure changes for recovery (v8)

This comment in XLogFlush is no longer accurate:

* The current approach is to ERROR under normal conditions, but only
* WARNING during recovery, so that the system can be brought up even if
* there's a corrupt LSN. Note that for calls from xact.c, the ERROR will
* be promoted to PANIC since xact.c calls this routine inside a critical
* section. However, calls from bufmgr.c are not within critical sections
* and so we will not force a restart for a bad LSN on a data page.
*/
if (XLByteLT(LogwrtResult.Flush, record))
elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);

Because of this hunk:

***************
*** 1822,1828 **** XLogFlush(XLogRecPtr record)
* and so we will not force a restart for a bad LSN on a data page.
*/
if (XLByteLT(LogwrtResult.Flush, record))
!               elog(InRecovery ? WARNING : ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
--- 1874,1880 ----
* and so we will not force a restart for a bad LSN on a data page.
*/
if (XLByteLT(LogwrtResult.Flush, record))
!               elog(ERROR,
"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
record.xlogid, record.xrecoff,
LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);

I'm not sure what the most robust behavior would be.
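
To make the two candidate behaviors concrete, here is a minimal sketch. The `Severity` enum and the function names are hypothetical stand-ins for the elog levels, not PostgreSQL code; the sketch only contrasts the level chosen before and after the hunk above, and does not settle which one is more robust.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the elog severity levels involved. */
typedef enum { SEV_WARNING, SEV_ERROR } Severity;

/*
 * Behavior the comment describes: downgrade to a warning while in
 * recovery, so the system can still be brought up with a corrupt LSN.
 */
Severity
flush_level_before(bool in_recovery)
{
	return in_recovery ? SEV_WARNING : SEV_ERROR;
}

/* Behavior after the quoted hunk: always an error, recovery or not. */
Severity
flush_level_after(bool in_recovery)
{
	(void) in_recovery;			/* no longer consulted */
	return SEV_ERROR;
}

/* Printable name for a level; any other value is a programming error. */
const char *
severity_name(Severity s)
{
	switch (s)
	{
		case SEV_WARNING:
			return "WARNING";
		case SEV_ERROR:
			return "ERROR";
	}
	assert(!"unknown severity");
	return "?";
}
```

With the first function a bad LSN seen during recovery is only reported; with the second it raises an error whether or not the system is in recovery, which is why the comment above the check no longer matches the code.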

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Heikki Linnakangas (#7)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Mon, 2008-11-17 at 15:51 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:

diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/
index 063b366..5e64cb4 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -226,6 +226,9 @@ ZeroSUBTRANSPage(int pageno)
*
* oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
* if there are none.
+ *
+ * Note that this is not atomic and is not yet safe to perform while other
+ * processes might access subtrans.
*/
void
StartupSUBTRANS(TransactionId oldestActiveXID)

I'm a bit confused by that comment. Does that need to be fixed?

It is, in a later version. Apologies if you're reviewing the wrong one.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#11

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Heikki Linnakangas (#8)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Mon, 2008-11-17 at 16:18 +0200, Heikki Linnakangas wrote:

+
+                       /*
+                        * Load the flat authorization file into postmaster's ca
+                        * startup process won't have recomputed this from the d
+                        * yet, so it may change following recovery.
+                        */
+                       load_role();

Is there a race condition here too, if the startup process is writing
the auth file at the same time? I guess we'd have the same problem
with flat files in general, so maybe there's something else that
mitigates the problem?

The flat file is not race condition proof. When the file is read it is
just a guide and the real data is re-accessed from catalog. So the
problem you see does exist, but is handled elsewhere - not in this
patch.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#12

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Heikki Linnakangas (#8)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Mon, 2008-11-17 at 16:18 +0200, Heikki Linnakangas wrote:

Simon Riggs wrote:
@@ -3845,6 +3850,52 @@ sigusr1_handler(SIGNAL_ARGS)

PG_SETMASK(&BlockSig);
+       if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+       {
+               Assert(pmState == PM_STARTUP);
+
+               /*
+                * Go to shutdown mode if a shutdown request was pending.
+                */
+               if (Shutdown > NoShutdown)
+               {
+                       pmState = PM_WAIT_BACKENDS;
+                       /* PostmasterStateMachine logic does the rest */
+               }
+               else
+               {
+                       /*
+                        * Startup process has entered recovery
+                        */
+                       pmState = PM_RECOVERY;
Hmm, I smell a race condition here:

1. Startup process goes into consistent state, and signals postmaster
2. Startup process finishes WAL replay and dies
3. Postmaster wakes up in reaper(), noting that the startup process
dies, and goes into PM_RUN mode.
4. The above signal handler for postmaster is run, causing an assertion
failure, or putting postmaster back into PM_RECOVERY mode if assertions
are disabled.

Highly unlikely in practice, given how much code needs to run in the
startup process between signaling the postmaster and exiting, but it
seems theoretically possible. Do we care, and if we do, how can we fix it?

Might be possible - it does depend on the sequence of actions, it's true.
Agree not likely to happen, except as the result of another bug.

I'll change it to a test for

if (pmState == PM_STARTUP)
pmState = PM_RECOVERY;

The assertion was mainly for documentation; it's not protecting anything
critical (IIRC).
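
A compilable sketch of that guarded transition, for illustration only: the `PMState` enum here is a reduced stand-in for the postmaster's real one, and `handle_recovery_start` is a hypothetical name, not a function in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced stand-in for the postmaster state enum (subset only). */
typedef enum { PM_INIT, PM_STARTUP, PM_RECOVERY, PM_RUN } PMState;

/*
 * Hypothetical handler body for PMSIGNAL_RECOVERY_START.  The move to
 * PM_RECOVERY is applied only while the postmaster is still in
 * PM_STARTUP; a signal that is processed after the state has already
 * advanced is ignored.  Returns true if the state changed.
 */
bool
handle_recovery_start(PMState *pmState)
{
	assert(pmState != NULL);

	if (*pmState == PM_STARTUP)
	{
		*pmState = PM_RECOVERY;
		return true;
	}
	return false;
}
```

In the ordering Heikki describes, the reaper has already moved the state to PM_RUN by the time the handler runs, so the test fails and the state is left alone rather than being pushed back to PM_RECOVERY.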

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#13

Fujii Masao

masao.fujii@gmail.com

about 17 years ago

In reply to: Simon Riggs (#12)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Tue, Nov 18, 2008 at 12:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2008-11-17 at 16:18 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
@@ -3845,6 +3850,52 @@ sigusr1_handler(SIGNAL_ARGS)

PG_SETMASK(&BlockSig);
+       if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+       {
+               Assert(pmState == PM_STARTUP);
+
+               /*
+                * Go to shutdown mode if a shutdown request was pending.
+                */
+               if (Shutdown > NoShutdown)
+               {
+                       pmState = PM_WAIT_BACKENDS;
+                       /* PostmasterStateMachine logic does the rest */
+               }
+               else
+               {
+                       /*
+                        * Startup process has entered recovery
+                        */
+                       pmState = PM_RECOVERY;
Hmm, I smell a race condition here:

1. Startup process goes into consistent state, and signals postmaster
2. Startup process finishes WAL replay and dies
3. Postmaster wakes up in reaper(), noting that the startup process
dies, and goes into PM_RUN mode.
4. The above signal handler for postmaster is run, causing an assertion
failure, or putting postmaster back into PM_RECOVERY mode if assertions
are disabled.

Highly unlikely in practice, given how much code needs to run in the
startup process between signaling the postmaster and exiting, but it
seems theoretically possible. Do we care, and if we do, how can we fix it?
Might be possible - it does depend on the sequence of actions, it's true.
Agree not likely to happen, except as the result of another bug.

I'll change it to a test for

if (pmState == PM_STARTUP)
pmState = PM_RECOVERY;

Likewise, should we also change the assertions against the PIDs of the
background processes (BgWriterPID, PgStatPID)?
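
One possible shape for that change, sketched under assumptions: test the pid slot instead of asserting on it, as the patch already does for the bgwriter in reaper(). The names `start_if_not_running`, `Launcher`, `fake_launcher` and `launch_calls` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical launcher type: returns the child's pid, or 0 on failure. */
typedef pid_t (*Launcher) (void);

/* Counts launches so the example can show that a repeat call is a no-op. */
int			launch_calls = 0;

/* Stub launcher for the example; a real one would fork a child process. */
pid_t
fake_launcher(void)
{
	launch_calls++;
	return (pid_t) 4242;
}

/*
 * Start a child only if no pid is recorded yet, rather than asserting
 * that the slot is empty.  Returns the pid that should now be stored.
 */
pid_t
start_if_not_running(pid_t current, Launcher launch)
{
	assert(launch != NULL);

	if (current == 0)
		return launch();
	return current;
}
```

Used as `BgWriterPID = start_if_not_running(BgWriterPID, ...)`, a handler that runs twice, or after the child has already been started elsewhere, leaves the existing pid untouched.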

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#14

Pavan Deolasee

pavan.deolasee@gmail.com

about 17 years ago

In reply to: Simon Riggs (#10)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Mon, Nov 17, 2008 at 9:01 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

It is, in a later version. Apologies if you're reviewing the wrong one.

The most recent version I can find is v9, but I remember you mentioned v10
somewhere else.
Can you please confirm if v9 is the latest version and point to the latest
version if not ? I've some free cycles and would like to help with the
review process.

Thanks,
Pavan

Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

#15

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Pavan Deolasee (#14)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Thu, 2008-11-20 at 11:06 +0530, Pavan Deolasee wrote:

On Mon, Nov 17, 2008 at 9:01 PM, Simon Riggs <simon@2ndquadrant.com>
wrote:

It is, in a later version. Apologies if you're reviewing the
wrong one.

The most recent version I can find is v9, but I remember you mentioned
v10 somewhere else.
Can you please confirm if v9 is the latest version and point to the
latest version if not ? I've some free cycles and would like to help
with the review process.

The latest Hot Standby patch includes the latest version of
"infrastructure changes" patch. Thanks for reviewing.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#16

Pavan Deolasee

pavan.deolasee@gmail.com

about 17 years ago

In reply to: Simon Riggs (#15)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Thu, Nov 20, 2008 at 3:12 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

The latest Hot Standby patch includes the latest version of
"infrastructure changes" patch. Thanks for reviewing.

Do you intend to split the patch into smaller pieces ? The latest hot
standby patch is almost 10K+ lines. Splitting that would definitely help the
review process.

Thanks,
Pavan

--
Pavan Deolasee
EnterpriseDB http://www.enterprisedb.com

#17

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Pavan Deolasee (#16)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Thu, 2008-11-20 at 15:19 +0530, Pavan Deolasee wrote:

Do you intend to split the patch into smaller pieces ? The latest hot
standby patch is almost 10K+ lines. Splitting that would definitely
help the review process.

If it helps you, then I'll do it. Hang on an hour or so.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#18

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Simon Riggs (#17)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Thu, 2008-11-20 at 10:10 +0000, Simon Riggs wrote:

On Thu, 2008-11-20 at 15:19 +0530, Pavan Deolasee wrote:

Do you intend to split the patch into smaller pieces ? The latest hot
standby patch is almost 10K+ lines. Splitting that would definitely
help the review process.

If it helps you, then I'll do it. Hang on an hour or so.

I've posted a slightly subdivided patch now via Wiki.

Putting "infrastructure" and "hot standby" together was fairly easy, but
splitting them apart has not been and I was unable to complete that
after a lot of hacking.

If you wouldn't mind looking at the major subsystems some more, I'm
happy to attempt some further parceling to make it easier for you to
review. I'm not completely certain the "infra" v "hot standby" is a good
split point anyway.

Please let me know how I can make the reviewer's job easier. Diagrams,
writeups, whatever. Thanks,

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#19

Alvaro Herrera

alvherre@commandprompt.com

about 17 years ago

In reply to: Simon Riggs (#18)

Re: [PATCHES] Infrastructure changes for recovery (v8)

Simon Riggs wrote:

Please let me know how I can make the reviewer's job easier. Diagrams,
writeups, whatever. Thanks,

A link perhaps?

--
Alvaro Herrera http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

#20

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Alvaro Herrera (#19)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Wed, 2008-12-17 at 23:32 -0300, Alvaro Herrera wrote:

Simon Riggs wrote:

Please let me know how I can make the reviewer's job easier. Diagrams,
writeups, whatever. Thanks,

A link perhaps?

There is much confusion on this point for which I'm very sorry.

I originally wrote "infra" patch to allow it to be committed separately
in the Sept commitfest, to reduce size of the forthcoming hotstandby
patch. That didn't happen (no moans there) so the eventual "hotstandby"
patch includes all of what was the infra patch, plus the new code.

So currently there is no separate "infra" patch. The two line items on
the CommitFest page are really just one large project. I would be in
favour of removing the "infra" lines from the CommitFest page.

Of course we can consider "hotstandby" patch in parts, but
deconstructing it into wholly separate patches doesn't make much sense
now and would raise many questions about why certain code exists with no
apparent function or why certain design choices were made.

If you were to review a part of this, I might ask that you look at the
changes to XidInMVCCSnapshot(), GetSnapshotData() and
AssignTransactionId(), which relate specifically to subtransaction
handling. Comments explain the new approach.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#21

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 17 years ago

In reply to: Simon Riggs (#20)

1 attachment(s)

Re: [PATCHES] Infrastructure changes for recovery (v8)

Simon Riggs wrote:

On Wed, 2008-12-17 at 23:32 -0300, Alvaro Herrera wrote:

Simon Riggs wrote:

Please let me know how I can make the reviewer's job easier. Diagrams,
writeups, whatever. Thanks,

A link perhaps?

There is much confusion on this point for which I'm very sorry.

I originally wrote "infra" patch to allow it to be committed separately
in the Sept commitfest, to reduce size of the forthcoming hotstandby
patch. That didn't happen (no moans there) so the eventual "hotstandby"
patch includes all of what was the infra patch, plus the new code.

So currently there is no separate "infra" patch. The two line items on
the CommitFest page are really just one large project. I would be in
favour of removing the "infra" lines from the CommitFest page.

I think it's useful to review the "infra" part of the patch separately,
so I split it out of the big patch again. I haven't looked at this in
detail yet, but it compiles and passes regression tests.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

recovery-infra-separated-again-1.patch (text/x-diff)

*** src/backend/access/transam/clog.c
--- src/backend/access/transam/clog.c
***************
*** 475,480 **** ZeroCLOGPage(int pageno, bool writeXlog)
--- 475,483 ----
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
+  *
+  * We access just a single clog page, so this action is atomic and safe
+  * for use if other processes are active during recovery.
   */
  void
  StartupCLOG(void)
*** src/backend/access/transam/multixact.c
--- src/backend/access/transam/multixact.c
***************
*** 1413,1420 **** ZeroMultiXactMemberPage(int pageno, bool writeXlog)
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.	Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We don't need any locks here, really; the SLRU locks are taken
!  * only because slru.c expects to be called with locks held.
   */
  void
  StartupMultiXact(void)
--- 1413,1423 ----
   * MultiXactSetNextMXact and/or MultiXactAdvanceNextMXact.	Note that we
   * may already have replayed WAL data into the SLRU files.
   *
!  * We want this operation to be atomic to ensure that other processes can 
!  * use MultiXact while we complete recovery. We access one page only from the
!  * offset and members buffers, so once locks are acquired they will not be
!  * dropped and re-acquired by SLRU code. So we take both locks at start, then
!  * hold them all the way to the end.
   */
  void
  StartupMultiXact(void)
***************
*** 1426,1431 **** StartupMultiXact(void)
--- 1429,1435 ----
  
  	/* Clean up offsets state */
  	LWLockAcquire(MultiXactOffsetControlLock, LW_EXCLUSIVE);
+ 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * Initialize our idea of the latest page number.
***************
*** 1452,1461 **** StartupMultiXact(void)
  		MultiXactOffsetCtl->shared->page_dirty[slotno] = true;
  	}
  
- 	LWLockRelease(MultiXactOffsetControlLock);
- 
  	/* And the same for members */
- 	LWLockAcquire(MultiXactMemberControlLock, LW_EXCLUSIVE);
  
  	/*
  	 * Initialize our idea of the latest page number.
--- 1456,1462 ----
***************
*** 1483,1488 **** StartupMultiXact(void)
--- 1484,1490 ----
  	}
  
  	LWLockRelease(MultiXactMemberControlLock);
+ 	LWLockRelease(MultiXactOffsetControlLock);
  
  	/*
  	 * Initialize lastTruncationPoint to invalid, ensuring that the first
***************
*** 1542,1549 **** CheckPointMultiXact(void)
  	 * isn't valid (because StartupMultiXact hasn't been called yet) and so
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
  	 */
! 	if (!InRecovery)
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
--- 1544,1552 ----
  	 * isn't valid (because StartupMultiXact hasn't been called yet) and so
  	 * SimpleLruTruncate would get confused.  It seems best not to risk
  	 * removing any data during recovery anyway, so don't truncate.
+ 	 * We are executing in the bgwriter, so we must access shared status.
  	 */
! 	if (!IsRecoveryProcessingMode())
  		TruncateMultiXact();
  
  	TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
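The StartupMultiXact() hunks above replace two separate lock/unlock pairs with a single critical section: both control locks are taken at the start and held to the end, so other processes see the offsets and members state change together. A minimal sketch of that pattern follows, using POSIX mutexes as a stand-in for PostgreSQL LWLocks. The names offset_lock, member_lock, startup_both() and read_both() are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two SLRU control locks. */
static pthread_mutex_t offset_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t member_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical shared state guarded by the locks above. */
static int offset_latest_page = -1;
static int member_latest_page = -1;

/*
 * Update both pieces of state as one step: take both locks up front,
 * do all the work, then release in reverse order of acquisition.
 * Returns 0 on success, or the pthread error code.
 */
int
startup_both(int offset_page, int member_page)
{
	int rc;

	if ((rc = pthread_mutex_lock(&offset_lock)) != 0)
		return rc;
	if ((rc = pthread_mutex_lock(&member_lock)) != 0)
	{
		pthread_mutex_unlock(&offset_lock);
		return rc;
	}

	offset_latest_page = offset_page;
	member_latest_page = member_page;

	pthread_mutex_unlock(&member_lock);
	pthread_mutex_unlock(&offset_lock);
	return 0;
}

/* Read both values under both locks so a reader never sees a half-update. */
int
read_both(int *offset_page, int *member_page)
{
	int rc;

	assert(offset_page != NULL && member_page != NULL);

	/* Same acquisition order as the writer, which avoids deadlock. */
	if ((rc = pthread_mutex_lock(&offset_lock)) != 0)
		return rc;
	if ((rc = pthread_mutex_lock(&member_lock)) != 0)
	{
		pthread_mutex_unlock(&offset_lock);
		return rc;
	}

	*offset_page = offset_latest_page;
	*member_page = member_latest_page;

	pthread_mutex_unlock(&member_lock);
	pthread_mutex_unlock(&offset_lock);
	return 0;
}
```

Taking the locks in one fixed order in every code path keeps two processes from deadlocking against each other. Releasing in reverse order mirrors the patch, which releases MultiXactMemberControlLock before MultiXactOffsetControlLock.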
*** src/backend/access/transam/slru.c
--- src/backend/access/transam/slru.c
***************
*** 598,604 **** SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
  	 * commands to set the commit status of transactions whose bits are in
  	 * already-truncated segments of the commit log (see notes in
  	 * SlruPhysicalWritePage).	Hence, if we are InRecovery, allow the case
! 	 * where the file doesn't exist, and return zeroes instead.
  	 */
  	fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
  	if (fd < 0)
--- 598,605 ----
  	 * commands to set the commit status of transactions whose bits are in
  	 * already-truncated segments of the commit log (see notes in
  	 * SlruPhysicalWritePage).	Hence, if we are InRecovery, allow the case
! 	 * where the file doesn't exist, and return zeroes instead. We also
! 	 * return a zeroed page when seek and read fails. 
  	 */
  	fd = BasicOpenFile(path, O_RDWR | PG_BINARY, S_IRUSR | S_IWUSR);
  	if (fd < 0)
***************
*** 619,624 **** SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
--- 620,633 ----
  
  	if (lseek(fd, (off_t) offset, SEEK_SET) < 0)
  	{
+ 		if (InRecovery)
+ 		{
+ 			ereport(LOG,
+ 					(errmsg("file \"%s\" doesn't exist, reading as zeroes",
+ 							path)));
+ 			MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+ 			return true;
+ 		}
  		slru_errcause = SLRU_SEEK_FAILED;
  		slru_errno = errno;
  		close(fd);
***************
*** 628,633 **** SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
--- 637,650 ----
  	errno = 0;
  	if (read(fd, shared->page_buffer[slotno], BLCKSZ) != BLCKSZ)
  	{
+ 		if (InRecovery)
+ 		{
+ 			ereport(LOG,
+ 					(errmsg("file \"%s\" doesn't exist, reading as zeroes",
+ 							path)));
+ 			MemSet(shared->page_buffer[slotno], 0, BLCKSZ);
+ 			return true;
+ 		}
  		slru_errcause = SLRU_READ_FAILED;
  		slru_errno = errno;
  		close(fd);
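The two hunks above make SlruPhysicalReadPage() treat a failed seek or a short read during recovery like a missing file: the page buffer is zeroed and the read is reported as successful. Below is a rough, self-contained sketch of that fallback using plain POSIX calls. read_page_or_zero() and PAGE_SIZE_BYTES are hypothetical names, and the 8192-byte page size is an assumption mirroring the common BLCKSZ default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PAGE_SIZE_BYTES 8192	/* assumed page size, not taken from the patch */

/*
 * Read one page at 'offset' from 'fd' into 'buf'.
 *
 * If the seek fails or the read does not return a full page and
 * 'in_recovery' is set, hand back a zeroed page and report success.
 * Outside recovery the same failures are reported to the caller.
 * The caller owns 'fd' and is responsible for closing it.
 */
bool
read_page_or_zero(int fd, off_t offset, char *buf, bool in_recovery)
{
	assert(buf != NULL);

	if (lseek(fd, offset, SEEK_SET) < 0 ||
		read(fd, buf, PAGE_SIZE_BYTES) != PAGE_SIZE_BYTES)
	{
		if (!in_recovery)
			return false;

		/* Recovery path: pretend the page exists and is all zeroes. */
		memset(buf, 0, PAGE_SIZE_BYTES);
	}
	return true;
}
```

In this sketch the caller owns the file descriptor, so the function does not have to decide whether to close it on the fallback path.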
*** src/backend/access/transam/subtrans.c
--- src/backend/access/transam/subtrans.c
***************
*** 223,255 **** ZeroSUBTRANSPage(int pageno)
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
-  *
-  * oldestActiveXID is the oldest XID of any prepared transaction, or nextXid
-  * if there are none.
   */
  void
  StartupSUBTRANS(TransactionId oldestActiveXID)
  {
! 	int			startPage;
! 	int			endPage;
  
- 	/*
- 	 * Since we don't expect pg_subtrans to be valid across crashes, we
- 	 * initialize the currently-active page(s) to zeroes during startup.
- 	 * Whenever we advance into a new page, ExtendSUBTRANS will likewise zero
- 	 * the new page without regard to whatever was previously on disk.
- 	 */
  	LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
  
! 	startPage = TransactionIdToPage(oldestActiveXID);
! 	endPage = TransactionIdToPage(ShmemVariableCache->nextXid);
! 
! 	while (startPage != endPage)
! 	{
! 		(void) ZeroSUBTRANSPage(startPage);
! 		startPage++;
! 	}
! 	(void) ZeroSUBTRANSPage(startPage);
  
  	LWLockRelease(SubtransControlLock);
  }
--- 223,241 ----
  /*
   * This must be called ONCE during postmaster or standalone-backend startup,
   * after StartupXLOG has initialized ShmemVariableCache->nextXid.
   */
  void
  StartupSUBTRANS(TransactionId oldestActiveXID)
  {
! 	TransactionId xid = ShmemVariableCache->nextXid;
! 	int			pageno = TransactionIdToPage(xid);
  
  	LWLockAcquire(SubtransControlLock, LW_EXCLUSIVE);
  
! 	/*
! 	 * Initialize our idea of the latest page number.
! 	 */
! 	SubTransCtl->shared->latest_page_number = pageno;
  
  	LWLockRelease(SubtransControlLock);
  }
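After this hunk, StartupSUBTRANS() no longer zeroes every page between oldestActiveXID and nextXid; it only records the page that holds nextXid as the latest page. The XID-to-page mapping it relies on is integer division by the number of entries per page. Here is a small sketch, assuming 4-byte transaction IDs and 8192-byte pages (2048 entries per page). xid_to_page(), pages_in_range() and the SKETCH_ constants are hypothetical stand-ins for TransactionIdToPage() and the real build-time constants.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192u		/* assumed page size */
#define SKETCH_XACTS_PER_PAGE (SKETCH_BLCKSZ / (unsigned) sizeof(uint32_t))

/* Page number that holds the entry for 'xid'. */
int
xid_to_page(uint32_t xid)
{
	return (int) (xid / SKETCH_XACTS_PER_PAGE);
}

/*
 * Number of pages the pre-patch loop would have zeroed, from the page of
 * 'oldest_xid' through the page of 'next_xid' inclusive. This sketch
 * assumes no XID wraparound between the two values.
 */
int
pages_in_range(uint32_t oldest_xid, uint32_t next_xid)
{
	int start_page = xid_to_page(oldest_xid);
	int end_page = xid_to_page(next_xid);

	assert(start_page <= end_page);
	return end_page - start_page + 1;
}
```

Under these assumptions, XIDs 0 through 2047 map to page 0 and XID 2048 starts page 1, so an oldest active XID of 100 and a nextXid of 5000 span three pages.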
*** src/backend/access/transam/xact.c
--- src/backend/access/transam/xact.c
***************
*** 40,45 ****
--- 40,46 ----
  #include "storage/fd.h"
  #include "storage/lmgr.h"
  #include "storage/procarray.h"
+ #include "storage/sinval.h"
  #include "storage/sinvaladt.h"
  #include "storage/smgr.h"
  #include "utils/combocid.h"
*** src/backend/access/transam/xlog.c
--- src/backend/access/transam/xlog.c
***************
*** 114,120 **** CheckpointStatsData CheckpointStats;
  
  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system.
   */
  TimeLineID	ThisTimeLineID = 0;
  
--- 114,121 ----
  
  /*
   * ThisTimeLineID will be same in all backends --- it identifies current
!  * WAL timeline for the database system. Zero is always a bug, so we 
!  * start with that to allow us to spot any errors.
   */
  TimeLineID	ThisTimeLineID = 0;
  
***************
*** 122,128 **** TimeLineID	ThisTimeLineID = 0;
  bool		InRecovery = false;
  
  /* Are we recovering using offline XLOG archives? */
! static bool InArchiveRecovery = false;
  
  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;
--- 123,136 ----
  bool		InRecovery = false;
  
  /* Are we recovering using offline XLOG archives? */
! bool 		InArchiveRecovery = false;
! 
! /* Local copy of shared RecoveryProcessingMode state */
! static bool LocalRecoveryProcessingMode = true;
! static bool knownProcessingMode = false;
! 
! /* is the database proven consistent yet? */
! bool	reachedSafeStartPoint = false;
  
  /* Was the last xlog file restored from archive, or local? */
  static bool restoredFromArchive = false;
***************
*** 241,250 **** static XLogRecPtr RedoRecPtr;
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint (ensures only one
!  * checkpointer at a time; currently, with all checkpoints done by the
!  * bgwriter, this is just pro forma).
   *
   *----------
   */
  
--- 249,278 ----
   * ControlFileLock: must be held to read/update control file or create
   * new log file.
   *
!  * CheckpointLock: must be held to do a checkpoint or restartpoint, ensuring
!  * we get just one of those at any time. In 8.4+ recovery, both startup and
!  * bgwriter processes may take restartpoints, so this locking must be strict 
!  * to ensure there are no mistakes.
   *
+  * In 8.4 we progress through a number of states at startup. Initially, the
+  * postmaster is in PM_STARTUP state and spawns the Startup process. We then
+  * progress until the database is in a consistent state, then if we are in
+  * InArchiveRecovery we go into PM_RECOVERY state. The bgwriter then starts
+  * up and takes over responsibility for performing restartpoints. We then
+  * progress until the end of recovery when we enter PM_RUN state upon
+  * termination of the Startup process. In summary:
+  * 
+  * PM_STARTUP state:	Startup process performs restartpoints
+  * PM_RECOVERY state:	bgwriter process performs restartpoints
+  * PM_RUN state: 		bgwriter process performs checkpoints
+  *
+  * These transitions are fairly delicate, with many things that need to
+  * happen at the same time in order to change state successfully throughout
+  * the system. Changing PM_STARTUP to PM_RECOVERY only occurs when we can
+  * prove the databases are in a consistent state. Changing from PM_RECOVERY
+  * to PM_RUN happens whenever recovery ends, which could be forced upon us
+  * externally or it can occur because of damage or termination of the WAL
+  * sequence.
   *----------
   */
  
***************
*** 312,317 **** typedef struct XLogCtlData
--- 340,372 ----
  	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
  	TimeLineID	ThisTimeLineID;
  
+ 	/*
+ 	 * IsRecoveryProcessingMode shows whether the postmaster is in a
+ 	 * postmaster state earlier than PM_RUN, or not. This is a globally
+ 	 * accessible state to allow EXEC_BACKEND case.
+ 	 *
+ 	 * We also retain a local state variable InRecovery. InRecovery=true
+ 	 * means the code is being executed by Startup process and therefore
+ 	 * always during Recovery Processing Mode. This allows us to identify
+ 	 * code executed *during* Recovery Processing Mode but not necessarily
+ 	 * by Startup process itself.
+ 	 *
+ 	 * This is only written to by the startup process, so no need for locking.
+ 	 */
+ 	bool		SharedRecoveryProcessingMode;
+ 
+ 	/*
+ 	 * recovery target control information
+ 	 *
+ 	 * Protected by info_lck
+ 	 */
+ 	TransactionId	recoveryTargetXid;
+ 	TimestampTz		recoveryTargetTime;
+ 	int				recoveryTargetAdvance;
+ 
+ 	TimestampTz 	recoveryLastXTime;
+ 	TransactionId 	recoveryLastXid;
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 398,405 **** static void XLogArchiveCleanup(const char *xlog);
--- 453,462 ----
  static void readRecoveryCommandFile(void);
  static void exitArchiveRecovery(TimeLineID endTLI,
  					uint32 endLogId, uint32 endLogSeg);
+ static void exitRecovery(void);
  static bool recoveryStopsHere(XLogRecord *record, bool *includeThis);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
+ static XLogRecPtr GetRedoLocationForCheckpoint(void);
  
  static bool XLogCheckBuffer(XLogRecData *rdata, bool doPageWrites,
  				XLogRecPtr *lsn, BkpBlock *bkpb);
***************
*** 482,487 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 539,552 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		isRecoveryEnd = (rmid == RM_XLOG_ID && 
+ 									(info == XLOG_RECOVERY_END ||
+ 									 info == XLOG_CHECKPOINT_ONLINE));
+ 
+ 	/* cross-check on whether we should be here or not */
+ 	if (IsRecoveryProcessingMode() && !isRecoveryEnd)
+ 		elog(FATAL, "cannot make new WAL entries during recovery "
+ 					"(RMgrId = %d info = %d)", rmid, info);
  
  	/* info's high bits are reserved for use by me */
  	if (info & XLR_INFO_MASK)
***************
*** 1728,1735 **** XLogFlush(XLogRecPtr record)
  	XLogRecPtr	WriteRqstPtr;
  	XLogwrtRqst WriteRqst;
  
! 	/* Disabled during REDO */
! 	if (InRedo)
  		return;
  
  	/* Quick exit if already known flushed */
--- 1793,1799 ----
  	XLogRecPtr	WriteRqstPtr;
  	XLogwrtRqst WriteRqst;
  
! 	if (IsRecoveryProcessingMode())
  		return;
  
  	/* Quick exit if already known flushed */
***************
*** 1817,1826 **** XLogFlush(XLogRecPtr record)
  	 * the bad page is encountered again during recovery then we would be
  	 * unable to restart the database at all!  (This scenario has actually
  	 * happened in the field several times with 7.1 releases. Note that we
! 	 * cannot get here while InRedo is true, but if the bad page is brought in
! 	 * and marked dirty during recovery then CreateCheckPoint will try to
! 	 * flush it at the end of recovery.)
  	 *
  	 * The current approach is to ERROR under normal conditions, but only
  	 * WARNING during recovery, so that the system can be brought up even if
  	 * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR will
--- 1881,1891 ----
  	 * the bad page is encountered again during recovery then we would be
  	 * unable to restart the database at all!  (This scenario has actually
  	 * happened in the field several times with 7.1 releases. Note that we
! 	 * cannot get here while IsRecoveryProcessingMode(), but if the bad page is
! 	 * brought in and marked dirty during recovery, the next checkpoint after
! 	 * recovery will try to flush it.
  	 *
+ 	 * XXX obsolete comment
  	 * The current approach is to ERROR under normal conditions, but only
  	 * WARNING during recovery, so that the system can be brought up even if
  	 * there's a corrupt LSN.  Note that for calls from xact.c, the ERROR will
***************
*** 1829,1835 **** XLogFlush(XLogRecPtr record)
  	 * and so we will not force a restart for a bad LSN on a data page.
  	 */
  	if (XLByteLT(LogwrtResult.Flush, record))
! 		elog(InRecovery ? WARNING : ERROR,
  		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
--- 1894,1900 ----
  	 * and so we will not force a restart for a bad LSN on a data page.
  	 */
  	if (XLByteLT(LogwrtResult.Flush, record))
! 		elog(ERROR,
  		"xlog flush request %X/%X is not satisfied --- flushed only to %X/%X",
  			 record.xlogid, record.xrecoff,
  			 LogwrtResult.Flush.xlogid, LogwrtResult.Flush.xrecoff);
***************
*** 2102,2108 **** XLogFileInit(uint32 log, uint32 seg,
  		unlink(tmppath);
  	}
  
! 	elog(DEBUG2, "done creating and filling new WAL file");
  
  	/* Set flag to tell caller there was no existent file */
  	*use_existent = false;
--- 2167,2174 ----
  		unlink(tmppath);
  	}
  
! 	XLogFileName(tmppath, ThisTimeLineID, log, seg);
! 	elog(DEBUG2, "done creating and filling new WAL file %s", tmppath);
  
  	/* Set flag to tell caller there was no existent file */
  	*use_existent = false;
***************
*** 2408,2413 **** XLogFileRead(uint32 log, uint32 seg, int emode)
--- 2474,2501 ----
  					 xlogfname);
  			set_ps_display(activitymsg, false);
  
+ 			/* 
+ 			 * Calculate and write out a new safeStartPoint. This defines
+ 			 * the latest LSN that might appear on-disk while we apply
+ 			 * the WAL records in this file. If we crash during recovery
+ 			 * we must reach this point again before we can prove
+ 			 * database consistency. Not a restartpoint! Restart points
+ 			 * define where we should start recovery from, if we crash.
+ 			 */
+ 			if (InArchiveRecovery)
+ 			{
+ 				uint32 nextLog = log;
+ 				uint32 nextSeg = seg;
+ 
+ 				NextLogSeg(nextLog, nextSeg);
+ 
+ 				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 				ControlFile->minSafeStartPoint.xlogid = nextLog;
+ 				ControlFile->minSafeStartPoint.xrecoff = nextSeg * XLogSegSize;
+ 				UpdateControlFile();
+ 				LWLockRelease(ControlFileLock);
+ 			}
+ 
  			return fd;
  		}
  		if (errno != ENOENT)	/* unexpected failure? */
***************
*** 4733,4754 **** exitArchiveRecovery(TimeLineID endTLI, uint32 endLogId, uint32 endLogSeg)
  	unlink(recoveryPath);		/* ignore any error */
  
  	/*
! 	 * Rename the config file out of the way, so that we don't accidentally
! 	 * re-enter archive recovery mode in a subsequent crash.
  	 */
- 	unlink(RECOVERY_COMMAND_DONE);
- 	if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
- 		ereport(FATAL,
- 				(errcode_for_file_access(),
- 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
- 						RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
  
  	ereport(LOG,
  			(errmsg("archive recovery complete")));
  }
  
  /*
!  * For point-in-time recovery, this function decides whether we want to
   * stop applying the XLOG at or after the current record.
   *
   * Returns TRUE if we are stopping, FALSE otherwise.  On TRUE return,
--- 4821,4840 ----
  	unlink(recoveryPath);		/* ignore any error */
  
  	/*
! 	 * As of 8.4 we no longer rename the recovery.conf file out of the
! 	 * way until after we have performed a full checkpoint. This ensures
! 	 * that any crash between now and the end of the checkpoint does not
! 	 * attempt to restart from a WAL file that is no longer available to us.
! 	 * As soon as we remove recovery.conf we lose our recovery_command and
! 	 * cannot reaccess WAL files from the archive.
  	 */
  
  	ereport(LOG,
  			(errmsg("archive recovery complete")));
  }
  
  /*
!  * For archive recovery, this function decides whether we want to
   * stop applying the XLOG at or after the current record.
   *
   * Returns TRUE if we are stopping, FALSE otherwise.  On TRUE return,
***************
*** 4876,4881 **** StartupXLOG(void)
--- 4962,4968 ----
  	CheckPoint	checkPoint;
  	bool		wasShutdown;
  	bool		reachedStopPoint = false;
+ 	bool		performedRecovery = false;
  	bool		haveBackupLabel = false;
  	XLogRecPtr	RecPtr,
  				LastRec,
***************
*** 4888,4893 **** StartupXLOG(void)
--- 4975,4982 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  
+ 	XLogCtl->SharedRecoveryProcessingMode = true;
+ 
  	/*
  	 * Read control file and check XLOG status looks valid.
  	 *
***************
*** 5108,5116 **** StartupXLOG(void)
--- 5197,5211 ----
  		if (minRecoveryLoc.xlogid != 0 || minRecoveryLoc.xrecoff != 0)
  			ControlFile->minRecoveryPoint = minRecoveryLoc;
  		ControlFile->time = (pg_time_t) time(NULL);
+ 		/* No need to hold ControlFileLock yet, we aren't up far enough */
  		UpdateControlFile();
  
  		/*
+ 		 * Reset pgstat data, because it may be invalid after recovery.
+ 		 */
+ 		pgstat_reset_all();
+ 
+ 		/*
  		 * If there was a backup label file, it's done its job and the info
  		 * has now been propagated into pg_control.  We must get rid of the
  		 * label file so that if we crash during recovery, we'll pick up at
***************
*** 5220,5225 **** StartupXLOG(void)
--- 5315,5348 ----
  
  				LastRec = ReadRecPtr;
  
+ 				/*
+ 				 * Can we signal Postmaster to enter consistent recovery mode?
+ 				 *
+ 				 * There are two points in the log that we must pass. The first
+ 				 * is minRecoveryPoint, which is the LSN at the time the
+ 				 * base backup was taken that we are about to rollforward from.
+ 				 * If recovery has ever crashed or been stopped, there is a
+ 				 * second point: minSafeStartPoint, which is the latest LSN
+ 				 * that recovery could have reached prior to the crash.
+ 				 *
+ 				 * We must also have assembled sufficient information about
+ 				 * transaction state to allow valid snapshots to be taken.
+ 				 */
+ 				if (!reachedSafeStartPoint &&
+ 					 XLByteLE(ControlFile->minSafeStartPoint, EndRecPtr) && 
+ 					 XLByteLE(ControlFile->minRecoveryPoint, EndRecPtr))
+ 				{
+ 					reachedSafeStartPoint = true;
+ 					if (InArchiveRecovery)
+ 					{
+ 						ereport(LOG,
+ 							(errmsg("database has now reached consistent state at %X/%X",
+ 								EndRecPtr.xlogid, EndRecPtr.xrecoff)));
+ 						if (IsUnderPostmaster)
+ 							SendPostmasterSignal(PMSIGNAL_RECOVERY_START);
+ 					}
+ 				}
+ 
  				record = ReadRecord(NULL, LOG);
  			} while (record != NULL && recoveryContinue);
  
***************
*** 5241,5246 **** StartupXLOG(void)
--- 5364,5370 ----
  			/* there are no WAL records following the checkpoint */
  			ereport(LOG,
  					(errmsg("redo is not required")));
+ 			reachedSafeStartPoint = true;
  		}
  	}
  
***************
*** 5254,5269 **** StartupXLOG(void)
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
! 	 * dump consistent.
  	 */
! 	if (XLByteLT(EndOfLog, ControlFile->minRecoveryPoint))
  	{
  		if (reachedStopPoint)	/* stopped because of stop request */
  			ereport(FATAL,
  					(errmsg("requested recovery stop point is before end time of backup dump")));
  		else	/* ran off end of WAL */
  			ereport(FATAL,
! 					(errmsg("WAL ends before end time of backup dump")));
  	}
  
  	/*
--- 5378,5393 ----
  
  	/*
  	 * Complain if we did not roll forward far enough to render the backup
! 	 * dump consistent and start safely.
  	 */
! 	if (InArchiveRecovery && !reachedSafeStartPoint)
  	{
  		if (reachedStopPoint)	/* stopped because of stop request */
  			ereport(FATAL,
  					(errmsg("requested recovery stop point is before end time of backup dump")));
  		else	/* ran off end of WAL */
  			ereport(FATAL,
! 					(errmsg("end of WAL reached before end time of backup dump")));
  	}
  
  	/*
***************
*** 5378,5416 **** StartupXLOG(void)
  		XLogCheckInvalidPages();
  
  		/*
! 		 * Reset pgstat data, because it may be invalid after recovery.
  		 */
! 		pgstat_reset_all();
  
! 		/*
! 		 * Perform a checkpoint to update all our recovery activity to disk.
! 		 *
! 		 * Note that we write a shutdown checkpoint rather than an on-line
! 		 * one. This is not particularly critical, but since we may be
! 		 * assigning a new TLI, using a shutdown checkpoint allows us to have
! 		 * the rule that TLI only changes in shutdown checkpoints, which
! 		 * allows some extra error checking in xlog_redo.
! 		 */
! 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
  	}
  
- 	/*
- 	 * Preallocate additional log files, if wanted.
- 	 */
- 	PreallocXlogFiles(EndOfLog);
- 
- 	/*
- 	 * Okay, we're officially UP.
- 	 */
- 	InRecovery = false;
- 
- 	ControlFile->state = DB_IN_PRODUCTION;
- 	ControlFile->time = (pg_time_t) time(NULL);
- 	UpdateControlFile();
- 
- 	/* start the archive_timeout timer running */
- 	XLogCtl->Write.lastSegSwitchTime = ControlFile->time;
- 
  	/* initialize shared-memory copy of latest checkpoint XID/epoch */
  	XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
--- 5502,5515 ----
  		XLogCheckInvalidPages();
  
  		/*
! 		 * Finally exit recovery and mark that in WAL. Pre-8.4 we wrote
! 		 * a shutdown checkpoint here, but we ask bgwriter to do that now.
  		 */
! 		exitRecovery();
  
! 		performedRecovery = true;
  	}
  
  	/* initialize shared-memory copy of latest checkpoint XID/epoch */
  	XLogCtl->ckptXidEpoch = ControlFile->checkPointCopy.nextXidEpoch;
  	XLogCtl->ckptXid = ControlFile->checkPointCopy.nextXid;
***************
*** 5444,5449 **** StartupXLOG(void)
--- 5543,5641 ----
  		readRecordBuf = NULL;
  		readRecordBufSize = 0;
  	}
+ 
+ 	/*
+ 	 * Prior to 8.4 we wrote a Shutdown Checkpoint at the end of recovery.
+ 	 * This could add minutes to the startup time, so we want bgwriter
+ 	 * to perform it. This then frees the Startup process to complete so we can
+ 	 * allow transactions and WAL inserts. We still write a checkpoint, but
+ 	 * it will be an online checkpoint. Online checkpoints have a redo
+ 	 * location that can be prior to the actual checkpoint record. So we want
+ 	 * to derive that redo location *before* we let anybody else write WAL,
+ 	 * otherwise we might miss some WAL records if we crash.
+ 	 */
+ 	if (performedRecovery)
+ 	{
+ 		XLogRecPtr	redo;
+ 
+ 		/* 
+ 		 * We must grab the pointer before anybody writes WAL 
+ 		 */
+ 		redo = GetRedoLocationForCheckpoint();
+ 
+ 		/* 
+ 		 * Set up information for the bgwriter, but if it is not active
+ 		 * for whatever reason, perform the checkpoint ourselves.
+ 		 */
+ 		if (SetRedoLocationForArchiveCheckpoint(redo))
+ 		{
+ 			/*
+ 			 * Okay, we can come up now. Allow others to write WAL.
+ 			 */
+ 			XLogCtl->SharedRecoveryProcessingMode = false;
+ 
+ 			/*
+ 			 * Now request checkpoint from bgwriter.
+ 			 */
+ 			RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * Startup process performs the checkpoint, but defers
+ 			 * the change in processing mode until afterwards.
+ 			 */
+ 			CreateCheckPoint(CHECKPOINT_FORCE | CHECKPOINT_IMMEDIATE);
+ 		}
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * No recovery, so lets just get on with it. 
+ 		 */
+ 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 		ControlFile->state = DB_IN_PRODUCTION;
+ 		ControlFile->time = (pg_time_t) time(NULL);
+ 		UpdateControlFile();
+ 		LWLockRelease(ControlFileLock);
+ 	}
+ 
+ 	/*
+ 	 * Okay, we can come up now. Allow others to write WAL.
+ 	 */
+ 	XLogCtl->SharedRecoveryProcessingMode = false;
+ 
+ 	/* start the archive_timeout timer running */
+ 	XLogCtl->Write.lastSegSwitchTime = (pg_time_t) time(NULL);
+ }
+ 
+ /*
+  * IsRecoveryProcessingMode()
+  *
+  * Fast test for whether we're still in recovery or not. We test the shared
+  * state each time only until we leave recovery mode. After that we never
+  * look again, relying upon the settings of our local state variables. This
+  * is designed to avoid the need for a separate initialisation step.
+  */
+ bool
+ IsRecoveryProcessingMode(void)
+ {
+ 	if (knownProcessingMode && !LocalRecoveryProcessingMode)
+ 		return false;
+ 
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		if (xlogctl == NULL)
+ 			return false;
+ 
+ 		LocalRecoveryProcessingMode = xlogctl->SharedRecoveryProcessingMode;
+ 	}
+ 
+ 	knownProcessingMode = true;
+ 
+ 	return LocalRecoveryProcessingMode;
  }
  
  /*
***************
*** 5701,5720 **** ShutdownXLOG(int code, Datum arg)
  static void
  LogCheckpointStart(int flags)
  {
! 	elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! 		 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! 		 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! 		 (flags & CHECKPOINT_FORCE) ? " force" : "",
! 		 (flags & CHECKPOINT_WAIT) ? " wait" : "",
! 		 (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! 		 (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }
  
  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(void)
  {
  	long		write_secs,
  				sync_secs,
--- 5893,5916 ----
  static void
  LogCheckpointStart(int flags)
  {
! 	if (flags & CHECKPOINT_RESTARTPOINT)
! 		elog(LOG, "restartpoint starting:%s",
! 			(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "");
! 	else
! 		elog(LOG, "checkpoint starting:%s%s%s%s%s%s",
! 			 (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
! 			 (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
! 			 (flags & CHECKPOINT_FORCE) ? " force" : "",
! 			 (flags & CHECKPOINT_WAIT) ? " wait" : "",
! 			 (flags & CHECKPOINT_CAUSE_XLOG) ? " xlog" : "",
! 			 (flags & CHECKPOINT_CAUSE_TIME) ? " time" : "");
  }
  
  /*
   * Log end of a checkpoint.
   */
  static void
! LogCheckpointEnd(int flags)
  {
  	long		write_secs,
  				sync_secs,
***************
*** 5737,5753 **** LogCheckpointEnd(void)
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
! 	elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! 		 "%d transaction log file(s) added, %d removed, %d recycled; "
! 		 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 		 CheckpointStats.ckpt_bufs_written,
! 		 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 		 CheckpointStats.ckpt_segs_added,
! 		 CheckpointStats.ckpt_segs_removed,
! 		 CheckpointStats.ckpt_segs_recycled,
! 		 write_secs, write_usecs / 1000,
! 		 sync_secs, sync_usecs / 1000,
! 		 total_secs, total_usecs / 1000);
  }
  
  /*
--- 5933,5958 ----
  						CheckpointStats.ckpt_sync_end_t,
  						&sync_secs, &sync_usecs);
  
! 	if (flags & CHECKPOINT_RESTARTPOINT)
! 		elog(LOG, "restartpoint complete: wrote %d buffers (%.1f%%); "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 			 CheckpointStats.ckpt_bufs_written,
! 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 			 write_secs, write_usecs / 1000,
! 			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
! 	else
! 		elog(LOG, "checkpoint complete: wrote %d buffers (%.1f%%); "
! 			 "%d transaction log file(s) added, %d removed, %d recycled; "
! 			 "write=%ld.%03d s, sync=%ld.%03d s, total=%ld.%03d s",
! 			 CheckpointStats.ckpt_bufs_written,
! 			 (double) CheckpointStats.ckpt_bufs_written * 100 / NBuffers,
! 			 CheckpointStats.ckpt_segs_added,
! 			 CheckpointStats.ckpt_segs_removed,
! 			 CheckpointStats.ckpt_segs_recycled,
! 			 write_secs, write_usecs / 1000,
! 			 sync_secs, sync_usecs / 1000,
! 			 total_secs, total_usecs / 1000);
  }
  
  /*
***************
*** 5772,5788 **** CreateCheckPoint(int flags)
  	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecData rdata;
- 	uint32		freespace;
  	uint32		_logId;
  	uint32		_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
  
  	/*
  	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! 	 * (This is just pro forma, since in the present system structure there is
! 	 * only one process that is allowed to issue checkpoints at any given
! 	 * time.)
  	 */
  	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
--- 5977,5992 ----
  	XLogRecPtr	recptr;
  	XLogCtlInsert *Insert = &XLogCtl->Insert;
  	XLogRecData rdata;
  	uint32		_logId;
  	uint32		_logSeg;
  	TransactionId *inCommitXids;
  	int			nInCommit;
+ 	bool		leavingArchiveRecovery = false;
  
  	/*
  	 * Acquire CheckpointLock to ensure only one checkpoint happens at a time.
! 	 * That shouldn't be happening, but checkpoints are an important aspect
! 	 * of our resilience, so we take no chances.
  	 */
  	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
  
***************
*** 5797,5811 **** CreateCheckPoint(int flags)
--- 6001,6024 ----
  	CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
  
  	/*
+ 	 * Find out if this is the first checkpoint after archive recovery.
+ 	 */
+ 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 	leavingArchiveRecovery = (ControlFile->state == DB_IN_ARCHIVE_RECOVERY);
+ 	LWLockRelease(ControlFileLock);
+ 
+ 	/*
  	 * Use a critical section to force system panic if we have trouble.
  	 */
  	START_CRIT_SECTION();
  
  	if (shutdown)
  	{
+ 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  		ControlFile->state = DB_SHUTDOWNING;
  		ControlFile->time = (pg_time_t) time(NULL);
  		UpdateControlFile();
+ 		LWLockRelease(ControlFileLock);
  	}
  
  	/*
***************
*** 5861,5901 **** CreateCheckPoint(int flags)
  		}
  	}
  
! 	/*
! 	 * Compute new REDO record ptr = location of next XLOG record.
! 	 *
! 	 * NB: this is NOT necessarily where the checkpoint record itself will be,
! 	 * since other backends may insert more XLOG records while we're off doing
! 	 * the buffer flush work.  Those XLOG records are logically after the
! 	 * checkpoint, even though physically before it.  Got that?
! 	 */
! 	freespace = INSERT_FREESPACE(Insert);
! 	if (freespace < SizeOfXLogRecord)
! 	{
! 		(void) AdvanceXLInsertBuffer(false);
! 		/* OK to ignore update return flag, since we will do flush anyway */
! 		freespace = INSERT_FREESPACE(Insert);
! 	}
! 	INSERT_RECPTR(checkPoint.redo, Insert, Insert->curridx);
! 
! 	/*
! 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
! 	 * must be done while holding the insert lock AND the info_lck.
! 	 *
! 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
! 	 * pointing past where it really needs to point.  This is okay; the only
! 	 * consequence is that XLogInsert might back up whole buffers that it
! 	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
! 	 * XLogInserts that happen while we are dumping buffers must assume that
! 	 * their buffer changes are not included in the checkpoint.
! 	 */
  	{
! 		/* use volatile pointer to prevent code rearrangement */
! 		volatile XLogCtlData *xlogctl = XLogCtl;
! 
! 		SpinLockAcquire(&xlogctl->info_lck);
! 		RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
! 		SpinLockRelease(&xlogctl->info_lck);
  	}
  
  	/*
--- 6074,6092 ----
  		}
  	}
  
! 	if (leavingArchiveRecovery)
! 		checkPoint.redo = GetRedoLocationForArchiveCheckpoint();
! 	else
  	{
! 		/*
! 		 * Compute new REDO record ptr = location of next XLOG record.
! 		 *
! 		 * NB: this is NOT necessarily where the checkpoint record itself will be,
! 		 * since other backends may insert more XLOG records while we're off doing
! 		 * the buffer flush work.  Those XLOG records are logically after the
! 		 * checkpoint, even though physically before it.  Got that?
! 		 */
! 		checkPoint.redo = GetRedoLocationForCheckpoint();
  	}
  
  	/*
***************
*** 6013,6023 **** CreateCheckPoint(int flags)
  	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
  
  	/*
! 	 * Update the control file.
  	 */
  	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  	if (shutdown)
  		ControlFile->state = DB_SHUTDOWNED;
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
  	ControlFile->checkPoint = ProcLastRecPtr;
  	ControlFile->checkPointCopy = checkPoint;
--- 6204,6221 ----
  	XLByteToSeg(ControlFile->checkPointCopy.redo, _logId, _logSeg);
  
  	/*
! 	 * Update the control file. In 8.4, this routine becomes the primary
! 	 * point for recording changes of state in the control file at the
! 	 * end of recovery. Postmaster state already shows us being in
! 	 * normal running mode, but it is only after this point that a crash
! 	 * will no longer force us to repeat recovery.  Note that this is
! 	 * executed by bgwriter after the death of the Startup process.
  	 */
  	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  	if (shutdown)
  		ControlFile->state = DB_SHUTDOWNED;
+ 	else
+ 		ControlFile->state = DB_IN_PRODUCTION;
  	ControlFile->prevCheckPoint = ControlFile->checkPoint;
  	ControlFile->checkPoint = ProcLastRecPtr;
  	ControlFile->checkPointCopy = checkPoint;
***************
*** 6025,6030 **** CreateCheckPoint(int flags)
--- 6223,6243 ----
  	UpdateControlFile();
  	LWLockRelease(ControlFileLock);
  
+ 	if (leavingArchiveRecovery)
+ 	{
+ 		/*
+ 		 * Rename the config file out of the way, so that we don't accidentally
+ 		 * re-enter archive recovery mode in a subsequent crash. Prior to
+ 		 * 8.4 this step was performed at end of exitArchiveRecovery().
+ 		 */
+ 		unlink(RECOVERY_COMMAND_DONE);
+ 		if (rename(RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE) != 0)
+ 			ereport(ERROR,
+ 				    (errcode_for_file_access(),
+ 					 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 								RECOVERY_COMMAND_FILE, RECOVERY_COMMAND_DONE)));
+ 	}
+ 
  	/* Update shared-memory copy of checkpoint XID/epoch */
  	{
  		/* use volatile pointer to prevent code rearrangement */
***************
*** 6068,6082 **** CreateCheckPoint(int flags)
  	 * Truncate pg_subtrans if possible.  We can throw away all data before
  	 * the oldest XMIN of any running transaction.	No future transaction will
  	 * attempt to reference any pg_subtrans entry older than that (see Asserts
! 	 * in subtrans.c).	During recovery, though, we mustn't do this because
! 	 * StartupSUBTRANS hasn't been called yet.
  	 */
! 	if (!InRecovery)
  		TruncateSUBTRANS(GetOldestXmin(true, false));
  
  	/* All real work is done, but log before releasing lock. */
  	if (log_checkpoints)
! 		LogCheckpointEnd();
  
          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
--- 6281,6294 ----
  	 * Truncate pg_subtrans if possible.  We can throw away all data before
  	 * the oldest XMIN of any running transaction.	No future transaction will
  	 * attempt to reference any pg_subtrans entry older than that (see Asserts
! 	 * in subtrans.c).	
  	 */
! 	if (!shutdown)
  		TruncateSUBTRANS(GetOldestXmin(true, false));
  
  	/* All real work is done, but log before releasing lock. */
  	if (log_checkpoints)
! 		LogCheckpointEnd(flags);
  
          TRACE_POSTGRESQL_CHECKPOINT_DONE(CheckpointStats.ckpt_bufs_written,
                                  NBuffers, CheckpointStats.ckpt_segs_added,
***************
*** 6085,6090 **** CreateCheckPoint(int flags)
--- 6297,6347 ----
  
  	LWLockRelease(CheckpointLock);
  }
+  
+ /* 
+  * GetRedoLocationForCheckpoint()
+  *
+  * When !IsRecoveryProcessingMode() this must be called while holding
+  * WALInsertLock.
+  */
+ static XLogRecPtr
+ GetRedoLocationForCheckpoint(void)
+ {
+ 	XLogCtlInsert  *Insert = &XLogCtl->Insert;
+ 	uint32                  freespace;
+ 	XLogRecPtr              redo;
+ 
+ 	freespace = INSERT_FREESPACE(Insert);
+ 	if (freespace < SizeOfXLogRecord)
+ 	{
+ 	        (void) AdvanceXLInsertBuffer(false);
+ 	        /* OK to ignore update return flag, since we will do flush anyway */
+ 	        freespace = INSERT_FREESPACE(Insert);
+ 	}
+ 	INSERT_RECPTR(redo, Insert, Insert->curridx);
+ 
+ 	/*
+ 	 * Here we update the shared RedoRecPtr for future XLogInsert calls; this
+ 	 * must be done while holding the insert lock AND the info_lck.
+ 	 *
+ 	 * Note: if we fail to complete the checkpoint, RedoRecPtr will be left
+ 	 * pointing past where it really needs to point.  This is okay; the only
+ 	 * consequence is that XLogInsert might back up whole buffers that it
+ 	 * didn't really need to.  We can't postpone advancing RedoRecPtr because
+ 	 * XLogInserts that happen while we are dumping buffers must assume that
+ 	 * their buffer changes are not included in the checkpoint.
+ 	 */
+ 	{
+ 	        /* use volatile pointer to prevent code rearrangement */
+ 	        volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+         SpinLockAcquire(&xlogctl->info_lck);
+         RedoRecPtr = xlogctl->Insert.RedoRecPtr = redo;
+         SpinLockRelease(&xlogctl->info_lck);
+ 	}
+ 
+ 	return redo;
+ }
  
  /*
   * Flush all data in shared memory to disk, and fsync
***************
*** 6150,6180 **** RecoveryRestartPoint(const CheckPoint *checkPoint)
  			}
  	}
  
  	/*
! 	 * OK, force data out to disk
  	 */
! 	CheckPointGuts(checkPoint->redo, CHECKPOINT_IMMEDIATE);
  
  	/*
! 	 * Update pg_control so that any subsequent crash will restart from this
! 	 * checkpoint.	Note: ReadRecPtr gives the XLOG address of the checkpoint
! 	 * record itself.
  	 */
- 	ControlFile->prevCheckPoint = ControlFile->checkPoint;
- 	ControlFile->checkPoint = ReadRecPtr;
- 	ControlFile->checkPointCopy = *checkPoint;
- 	ControlFile->time = (pg_time_t) time(NULL);
- 	UpdateControlFile();
  
  	ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! 			(errmsg("recovery restart point at %X/%X",
! 					checkPoint->redo.xlogid, checkPoint->redo.xrecoff)));
! 	if (recoveryLastXTime)
! 		ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
! 				(errmsg("last completed transaction was at log time %s",
! 						timestamptz_to_str(recoveryLastXTime))));
! }
  
  /*
   * Write a NEXTOID log record
   */
--- 6407,6477 ----
  			}
  	}
  
+ 	RequestRestartPoint(ReadRecPtr, checkPoint, reachedSafeStartPoint);
+ }
+ 
+ /*
+ * As of 8.4, RestartPoints are always created by the bgwriter
+ * once we have reachedSafeStartPoint. We use bgwriter's shared memory
+ * area wherever we call it from, to keep better code structure.
+ */
+ void
+ CreateRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, int flags)
+ {
+ 	if (recoveryLogRestartpoints || log_checkpoints)
+ 	{
+   		/*
+ 		 * Prepare to accumulate statistics.
+   		 */
+ 
+ 		MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
+ 		CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
+ 
+ 		LogCheckpointStart(CHECKPOINT_RESTARTPOINT | flags);
+ 	}
+   
+   	/*
+ 	 * Acquire CheckpointLock to ensure only one restartpoint happens at a time.
+ 	 * We rely on this lock to ensure that the startup process doesn't exit
+ 	 * Recovery while we are half way through a restartpoint.
+   	 */
+ 	LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
+ 
+ 	CheckPointGuts(restartPoint->redo, CHECKPOINT_RESTARTPOINT | flags);
+ 
  	/*
! 	 * Update pg_control, using current time
  	 */
! 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
!   	ControlFile->prevCheckPoint = ControlFile->checkPoint;
! 	ControlFile->checkPoint = ReadPtr;
! 	ControlFile->checkPointCopy = *restartPoint;
!   	ControlFile->time = (pg_time_t) time(NULL);
!   	UpdateControlFile();
! 	LWLockRelease(ControlFileLock);
  
  	/*
! 	 * Currently, there is no need to truncate pg_subtrans during recovery.
! 	 * If we did that, we would need to have called StartupSUBTRANS()
! 	 * already and then TruncateSUBTRANS() would go here.
  	 */
  
+ 	/* All real work is done, but log before releasing lock. */
+ 	if (recoveryLogRestartpoints || log_checkpoints)
+ 		LogCheckpointEnd(CHECKPOINT_RESTARTPOINT);
+   
  	ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
!   			(errmsg("recovery restart point at %X/%X",
! 					restartPoint->redo.xlogid, restartPoint->redo.xrecoff)));
  
+   	if (recoveryLastXTime)
+   		ereport((recoveryLogRestartpoints ? LOG : DEBUG2),
+ 			(errmsg("last completed transaction was at log time %s",
+ 					timestamptz_to_str(recoveryLastXTime))));
+ 
+ 	LWLockRelease(CheckpointLock);
+ }
+   
  /*
   * Write a NEXTOID log record
   */
***************
*** 6237,6243 **** RequestXLogSwitch(void)
  }
  
  /*
!  * XLOG resource manager's routines
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 6534,6596 ----
  }
  
  /*
!  * exitRecovery()
!  *
!  * Exit recovery state and write an XLOG_RECOVERY_END record. This is the
!  * only record type that can record a change of timelineID. We assume
!  * caller has already set ThisTimeLineID, if appropriate.
!  */
! static void
! exitRecovery(void)
! {
! 	XLogRecData rdata;
! 
! 	rdata.buffer = InvalidBuffer;
! 	rdata.data = (char *) (&ThisTimeLineID);
! 	rdata.len = sizeof(TimeLineID);
! 	rdata.next = NULL;
! 
! 	/*
! 	 * If a restartpoint is in progress, we will not be able to successfully
! 	 * acquire CheckpointLock. If bgwriter is still in progress then send
! 	 * a second signal to nudge bgwriter to go faster so we can avoid delay.
! 	 * Then wait for lock, so we know the restartpoint has completed. We do
! 	 * this because we don't want to interrupt the restartpoint half way
! 	 * through, which might leave us in a mess and we want to be robust. We're
! 	 * going to checkpoint soon anyway, so it's not wasted effort.
! 	 */
! 	if (LWLockConditionalAcquire(CheckpointLock, LW_EXCLUSIVE))
! 		LWLockRelease(CheckpointLock);
! 	else
! 	{
! 		RequestRestartPointCompletion();
! 		ereport(DEBUG1,
! 			(errmsg("startup process waiting for restartpoint to complete")));
! 		LWLockAcquire(CheckpointLock, LW_EXCLUSIVE);
! 		LWLockRelease(CheckpointLock);
! 	}	
! 
! 	/*
! 	 * This is the only type of WAL message that can be inserted during
! 	 * recovery. This ensures that we don't allow others to get access
! 	 * until after we have changed state.
! 	 */
! 	(void) XLogInsert(RM_XLOG_ID, XLOG_RECOVERY_END, &rdata);
! 
! 	/*
! 	 * We don't XLogFlush() here otherwise we'll end up zeroing the WAL
! 	 * file ourselves. So just let bgwriter's forthcoming checkpoint do
! 	 * that for us.
! 	 */
! 
! 	InRecovery = false;
! }
! 
! /*
!  * XLOG resource manager's routines.
!  *
!  * Definitions of message info are in include/catalog/pg_control.h,
!  * though not all messages relate to control file processing.
   */
  void
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
***************
*** 6271,6293 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/*
! 		 * TLI may change in a shutdown checkpoint, but it shouldn't decrease
  		 */
- 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
- 		{
- 			if (checkPoint.ThisTimeLineID < ThisTimeLineID ||
- 				!list_member_int(expectedTLIs,
- 								 (int) checkPoint.ThisTimeLineID))
- 				ereport(PANIC,
- 						(errmsg("unexpected timeline ID %u (after %u) in checkpoint record",
- 								checkPoint.ThisTimeLineID, ThisTimeLineID)));
- 			/* Following WAL records should be run with new TLI */
- 			ThisTimeLineID = checkPoint.ThisTimeLineID;
- 		}
  
  		RecoveryRestartPoint(&checkPoint);
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
  		CheckPoint	checkPoint;
--- 6624,6663 ----
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
!   		/*
! 		 * TLI no longer changes at shutdown checkpoint, since as of 8.4,
! 		 * shutdown checkpoints only occur at shutdown. Much less confusing.
  		 */
  
  		RecoveryRestartPoint(&checkPoint);
  	}
+ 	else if (info == XLOG_RECOVERY_END)
+ 	{
+ 		TimeLineID	tli;
+ 
+ 		memcpy(&tli, XLogRecGetData(record), sizeof(TimeLineID));
+ 
+ 		/*
+ 		 * TLI may change when recovery ends, but it shouldn't decrease.
+ 		 *
+ 		 * This is the only WAL record that can tell us to change timelineID
+ 		 * while we process WAL records. 
+ 		 *
+ 		 * We can *choose* to stop recovery at any point, generating a
+ 		 * new timelineID which is recorded using this record type.
+ 		 */
+ 		if (tli != ThisTimeLineID)
+   		{
+ 			if (tli < ThisTimeLineID ||
+   				!list_member_int(expectedTLIs,
+ 								 (int) tli))
+   				ereport(PANIC,
+ 						(errmsg("unexpected timeline ID %u (after %u) at recovery end record",
+ 								tli, ThisTimeLineID)));
+   			/* Following WAL records should be run with new TLI */
+ 			ThisTimeLineID = tli;
+   		}
+   	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
  		CheckPoint	checkPoint;
***************
*** 6309,6315 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/* TLI should not change in an on-line checkpoint */
  		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  			ereport(PANIC,
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
--- 6679,6685 ----
  		ControlFile->checkPointCopy.nextXidEpoch = checkPoint.nextXidEpoch;
  		ControlFile->checkPointCopy.nextXid = checkPoint.nextXid;
  
! 		/* TLI must not change at a checkpoint */
  		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
  			ereport(PANIC,
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
***************
*** 6545,6550 **** pg_start_backup(PG_FUNCTION_ARGS)
--- 6915,6926 ----
  				 errhint("archive_command must be defined before "
  						 "online backups can be made safely.")));
  
+ 	if (IsRecoveryProcessingMode())
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("recovery is in progress"),
+ 				 errhint("WAL control functions cannot be executed during recovery.")));
+ 
  	backupidstr = text_to_cstring(backupid);
  
  	/*
***************
*** 6710,6715 **** pg_stop_backup(PG_FUNCTION_ARGS)
--- 7086,7097 ----
  				 errmsg("WAL archiving is not active"),
  				 errhint("archive_mode must be enabled at server start.")));
  
+ 	if (IsRecoveryProcessingMode())
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("recovery is in progress"),
+ 				 errhint("WAL control functions cannot be executed during recovery.")));
+ 
  	/*
  	 * OK to clear forcePageWrites
  	 */
***************
*** 6865,6870 **** pg_switch_xlog(PG_FUNCTION_ARGS)
--- 7247,7258 ----
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  			 (errmsg("must be superuser to switch transaction log files"))));
  
+ 	if (IsRecoveryProcessingMode())
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("recovery is in progress"),
+ 				 errhint("WAL control functions cannot be executed during recovery.")));
+ 
  	switchpoint = RequestXLogSwitch();
  
  	/*
***************
*** 6887,6892 **** pg_current_xlog_location(PG_FUNCTION_ARGS)
--- 7275,7286 ----
  {
  	char		location[MAXFNAMELEN];
  
+ 	if (IsRecoveryProcessingMode())
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("recovery is in progress"),
+ 				 errhint("WAL control functions cannot be executed during recovery.")));
+ 
  	/* Make sure we have an up-to-date local LogwrtResult */
  	{
  		/* use volatile pointer to prevent code rearrangement */
***************
*** 6914,6919 **** pg_current_xlog_insert_location(PG_FUNCTION_ARGS)
--- 7308,7319 ----
  	XLogRecPtr	current_recptr;
  	char		location[MAXFNAMELEN];
  
+ 	if (IsRecoveryProcessingMode())
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("recovery is in progress"),
+ 				 errhint("WAL control functions cannot be executed during recovery.")));
+ 
  	/*
  	 * Get the current end-of-WAL position ... shared lock is sufficient
  	 */
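
Note on the IsRecoveryProcessingMode() checks added to xlog.c above: the function caches the shared flag per process and stops reading shared memory once it has seen recovery end. The following is a minimal standalone sketch of that pattern, not the patch code itself. SharedState, attach_shared(), is_recovery_processing_mode() and the shared_reads counter are hypothetical names introduced only for illustration; the real code reads XLogCtl->SharedRecoveryProcessingMode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared-memory control structure. */
typedef struct SharedState
{
	bool		SharedRecoveryProcessingMode;
} SharedState;

static SharedState *shared = NULL;	/* NULL until shared memory is attached */

/* Per-process cached view of the shared flag. */
static bool knownProcessingMode = false;
static bool LocalRecoveryProcessingMode = true;

/* Illustration only: counts how often we actually read shared state. */
static int	shared_reads = 0;

/* Hypothetical helper: point this process at the shared structure. */
static void
attach_shared(SharedState *s)
{
	assert(s != NULL);
	shared = s;
}

static bool
is_recovery_processing_mode(void)
{
	/* Once we have seen recovery end, never look at shared state again. */
	if (knownProcessingMode && !LocalRecoveryProcessingMode)
		return false;

	{
		/* volatile pointer so the read is not cached or reordered away */
		volatile SharedState *s = shared;

		if (s == NULL)
			return false;	/* shared memory not attached yet */

		LocalRecoveryProcessingMode = s->SharedRecoveryProcessingMode;
		shared_reads++;
	}

	knownProcessingMode = true;

	return LocalRecoveryProcessingMode;
}
```

The transition is treated as one-way: after a process has observed the flag as false it keeps returning false without reading shared memory, which is what makes the test cheap enough to call from functions such as pg_start_backup().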
*** src/backend/commands/dbcommands.c
--- src/backend/commands/dbcommands.c
***************
*** 1976,1981 **** dbase_redo(XLogRecPtr lsn, XLogRecord *record)
--- 1976,1986 ----
  		 * We don't need to copy subdirectories
  		 */
  		copydir(src_path, dst_path, false);
+ 
+ 		/*
+ 		 * Flat files are updated immediately following transaction commit.
+ 	 	 * Nothing to do here.
+ 		 */
  	}
  	else if (info == XLOG_DBASE_DROP)
  	{
***************
*** 1998,2003 **** dbase_redo(XLogRecPtr lsn, XLogRecord *record)
--- 2003,2012 ----
  			ereport(WARNING,
  					(errmsg("some useless files may be left behind in old database directory \"%s\"",
  							dst_path)));
+ 		/*
+ 		 * Flat files are updated immediately following transaction commit.
+ 	 	 * Nothing to do here.
+ 		 */
  	}
  	else
  		elog(PANIC, "dbase_redo: unknown op code %u", info);
*** src/backend/postmaster/bgwriter.c
--- src/backend/postmaster/bgwriter.c
***************
*** 49,54 ****
--- 49,55 ----
  #include <unistd.h>
  
  #include "access/xlog_internal.h"
+ #include "catalog/pg_control.h"
  #include "libpq/pqsignal.h"
  #include "miscadmin.h"
  #include "pgstat.h"
***************
*** 129,134 **** typedef struct
--- 130,142 ----
  
  	int			ckpt_flags;		/* checkpoint flags, as defined in xlog.h */
  
+ 	/* 
+ 	 * When the Startup process wants bgwriter to perform a restartpoint, it 
+ 	 * sets these fields so that we can update the control file afterwards.
+ 	 */
+ 	XLogRecPtr	ReadPtr;		/* Requested log pointer */
+ 	CheckPoint  restartPoint;	/* restartPoint data for ControlFile */
+ 
  	uint32		num_backend_writes;		/* counts non-bgwriter buffer writes */
  
  	int			num_requests;	/* current # of requests */
***************
*** 165,171 **** static bool ckpt_active = false;
  
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;
  static double ckpt_cached_elapsed;
  
  static pg_time_t last_checkpoint_time;
--- 173,179 ----
  
  /* these values are valid when ckpt_active is true: */
  static pg_time_t ckpt_start_time;
! static XLogRecPtr ckpt_start_recptr;	/* not used if IsRecoveryProcessingMode */
  static double ckpt_cached_elapsed;
  
  static pg_time_t last_checkpoint_time;
***************
*** 197,202 **** BackgroundWriterMain(void)
--- 205,211 ----
  {
  	sigjmp_buf	local_sigjmp_buf;
  	MemoryContext bgwriter_context;
+ 	bool		BgWriterRecoveryMode;
  
  	BgWriterShmem->bgwriter_pid = MyProcPid;
  	am_bg_writer = true;
***************
*** 355,370 **** BackgroundWriterMain(void)
  	 */
  	PG_SETMASK(&UnBlockSig);
  
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
- 		bool		do_checkpoint = false;
- 		int			flags = 0;
- 		pg_time_t	now;
- 		int			elapsed_secs;
- 
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
--- 364,380 ----
  	 */
  	PG_SETMASK(&UnBlockSig);
  
+ 	BgWriterRecoveryMode = IsRecoveryProcessingMode();
+ 
+ 	if (BgWriterRecoveryMode)
+ 		elog(DEBUG1, "bgwriter starting during recovery, pid = %d",
+ 			(int) BgWriterShmem->bgwriter_pid);
+ 
  	/*
  	 * Loop forever
  	 */
  	for (;;)
  	{
  		/*
  		 * Emergency bailout if postmaster has died.  This is to avoid the
  		 * necessity for manual cleanup of all postmaster children.
***************
*** 372,499 **** BackgroundWriterMain(void)
  		if (!PostmasterIsAlive(true))
  			exit(1);
  
- 		/*
- 		 * Process any requests or signals received recently.
- 		 */
- 		AbsorbFsyncRequests();
- 
  		if (got_SIGHUP)
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
- 		if (checkpoint_requested)
- 		{
- 			checkpoint_requested = false;
- 			do_checkpoint = true;
- 			BgWriterStats.m_requested_checkpoints++;
- 		}
- 		if (shutdown_requested)
- 		{
- 			/*
- 			 * From here on, elog(ERROR) should end with exit(1), not send
- 			 * control back to the sigsetjmp block above
- 			 */
- 			ExitOnAnyError = true;
- 			/* Close down the database */
- 			ShutdownXLOG(0, 0);
- 			/* Normal exit from the bgwriter is here */
- 			proc_exit(0);		/* done */
- 		}
  
! 		/*
! 		 * Force a checkpoint if too much time has elapsed since the last one.
! 		 * Note that we count a timed checkpoint in stats only when this
! 		 * occurs without an external request, but we set the CAUSE_TIME flag
! 		 * bit even if there is also an external request.
! 		 */
! 		now = (pg_time_t) time(NULL);
! 		elapsed_secs = now - last_checkpoint_time;
! 		if (elapsed_secs >= CheckPointTimeout)
  		{
! 			if (!do_checkpoint)
! 				BgWriterStats.m_timed_checkpoints++;
! 			do_checkpoint = true;
! 			flags |= CHECKPOINT_CAUSE_TIME;
! 		}
  
! 		/*
! 		 * Do a checkpoint if requested, otherwise do one cycle of
! 		 * dirty-buffer writing.
! 		 */
! 		if (do_checkpoint)
! 		{
! 			/* use volatile pointer to prevent code rearrangement */
! 			volatile BgWriterShmemStruct *bgs = BgWriterShmem;
  
! 			/*
! 			 * Atomically fetch the request flags to figure out what kind of a
! 			 * checkpoint we should perform, and increase the started-counter
! 			 * to acknowledge that we've started a new checkpoint.
! 			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			flags |= bgs->ckpt_flags;
! 			bgs->ckpt_flags = 0;
! 			bgs->ckpt_started++;
! 			SpinLockRelease(&bgs->ckpt_lck);
  
! 			/*
! 			 * We will warn if (a) too soon since last checkpoint (whatever
! 			 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 			 * since the last checkpoint start.  Note in particular that this
! 			 * implementation will not generate warnings caused by
! 			 * CheckPointTimeout < CheckPointWarning.
! 			 */
! 			if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! 				elapsed_secs < CheckPointWarning)
! 				ereport(LOG,
! 						(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 								elapsed_secs),
! 						 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
  
  			/*
! 			 * Initialize bgwriter-private variables used during checkpoint.
  			 */
! 			ckpt_active = true;
! 			ckpt_start_recptr = GetInsertRecPtr();
! 			ckpt_start_time = now;
! 			ckpt_cached_elapsed = 0;
  
! 			/*
! 			 * Do the checkpoint.
! 			 */
! 			CreateCheckPoint(flags);
  
  			/*
! 			 * After any checkpoint, close all smgr files.	This is so we
! 			 * won't hang onto smgr references to deleted files indefinitely.
  			 */
! 			smgrcloseall();
  
  			/*
! 			 * Indicate checkpoint completion to any waiting backends.
  			 */
! 			SpinLockAcquire(&bgs->ckpt_lck);
! 			bgs->ckpt_done = bgs->ckpt_started;
! 			SpinLockRelease(&bgs->ckpt_lck);
  
! 			ckpt_active = false;
  
! 			/*
! 			 * Note we record the checkpoint start time not end time as
! 			 * last_checkpoint_time.  This is so that time-driven checkpoints
! 			 * happen at a predictable spacing.
! 			 */
! 			last_checkpoint_time = now;
  		}
- 		else
- 			BgBufferSync();
- 
- 		/* Check for archive_timeout and switch xlog files if necessary. */
- 		CheckArchiveTimeout();
- 
- 		/* Nap for the configured time. */
- 		BgWriterNap();
  	}
  }
  
--- 382,595 ----
  		if (!PostmasterIsAlive(true))
  			exit(1);
  
  		if (got_SIGHUP)
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
  		}
  
! 		if (BgWriterRecoveryMode)
  		{
! 			if (shutdown_requested)
! 			{
! 				/*
! 				 * From here on, elog(ERROR) should end with exit(1), not send
! 				 * control back to the sigsetjmp block above
! 				 */
! 				ExitOnAnyError = true;
! 				/* Normal exit from the bgwriter is here */
! 				proc_exit(0);		/* done */
! 			}
  
! 			if (!IsRecoveryProcessingMode())
! 			{
! 				elog(DEBUG2, "bgwriter changing from recovery to normal mode");
! 	  
! 				InitXLOGAccess();
! 				BgWriterRecoveryMode = false;
! 
! 				/*
! 				 * Start time-driven events from now
! 				 */
! 				last_checkpoint_time = last_xlog_switch_time = (pg_time_t) time(NULL);
! 
! 				/* 
! 				 * Notice that we do *not* act on a checkpoint_requested
! 				 * state at this point. We have changed mode, so we wish to
! 				 * perform a checkpoint not a restartpoint.
! 				 */
! 				continue;
! 			}
  
! 			if (checkpoint_requested)
! 			{
! 				XLogRecPtr		ReadPtr;
! 				CheckPoint		restartPoint;
! 
! 				checkpoint_requested = false;
! 
! 				/*
! 				 * Initialize bgwriter-private variables used during checkpoint.
! 				 */
! 				ckpt_active = true;
! 				ckpt_start_time = (pg_time_t) time(NULL);
! 				ckpt_cached_elapsed = 0;
! 
! 				/*
! 				 * Get the requested values from shared memory that the 
! 				 * Startup process has put there for us.
! 				 */
! 				SpinLockAcquire(&BgWriterShmem->ckpt_lck);
! 				ReadPtr = BgWriterShmem->ReadPtr;
! 				memcpy(&restartPoint, &BgWriterShmem->restartPoint, sizeof(CheckPoint));
! 				SpinLockRelease(&BgWriterShmem->ckpt_lck);
! 
! 				/* Use smoothed writes, until interrupted if ever */
! 				CreateRestartPoint(ReadPtr, &restartPoint, 0);
! 
! 				/*
! 				 * After any checkpoint, close all smgr files.	This is so we
! 				 * won't hang onto smgr references to deleted files indefinitely.
! 				 */
! 				smgrcloseall();
! 
! 				ckpt_active = false;
! 				checkpoint_requested = false;
! 			}
! 			else
! 			{
! 				/* Clean buffers dirtied by recovery */
! 				BgBufferSync();
  
! 				/* Nap for the configured time. */
! 				BgWriterNap();
! 			}
! 		}
! 		else	/* Normal processing */
! 		{
! 			bool		do_checkpoint = false;
! 			int			flags = 0;
! 			pg_time_t	now;
! 			int			elapsed_secs;
  
  			/*
! 			 * Process any requests or signals received recently.
  			 */
! 			AbsorbFsyncRequests();
  
! 			if (checkpoint_requested)
! 			{
! 				checkpoint_requested = false;
! 				do_checkpoint = true;
! 				BgWriterStats.m_requested_checkpoints++;
! 			}
! 			if (shutdown_requested)
! 			{
! 				/*
! 				 * From here on, elog(ERROR) should end with exit(1), not send
! 				 * control back to the sigsetjmp block above
! 				 */
! 				ExitOnAnyError = true;
! 				/* Close down the database */
! 				ShutdownXLOG(0, 0);
! 				/* Normal exit from the bgwriter is here */
! 				proc_exit(0);		/* done */
! 			}
  
  			/*
! 			 * Force a checkpoint if too much time has elapsed since the last one.
! 			 * Note that we count a timed checkpoint in stats only when this
! 			 * occurs without an external request, but we set the CAUSE_TIME flag
! 			 * bit even if there is also an external request.
  			 */
! 			now = (pg_time_t) time(NULL);
! 			elapsed_secs = now - last_checkpoint_time;
! 			if (elapsed_secs >= CheckPointTimeout)
! 			{
! 				if (!do_checkpoint)
! 					BgWriterStats.m_timed_checkpoints++;
! 				do_checkpoint = true;
! 				flags |= CHECKPOINT_CAUSE_TIME;
! 			}
  
  			/*
! 			 * Do a checkpoint if requested, otherwise do one cycle of
! 			 * dirty-buffer writing.
  			 */
! 			if (do_checkpoint)
! 			{
! 				/* use volatile pointer to prevent code rearrangement */
! 				volatile BgWriterShmemStruct *bgs = BgWriterShmem;
! 
! 				/*
! 				 * Atomically fetch the request flags to figure out what kind of a
! 				 * checkpoint we should perform, and increase the started-counter
! 				 * to acknowledge that we've started a new checkpoint.
! 				 */
! 				SpinLockAcquire(&bgs->ckpt_lck);
! 				flags |= bgs->ckpt_flags;
! 				bgs->ckpt_flags = 0;
! 				bgs->ckpt_started++;
! 				SpinLockRelease(&bgs->ckpt_lck);
! 
! 				/*
! 				 * We will warn if (a) too soon since last checkpoint (whatever
! 				 * caused it) and (b) somebody set the CHECKPOINT_CAUSE_XLOG flag
! 				 * since the last checkpoint start.  Note in particular that this
! 				 * implementation will not generate warnings caused by
! 				 * CheckPointTimeout < CheckPointWarning.
! 				 */
! 				if ((flags & CHECKPOINT_CAUSE_XLOG) &&
! 					elapsed_secs < CheckPointWarning)
! 					ereport(LOG,
! 							(errmsg("checkpoints are occurring too frequently (%d seconds apart)",
! 									elapsed_secs),
! 							 errhint("Consider increasing the configuration parameter \"checkpoint_segments\".")));
! 
! 				/*
! 				 * Initialize bgwriter-private variables used during checkpoint.
! 				 */
! 				ckpt_active = true;
! 				ckpt_start_recptr = GetInsertRecPtr();
! 				ckpt_start_time = now;
! 				ckpt_cached_elapsed = 0;
! 
! 				/*
! 				 * Do the checkpoint.
! 				 */
! 				CreateCheckPoint(flags);
! 
! 				/*
! 				 * After any checkpoint, close all smgr files.	This is so we
! 				 * won't hang onto smgr references to deleted files indefinitely.
! 				 */
! 				smgrcloseall();
! 
! 				/*
! 				 * Indicate checkpoint completion to any waiting backends.
! 				 */
! 				SpinLockAcquire(&bgs->ckpt_lck);
! 				bgs->ckpt_done = bgs->ckpt_started;
! 				SpinLockRelease(&bgs->ckpt_lck);
! 
! 				ckpt_active = false;
! 
! 				/*
! 				 * Note we record the checkpoint start time not end time as
! 				 * last_checkpoint_time.  This is so that time-driven checkpoints
! 				 * happen at a predictable spacing.
! 				 */
! 				last_checkpoint_time = now;
! 			}
! 			else
! 				BgBufferSync();
  
! 			/* Check for archive_timeout and switch xlog files if necessary. */
! 			CheckArchiveTimeout();
  
! 			/* Nap for the configured time. */
! 			BgWriterNap();
  		}
  	}
  }
  
***************
*** 586,592 **** BgWriterNap(void)
  		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
! 		AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
--- 682,689 ----
  		(ckpt_active ? ImmediateCheckpointRequested() : checkpoint_requested))
  			break;
  		pg_usleep(1000000L);
! 		if (!IsRecoveryProcessingMode())
! 			AbsorbFsyncRequests();
  		udelay -= 1000000L;
  	}
  
***************
*** 640,645 **** CheckpointWriteDelay(int flags, double progress)
--- 737,755 ----
  	if (!am_bg_writer)
  		return;
  
+ 	/* Perform minimal duties during recovery and skip wait if requested */
+ 	if (IsRecoveryProcessingMode())
+ 	{
+ 		BgBufferSync();
+ 
+ 		if (!shutdown_requested &&
+ 			!checkpoint_requested &&
+ 			IsCheckpointOnSchedule(progress))
+ 			BgWriterNap();
+ 
+ 		return;
+ 	}
+ 
  	/*
  	 * Perform the usual bgwriter duties and take a nap, unless we're behind
  	 * schedule, in which case we just try to catch up as quickly as possible.
***************
*** 714,729 **** IsCheckpointOnSchedule(double progress)
  	 * However, it's good enough for our purposes, we're only calculating an
  	 * estimate anyway.
  	 */
! 	recptr = GetInsertRecPtr();
! 	elapsed_xlogs =
! 		(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! 		 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! 		CheckPointSegments;
! 
! 	if (progress < elapsed_xlogs)
  	{
! 		ckpt_cached_elapsed = elapsed_xlogs;
! 		return false;
  	}
  
  	/*
--- 824,842 ----
  	 * However, it's good enough for our purposes, we're only calculating an
  	 * estimate anyway.
  	 */
! 	if (!IsRecoveryProcessingMode())
  	{
! 		recptr = GetInsertRecPtr();
! 		elapsed_xlogs =
! 			(((double) (int32) (recptr.xlogid - ckpt_start_recptr.xlogid)) * XLogSegsPerFile +
! 			 ((double) recptr.xrecoff - (double) ckpt_start_recptr.xrecoff) / XLogSegSize) /
! 			CheckPointSegments;
! 
! 		if (progress < elapsed_xlogs)
! 		{
! 			ckpt_cached_elapsed = elapsed_xlogs;
! 			return false;
! 		}
  	}
  
  	/*
***************
*** 989,994 **** RequestCheckpoint(int flags)
--- 1102,1180 ----
  }
  
  /*
+  * Always runs in Startup process (see xlog.c)
+  */
+ void
+ RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter)
+ {
+ 	/*
+ 	 * Should we just do it ourselves?
+ 	 */
+ 	if (!IsPostmasterEnvironment || !sendToBGWriter)
+ 	{
+ 		CreateRestartPoint(ReadPtr, restartPoint, CHECKPOINT_IMMEDIATE);
+ 		return;
+ 	}
+ 
+ 	/*
+ 	 * Push requested values into shared memory, then signal to request restartpoint.
+ 	 */
+ 	if (BgWriterShmem->bgwriter_pid == 0)
+ 		elog(LOG, "could not request restartpoint because bgwriter not running");
+ 
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	BgWriterShmem->ReadPtr = ReadPtr;
+ 	memcpy(&BgWriterShmem->restartPoint, restartPoint, sizeof(CheckPoint));
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	if (kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ 		elog(LOG, "could not signal for restartpoint: %m");	
+ }
+ 
+ /* 
+  * Sends another checkpoint request signal to bgwriter, which causes it
+  * to avoid smoothed writes and continue processing as if it had been
+  * called with CHECKPOINT_IMMEDIATE. This is used at the end of recovery.
+  */
+ void
+ RequestRestartPointCompletion(void)
+ {
+ 	if (BgWriterShmem->bgwriter_pid != 0 &&
+ 		kill(BgWriterShmem->bgwriter_pid, SIGINT) != 0)
+ 		elog(LOG, "could not signal for restartpoint immediate: %m");
+ }
+ 
+ XLogRecPtr
+ GetRedoLocationForArchiveCheckpoint(void)
+ {
+ 	XLogRecPtr	redo;
+ 
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	redo = BgWriterShmem->ReadPtr;
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	return redo;
+ }
+ 
+ /* 
+  * Store the information needed for a checkpoint at the end of recovery.
+  * Returns true if bgwriter can perform checkpoint, or false if bgwriter
+  * not active or otherwise unable to comply.
+  */
+ bool
+ SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo)
+ {
+ 	SpinLockAcquire(&BgWriterShmem->ckpt_lck);
+ 	BgWriterShmem->ReadPtr = redo;
+ 	SpinLockRelease(&BgWriterShmem->ckpt_lck);
+ 
+ 	if (BgWriterShmem->bgwriter_pid == 0 || !IsPostmasterEnvironment)
+ 		return false;
+ 
+ 	return true;
+ }
+ 
+ /*
   * ForwardFsyncRequest
   *		Forward a file-fsync request from a backend to the bgwriter
   *
*** src/backend/postmaster/postmaster.c
--- src/backend/postmaster/postmaster.c
***************
*** 230,237 **** static bool FatalError = false; /* T if recovering from backend crash */
   * We use a simple state machine to control startup, shutdown, and
   * crash recovery (which is rather like shutdown followed by startup).
   *
!  * Normal child backends can only be launched when we are in PM_RUN state.
!  * (We also allow it in PM_WAIT_BACKUP state, but only for superusers.)
   * In other states we handle connection requests by launching "dead_end"
   * child processes, which will simply send the client an error message and
   * quit.  (We track these in the BackendList so that we can know when they
--- 230,239 ----
   * We use a simple state machine to control startup, shutdown, and
   * crash recovery (which is rather like shutdown followed by startup).
   *
!  * Normal child backends can only be launched when we are in PM_RUN or
!  * PM_RECOVERY state. Any transaction started in PM_RECOVERY state will
!  * be read-only for the whole of its life.  (We also allow launch of normal
!  * child backends in PM_WAIT_BACKUP state, but only for superusers.)
   * In other states we handle connection requests by launching "dead_end"
   * child processes, which will simply send the client an error message and
   * quit.  (We track these in the BackendList so that we can know when they
***************
*** 254,259 **** typedef enum
--- 256,266 ----
  {
  	PM_INIT,					/* postmaster starting */
  	PM_STARTUP,					/* waiting for startup subprocess */
+ 	PM_RECOVERY,				/* consistent recovery mode; state only
+ 								 * entered for archive and streaming recovery,
+ 								 * and only after the point where
+ 								 * all data is in a consistent state.
+ 								 */
  	PM_RUN,						/* normal "database is alive" state */
  	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
  	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
***************
*** 1302,1308 **** ServerLoop(void)
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 && pmState == PM_RUN)
  			BgWriterPID = StartBackgroundWriter();
  
  		/*
--- 1309,1315 ----
  		 * state that prevents it, start one.  It doesn't matter if this
  		 * fails, we'll just try again later.
  		 */
! 		if (BgWriterPID == 0 && (pmState == PM_RUN || pmState == PM_RECOVERY))
  			BgWriterPID = StartBackgroundWriter();
  
  		/*
***************
*** 1651,1661 **** retry1:
  					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
  					 errmsg("the database system is shutting down")));
  			break;
- 		case CAC_RECOVERY:
- 			ereport(FATAL,
- 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
- 					 errmsg("the database system is in recovery mode")));
- 			break;
  		case CAC_TOOMANY:
  			ereport(FATAL,
  					(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
--- 1658,1663 ----
***************
*** 1664,1669 **** retry1:
--- 1666,1672 ----
  		case CAC_WAITBACKUP:
  			/* OK for now, will check in InitPostgres */
  			break;
+ 		case CAC_RECOVERY:
  		case CAC_OK:
  			break;
  	}
***************
*** 1982,1991 **** pmdie(SIGNAL_ARGS)
  			ereport(LOG,
  					(errmsg("received smart shutdown request")));
  
! 			if (pmState == PM_RUN)
  			{
  				/* autovacuum workers are told to shut down immediately */
! 				SignalAutovacWorkers(SIGTERM);
  				/* and the autovac launcher too */
  				if (AutoVacPID != 0)
  					signal_child(AutoVacPID, SIGTERM);
--- 1985,1995 ----
  			ereport(LOG,
  					(errmsg("received smart shutdown request")));
  
! 			if (pmState == PM_RUN || pmState == PM_RECOVERY)
  			{
  				/* autovacuum workers are told to shut down immediately */
! 				if (pmState == PM_RUN)
! 					SignalAutovacWorkers(SIGTERM);
  				/* and the autovac launcher too */
  				if (AutoVacPID != 0)
  					signal_child(AutoVacPID, SIGTERM);
***************
*** 2019,2025 **** pmdie(SIGNAL_ARGS)
  
  			if (StartupPID != 0)
  				signal_child(StartupPID, SIGTERM);
! 			if (pmState == PM_RUN || pmState == PM_WAIT_BACKUP)
  			{
  				ereport(LOG,
  						(errmsg("aborting any active transactions")));
--- 2023,2029 ----
  
  			if (StartupPID != 0)
  				signal_child(StartupPID, SIGTERM);
! 			if (pmState == PM_RUN || pmState == PM_RECOVERY || pmState == PM_WAIT_BACKUP)
  			{
  				ereport(LOG,
  						(errmsg("aborting any active transactions")));
***************
*** 2115,2122 **** reaper(SIGNAL_ARGS)
  		 */
  		if (pid == StartupPID)
  		{
  			StartupPID = 0;
! 			Assert(pmState == PM_STARTUP);
  
  			/* FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
--- 2119,2129 ----
  		 */
  		if (pid == StartupPID)
  		{
+ 			bool	leavingRecovery = (pmState == PM_RECOVERY);
+ 
  			StartupPID = 0;
! 			Assert(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
! 				   pmState == PM_WAIT_BACKUP || pmState == PM_WAIT_BACKENDS);
  
  			/* FATAL exit of startup is treated as catastrophic */
  			if (!EXIT_STATUS_0(exitstatus))
***************
*** 2124,2130 **** reaper(SIGNAL_ARGS)
  				LogChildExit(LOG, _("startup process"),
  							 pid, exitstatus);
  				ereport(LOG,
! 				(errmsg("aborting startup due to startup process failure")));
  				ExitPostmaster(1);
  			}
  
--- 2131,2137 ----
  				LogChildExit(LOG, _("startup process"),
  							 pid, exitstatus);
  				ereport(LOG,
! 						(errmsg("aborting startup due to startup process failure")));
  				ExitPostmaster(1);
  			}
  
***************
*** 2157,2166 **** reaper(SIGNAL_ARGS)
  			load_role();
  
  			/*
! 			 * Crank up the background writer.	It doesn't matter if this
! 			 * fails, we'll just try again later.
  			 */
! 			Assert(BgWriterPID == 0);
  			BgWriterPID = StartBackgroundWriter();
  
  			/*
--- 2164,2173 ----
  			load_role();
  
  			/*
! 			 * Check whether we need to start background writer, if not
! 			 * already running.
  			 */
! 			if (BgWriterPID == 0)
  			BgWriterPID = StartBackgroundWriter();
  
  			/*
***************
*** 2177,2184 **** reaper(SIGNAL_ARGS)
  				PgStatPID = pgstat_start();
  
  			/* at this point we are really open for business */
! 			ereport(LOG,
! 				 (errmsg("database system is ready to accept connections")));
  
  			continue;
  		}
--- 2184,2195 ----
  				PgStatPID = pgstat_start();
  
  			/* at this point we are really open for business */
! 			if (leavingRecovery)
! 				ereport(LOG,
! 					 (errmsg("database can now be accessed with read and write transactions")));
! 			else
! 				ereport(LOG,
! 					 (errmsg("database system is ready to accept connections")));
  
  			continue;
  		}
***************
*** 2898,2904 **** BackendStartup(Port *port)
  	bn->pid = pid;
  	bn->cancel_key = MyCancelKey;
  	bn->is_autovacuum = false;
! 	bn->dead_end = (port->canAcceptConnections != CAC_OK &&
  					port->canAcceptConnections != CAC_WAITBACKUP);
  	DLAddHead(BackendList, DLNewElem(bn));
  #ifdef EXEC_BACKEND
--- 2909,2916 ----
  	bn->pid = pid;
  	bn->cancel_key = MyCancelKey;
  	bn->is_autovacuum = false;
! 	bn->dead_end = (!(port->canAcceptConnections == CAC_RECOVERY || 
! 					  port->canAcceptConnections == CAC_OK) &&
  					port->canAcceptConnections != CAC_WAITBACKUP);
  	DLAddHead(BackendList, DLNewElem(bn));
  #ifdef EXEC_BACKEND
***************
*** 3845,3850 **** sigusr1_handler(SIGNAL_ARGS)
--- 3857,3910 ----
  
  	PG_SETMASK(&BlockSig);
  
+ 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+ 	{
+ 		Assert(pmState == PM_STARTUP);
+ 
+ 		/*
+ 		 * Go to shutdown mode if a shutdown request was pending.
+ 		 */
+ 		if (Shutdown > NoShutdown)
+ 		{
+ 			pmState = PM_WAIT_BACKENDS;
+ 			/* PostmasterStateMachine logic does the rest */
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * Startup process has entered recovery
+ 			 */
+ 			pmState = PM_RECOVERY;
+ 
+ 			/*
+ 			 * Load the flat authorization file into postmaster's cache. The
+ 			 * startup process won't have recomputed this from the database
+ 			 * yet, so it may change following recovery. We'll reload it
+ 			 * after the startup process ends.
+ 			 */
+ 			load_role();
+ 
+ 			/*
+ 			 * Crank up the background writer.	It doesn't matter if this
+ 			 * fails, we'll just try again later.
+ 			 */
+ 			Assert(BgWriterPID == 0);
+ 			BgWriterPID = StartBackgroundWriter();
+ 
+ 			/*
+ 			 * Likewise, start other special children as needed.
+ 			 */
+ 			Assert(PgStatPID == 0);
+ 			PgStatPID = pgstat_start();
+ 
+ 			/* We can now accept read-only connections */
+ 			ereport(LOG,
+ 				 (errmsg("database system is ready to accept connections")));
+ 			ereport(LOG,
+ 				 (errmsg("database can now be accessed with read only transactions")));
+ 		}
+ 	}
+ 
  	if (CheckPostmasterSignal(PMSIGNAL_PASSWORD_CHANGE))
  	{
  		/*
*** src/backend/utils/init/flatfiles.c
--- src/backend/utils/init/flatfiles.c
***************
*** 678,686 **** write_auth_file(Relation rel_authid, Relation rel_authmem)
  /*
   * This routine is called once during database startup, after completing
   * WAL replay if needed.  Its purpose is to sync the flat files with the
!  * current state of the database tables.  This is particularly important
!  * during PITR operation, since the flat files will come from the
!  * base backup which may be far out of sync with the current state.
   *
   * In theory we could skip rebuilding the flat files if no WAL replay
   * occurred, but it seems best to just do it always.  We have to
--- 678,687 ----
  /*
   * This routine is called once during database startup, after completing
   * WAL replay if needed.  Its purpose is to sync the flat files with the
!  * current state of the database tables.  
!  *
!  * In 8.4 we also run this during xact_redo_commit() if the transaction
!  * wrote a new database or auth flat file. 
   *
   * In theory we could skip rebuilding the flat files if no WAL replay
   * occurred, but it seems best to just do it always.  We have to
***************
*** 716,723 **** BuildFlatFiles(bool database_only)
  	/*
  	 * We don't have any hope of running a real relcache, but we can use the
  	 * same fake-relcache facility that WAL replay uses.
- 	 *
- 	 * No locking is needed because no one else is alive yet.
  	 */
  	rel_db = CreateFakeRelcacheEntry(rnode);
  	write_database_file(rel_db, true);
--- 717,722 ----
***************
*** 832,845 **** AtEOXact_UpdateFlatFiles(bool isCommit)
  	/* Okay to write the files */
  	if (database_file_update_subid != InvalidSubTransactionId)
  	{
! 		database_file_update_subid = InvalidSubTransactionId;
  		write_database_file(drel, false);
  		heap_close(drel, NoLock);
  	}
  
  	if (auth_file_update_subid != InvalidSubTransactionId)
  	{
! 		auth_file_update_subid = InvalidSubTransactionId;
  		write_auth_file(arel, mrel);
  		heap_close(arel, NoLock);
  		heap_close(mrel, NoLock);
--- 831,844 ----
  	/* Okay to write the files */
  	if (database_file_update_subid != InvalidSubTransactionId)
  	{
! 		/* reset database_file_update_subid later during commit */
  		write_database_file(drel, false);
  		heap_close(drel, NoLock);
  	}
  
  	if (auth_file_update_subid != InvalidSubTransactionId)
  	{
! 		/* reset auth_file_update_subid later during commit */
  		write_auth_file(arel, mrel);
  		heap_close(arel, NoLock);
  		heap_close(mrel, NoLock);
*** src/include/access/xact.h
--- src/include/access/xact.h
***************
*** 17,22 ****
--- 17,23 ----
  #include "access/xlog.h"
  #include "nodes/pg_list.h"
  #include "storage/relfilenode.h"
+ #include "utils/snapshot.h"
  #include "utils/timestamp.h"
  
  
***************
*** 84,111 **** typedef void (*SubXactCallback) (SubXactEvent event, SubTransactionId mySubid,
  #define XLOG_XACT_ABORT				0x20
  #define XLOG_XACT_COMMIT_PREPARED	0x30
  #define XLOG_XACT_ABORT_PREPARED	0x40
  
  typedef struct xl_xact_commit
  {
! 	TimestampTz xact_time;		/* time of commit */
! 	int			nrels;			/* number of RelFileNodes */
! 	int			nsubxacts;		/* number of subtransaction XIDs */
! 	/* Array of RelFileNode(s) to drop at commit */
! 	RelFileNode	xnodes[1];		/* VARIABLE LENGTH ARRAY */
! 	/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_commit;
  
  #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)
  
  typedef struct xl_xact_abort
  {
  	TimestampTz xact_time;		/* time of abort */
  	int			nrels;			/* number of RelFileNodes */
  	int			nsubxacts;		/* number of subtransaction XIDs */
  	/* Array of RelFileNode(s) to drop at abort */
  	RelFileNode	xnodes[1];		/* VARIABLE LENGTH ARRAY */
  	/* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_abort;
  
  #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)
  
--- 85,162 ----
  #define XLOG_XACT_ABORT				0x20
  #define XLOG_XACT_COMMIT_PREPARED	0x30
  #define XLOG_XACT_ABORT_PREPARED	0x40
+ #define XLOG_XACT_ASSIGNMENT		0x50
+ #define XLOG_XACT_RUNNING_XACTS		0x60
+ /* 0x70 can also be used, if required */
+ 
+ typedef struct xl_xact_assignment
+ {
+ 	TransactionId	xassign;	/* assigned xid */
+ 	TransactionId	xparent;	/* assigned xid's parent, if any */
+ 	bool			isSubXact;	/* is a subtransaction */
+ 	int				slotId;		/* slotId in procarray */
+ } xl_xact_assignment;
+ 
+ /* 
+  * xl_xact_running_xacts is in utils/snapshot.h so it can be passed
+  * around to the same places as snapshots. Not snapmgr.h
+  */
  
  typedef struct xl_xact_commit
  {
!   	TimestampTz xact_time;		/* time of commit */
!  	int			slotId;			/* slotId in procarray */
!  	uint		xinfo;			/* info flags */
!   	int			nrels;			/* number of RelFileForks */
!   	int			nsubxacts;		/* number of subtransaction XIDs */
! 	int			nmsgs;			/* number of shared inval msgs */
!   	/* Array of RelFileFork(s) to drop at commit */
!   	RelFileNode	xnodes[1];		/* VARIABLE LENGTH ARRAY */
!   	/* ARRAY OF COMMITTED SUBTRANSACTION XIDs FOLLOWS */
! 	/* ARRAY OF SHARED INVALIDATION MESSAGES FOLLOWS */
  } xl_xact_commit;
  
  #define MinSizeOfXactCommit offsetof(xl_xact_commit, xnodes)
+ #define OffsetSharedInvalInXactCommit() \
+ ( \
+ 	MinSizeOfXactCommit +  \
+ 	(xlrec->nsubxacts * sizeof(TransactionId)) + \
+ 	(xlrec->nrels * sizeof(RelFileNode)) \
+ )
+ 
+ /*
+  * These flags are set in the xinfo fields of transaction
+  * completion WAL records. They indicate a number of actions
+  * that need to occur when emulating transaction completion.
+  * They are named XactCompletion... to differentiate them from
+  * EOXact... routines which run at the end of the original
+  * transaction completion.
+  */
+ #define XACT_COMPLETION_UNMARKED_SUBXIDS		0x01
+ 
+ /* These next states only occur on commit record types */
+ #define XACT_COMPLETION_UPDATE_DB_FILE			0x02
+ #define XACT_COMPLETION_UPDATE_AUTH_FILE		0x04
+ #define XACT_COMPLETION_UPDATE_RELCACHE_FILE	0x08
+ 
+ /* Access macros for above flags */
+ #define XactCompletionHasUnMarkedSubxids(xlrec)		((xlrec)->xinfo & XACT_COMPLETION_UNMARKED_SUBXIDS)
+ #define XactCompletionUpdateDBFile(xlrec) 			((xlrec)->xinfo & XACT_COMPLETION_UPDATE_DB_FILE)
+ #define XactCompletionUpdateAuthFile(xlrec) 		((xlrec)->xinfo & XACT_COMPLETION_UPDATE_AUTH_FILE)
+ #define XactCompletionRelcacheInitFileInval(xlrec)	((xlrec)->xinfo & XACT_COMPLETION_UPDATE_RELCACHE_FILE)
  
  typedef struct xl_xact_abort
  {
  	TimestampTz xact_time;		/* time of abort */
+ 	int			slotId;			/* slotId in procarray */
+ 	uint		xinfo;			/* info flags */
  	int			nrels;			/* number of RelFileNodes */
  	int			nsubxacts;		/* number of subtransaction XIDs */
  	/* Array of RelFileNode(s) to drop at abort */
  	RelFileNode	xnodes[1];		/* VARIABLE LENGTH ARRAY */
  	/* ARRAY OF ABORTED SUBTRANSACTION XIDs FOLLOWS */
  } xl_xact_abort;
+ /* Note the intentional lack of an invalidation message array c.f. commit */
  
  #define MinSizeOfXactAbort offsetof(xl_xact_abort, xnodes)
  
***************
*** 185,190 **** extern TransactionId RecordTransactionCommit(void);
--- 236,252 ----
  
  extern int	xactGetCommittedChildren(TransactionId **ptr);
  
+ extern void LogCurrentRunningXacts(void);
+ extern bool IsRunningXactDataValid(void);
+ extern void GetStandbyInfoForTransaction(RmgrId rmid, uint8 info,
+ 							XLogRecData *rdata,
+ 							TransactionId *xid2, 
+ 							uint16 *info2);
+ 
+ extern void InitRecoveryTransactionEnvironment(void);
+ extern void XactResolveRecoveryConflicts(TransactionId latestRemovedXid, Oid recDatabaseOid);
+ extern void RecordKnownAssignedTransactionIds(XLogRecPtr lsn, XLogRecord *record);
+ 
  extern void xact_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xact_desc(StringInfo buf, uint8 xl_info, char *rec);
  
*** src/include/access/xlog.h
--- src/include/access/xlog.h
***************
*** 133,139 **** typedef struct XLogRecData
  } XLogRecData;
  
  extern TimeLineID ThisTimeLineID;		/* current TLI */
! extern bool InRecovery;
  extern XLogRecPtr XactLastRecEnd;
  
  /* these variables are GUC parameters related to XLOG */
--- 133,147 ----
  } XLogRecData;
  
  extern TimeLineID ThisTimeLineID;		/* current TLI */
! /* 
!  * Prior to 8.4, all activity during recovery was carried out by the Startup
!  * process. This local variable continues to be used in many parts of the
!  * code to indicate actions taken by RecoveryManagers. Other processes that
!  * potentially perform work during recovery should check
!  * IsRecoveryProcessingMode(), see XLogCtl notes in xlog.c
!  */
! extern bool InRecovery;	
! extern bool InArchiveRecovery;
  extern XLogRecPtr XactLastRecEnd;
  
  /* these variables are GUC parameters related to XLOG */
***************
*** 166,171 **** extern bool XLOG_DEBUG;
--- 174,180 ----
  /* These indicate the cause of a checkpoint request */
  #define CHECKPOINT_CAUSE_XLOG	0x0010	/* XLOG consumption */
  #define CHECKPOINT_CAUSE_TIME	0x0020	/* Elapsed time */
+ #define CHECKPOINT_RESTARTPOINT	0x0040	/* Restartpoint during recovery */
  
  /* Checkpoint statistics */
  typedef struct CheckpointStatsData
***************
*** 197,202 **** extern void XLogSetAsyncCommitLSN(XLogRecPtr record);
--- 206,214 ----
  extern void xlog_redo(XLogRecPtr lsn, XLogRecord *record);
  extern void xlog_desc(StringInfo buf, uint8 xl_info, char *rec);
  
+ extern bool IsRecoveryProcessingMode(void);
+ 
+ 
  extern void UpdateControlFile(void);
  extern Size XLOGShmemSize(void);
  extern void XLOGShmemInit(void);
*** src/include/access/xlog_internal.h
--- src/include/access/xlog_internal.h
***************
*** 17,22 ****
--- 17,23 ----
  #define XLOG_INTERNAL_H
  
  #include "access/xlog.h"
+ #include "catalog/pg_control.h"
  #include "fmgr.h"
  #include "pgtime.h"
  #include "storage/block.h"
***************
*** 245,250 **** extern const RmgrData RmgrTable[];
--- 246,254 ----
  extern pg_time_t GetLastSegSwitchTime(void);
  extern XLogRecPtr RequestXLogSwitch(void);
  
+ extern void CreateRestartPoint(const XLogRecPtr ReadPtr, 
+ 				const CheckPoint *restartPoint, int flags);
+ 
  /*
   * These aren't in xlog.h because I'd rather not include fmgr.h there.
   */
*** src/include/catalog/pg_control.h
--- src/include/catalog/pg_control.h
***************
*** 46,51 **** typedef struct CheckPoint
--- 46,52 ----
  #define XLOG_NOOP						0x20
  #define XLOG_NEXTOID					0x30
  #define XLOG_SWITCH						0x40
+ #define XLOG_RECOVERY_END				0x50
  
  
  /* System status indicator */
***************
*** 102,107 **** typedef struct ControlFileData
--- 103,109 ----
  	CheckPoint	checkPointCopy; /* copy of last check point record */
  
  	XLogRecPtr	minRecoveryPoint;		/* must replay xlog to here */
+ 	XLogRecPtr	minSafeStartPoint;		/* safe point after recovery crashes */
  
  	/*
  	 * This data is used to check for hardware-architecture compatibility of
*** src/include/postmaster/bgwriter.h
--- src/include/postmaster/bgwriter.h
***************
*** 12,17 ****
--- 12,18 ----
  #ifndef _BGWRITER_H
  #define _BGWRITER_H
  
+ #include "catalog/pg_control.h"
  #include "storage/block.h"
  #include "storage/relfilenode.h"
  
***************
*** 25,30 **** extern double CheckPointCompletionTarget;
--- 26,36 ----
  extern void BackgroundWriterMain(void);
  
  extern void RequestCheckpoint(int flags);
+ extern void RequestRestartPoint(const XLogRecPtr ReadPtr, const CheckPoint *restartPoint, bool sendToBGWriter);
+ extern void RequestRestartPointCompletion(void);
+ extern XLogRecPtr GetRedoLocationForArchiveCheckpoint(void);
+ extern bool SetRedoLocationForArchiveCheckpoint(XLogRecPtr redo);
+ 
  extern void CheckpointWriteDelay(int flags, double progress);
  
  extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
*** src/include/storage/pmsignal.h
--- src/include/storage/pmsignal.h
***************
*** 22,27 ****
--- 22,28 ----
   */
  typedef enum
  {
+ 	PMSIGNAL_RECOVERY_START,	/* move to PM_RECOVERY state */
  	PMSIGNAL_PASSWORD_CHANGE,	/* pg_auth file has changed */
  	PMSIGNAL_WAKEN_ARCHIVER,	/* send a NOTIFY signal to xlog archiver */
  	PMSIGNAL_ROTATE_LOGFILE,	/* send SIGUSR1 to syslogger to rotate logfile */
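
As an aside on the xact.h hunk above, the xinfo completion flags can be tried out in isolation. The following is only a minimal sketch under stated assumptions: the flag values are copied from the patch, but demo_xact_record, demo_build_xinfo() and demo_redo_needs_flatfile_rebuild() are hypothetical names made up for illustration and do not exist in the patch. It shows one plausible way the commit side could set the bits and the redo side could test them before rebuilding flat files.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values copied from the xact.h hunk of the patch. */
#define XACT_COMPLETION_UNMARKED_SUBXIDS		0x01
#define XACT_COMPLETION_UPDATE_DB_FILE			0x02
#define XACT_COMPLETION_UPDATE_AUTH_FILE		0x04
#define XACT_COMPLETION_UPDATE_RELCACHE_FILE	0x08

/* Hypothetical cut-down record: only the xinfo field is modelled. */
typedef struct demo_xact_record
{
	uint32_t	xinfo;			/* info flags */
} demo_xact_record;

/*
 * Hypothetical helper: collect the flags a committing transaction
 * would want to pass on to whoever replays its commit record.
 */
uint32_t
demo_build_xinfo(bool wrote_db_file, bool wrote_auth_file)
{
	uint32_t	xinfo = 0;

	if (wrote_db_file)
		xinfo |= XACT_COMPLETION_UPDATE_DB_FILE;
	if (wrote_auth_file)
		xinfo |= XACT_COMPLETION_UPDATE_AUTH_FILE;
	return xinfo;
}

/*
 * Hypothetical helper: during redo, decide whether either flat file
 * has to be rebuilt for this commit record.
 */
bool
demo_redo_needs_flatfile_rebuild(const demo_xact_record *xlrec)
{
	return (xlrec->xinfo & (XACT_COMPLETION_UPDATE_DB_FILE |
							XACT_COMPLETION_UPDATE_AUTH_FILE)) != 0;
}
```

A record built with demo_build_xinfo(true, false) carries only the 0x02 bit, so the redo-side check returns true; a record built with both arguments false carries no bits and the check returns false.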

#22

Fujii Masao

masao.fujii@gmail.com

about 17 years ago

In reply to: Heikki Linnakangas (#21)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Tue, Dec 23, 2008 at 5:18 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Simon Riggs wrote:

On Wed, 2008-12-17 at 23:32 -0300, Alvaro Herrera wrote:

Simon Riggs escribió:

Please let me know how I can make the reviewer's job easier. Diagrams,
writeups, whatever. Thanks,

A link perhaps?

There is much confusion on this point for which I'm very sorry.

I originally wrote the "infra" patch to allow it to be committed separately
in the Sept commitfest, to reduce the size of the forthcoming hotstandby
patch. That didn't happen (no moans there) so the eventual "hotstandby"
patch includes all of what was the infra patch, plus the new code.

So currently there is no separate "infra" patch. The two line items on
the CommitFest page are really just one large project. I would be in
favour of removing the "infra" lines from the CommitFest page.

I think it's useful to review the "infra" part of the patch separately, so I
split it out of the big patch again. I haven't looked at this in detail yet,
but it compiles and passes regression tests.

Super! I will fix the synch rep code based on this patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#23

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Heikki Linnakangas (#21)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Mon, 2008-12-22 at 22:18 +0200, Heikki Linnakangas wrote:

I think it's useful to review the "infra" part of the patch separately,
so I split it out of the big patch again. I haven't looked at this in
detail yet, but it compiles and passes regression tests.

OK, thanks, much appreciated.

The patches were fairly distinct in their features, though choosing an
exact split line could be done in a number of ways.

I'll look through this in more detail today and get back to you.

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support

#24

Fujii Masao

masao.fujii@gmail.com

about 17 years ago

In reply to: Simon Riggs (#12)

Re: [PATCHES] Infrastructure changes for recovery (v8)

Hi,

On Tue, Nov 18, 2008 at 12:39 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, 2008-11-17 at 16:18 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
@@ -3845,6 +3850,52 @@ sigusr1_handler(SIGNAL_ARGS)

PG_SETMASK(&BlockSig);
+       if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_START))
+       {
+               Assert(pmState == PM_STARTUP);
+
+               /*
+                * Go to shutdown mode if a shutdown request was pending.
+                */
+               if (Shutdown > NoShutdown)
+               {
+                       pmState = PM_WAIT_BACKENDS;
+                       /* PostmasterStateMachine logic does the rest */
+               }
+               else
+               {
+                       /*
+                        * Startup process has entered recovery
+                        */
+                       pmState = PM_RECOVERY;

Hmm, I smell a race condition here:

1. Startup process goes into consistent state, and signals postmaster
2. Startup process finishes WAL replay and dies
3. Postmaster wakes up in reaper(), noting that the startup process
dies, and goes into PM_RUN mode.
4. The above signal handler for postmaster is run, causing an assertion
failure, or putting postmaster back into PM_RECOVERY mode if assertions
are disabled.

Highly unlikely in practice, given how much code needs to run in the
startup process between signaling the postmaster and exiting, but it
seems theoretically possible. Do we care, and if we do, how can we fix it?

Might be possible - it does depend on the sequence of actions, it's true.
Agreed, not likely to happen, except as the result of another bug.

I'll change it to a test for

    if (pmState == PM_STARTUP)
        pmState = PM_RECOVERY;

The assertion was mainly for documentation, it's not protecting anything
critical (IIRC).

This seems to have not been fixed yet in the latest patch.

http://archives.postgresql.org/message-id/494FF631.90908@enterprisedb.com
recovery-infra-separated-again-1.patch
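
For reference, here is a minimal standalone sketch of the guarded transition discussed above. It is not postmaster code: DemoPMState and the two demo_* functions are hypothetical stand-ins for pmState, the PMSIGNAL_RECOVERY_START branch of sigusr1_handler() and the startup-exit branch of reaper(). Only the three states involved in the race are modelled, and shutdown handling is left out.

```c
/* Hypothetical, cut-down copy of the postmaster state enum. */
typedef enum
{
	DEMO_PM_STARTUP,			/* waiting for startup subprocess */
	DEMO_PM_RECOVERY,			/* consistent recovery mode */
	DEMO_PM_RUN					/* normal "database is alive" state */
} DemoPMState;

/*
 * Stand-in for the PMSIGNAL_RECOVERY_START branch of sigusr1_handler().
 * The test takes the place of Assert(pmState == PM_STARTUP): a signal
 * that is processed after the startup process has already been reaped
 * is ignored instead of moving the state back to recovery.
 */
DemoPMState
demo_on_recovery_start(DemoPMState state)
{
	if (state == DEMO_PM_STARTUP)
		return DEMO_PM_RECOVERY;
	return state;
}

/*
 * Stand-in for reaper() seeing a clean exit of the startup process.
 * Pending-shutdown handling is omitted in this sketch.
 */
DemoPMState
demo_on_startup_exit(DemoPMState state)
{
	(void) state;				/* previous state is not consulted here */
	return DEMO_PM_RUN;
}
```

With the usual ordering (signal handled first, then the exit) the state goes STARTUP, RECOVERY, RUN. With the ordering from steps 3 and 4 of the race (exit handled first, then the late signal) the state stays at RUN rather than falling back to RECOVERY.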

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#25

Simon Riggs

simon@2ndQuadrant.com

about 17 years ago

In reply to: Fujii Masao (#24)

Re: [PATCHES] Infrastructure changes for recovery (v8)

On Mon, 2008-12-29 at 15:06 +0900, Fujii Masao wrote:

This seems to have not been fixed yet in the latest patch.

http://archives.postgresql.org/message-id/494FF631.90908@enterprisedb.com
recovery-infra-separated-again-1.patch

I'll add it to my issues-reported list so we can check for regressions.

Thanks,

--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support