Online base backup from the hot-standby

Started by Jun Ishiduka, over 14 years ago, 91 messages

ishizuka.jun@po.ntts.co.jp

over 14 years ago

2 attachment(s)

I will provide a patch which can execute pg_start/stop_backup,
including fixes for the above comments and conditions, in the next stage.
Then please review.

done.

* Procedure

1. Call pg_start_backup('x') on the standby.
2. Take a backup of the data dir.
3. Call pg_stop_backup() on the standby.
4. Copy the control file on the standby to the backup.
5. Use pg_controldata to check that the copied control file still shows the hot standby status.
-> If the standby is promoted between steps 3 and 4, the backup cannot be recovered.
-> In that case, pg_control shows "Minimum recovery ending location" equal to 0/0.
-> The backup-end record is not written.
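
Step 5 can be sketched in C as below. This is only an illustration, not code from the patch: it assumes the caller has already extracted the text that pg_controldata prints after "Minimum recovery ending location:", and parse_lsn() and backup_control_file_usable() are hypothetical helper names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: parse an LSN printed as "hi/lo" in hexadecimal. */
static bool
parse_lsn(const char *text, unsigned int *hi, unsigned int *lo)
{
	char		trailing;

	assert(text != NULL);

	/* Exactly two hex fields separated by '/', with nothing after them. */
	return sscanf(text, "%X/%X%c", hi, lo, &trailing) == 2;
}

/*
 * Hypothetical check for step 5: a control file copied while the standby is
 * still in recovery is expected to carry a non-zero minimum recovery ending
 * location. 0/0 suggests the standby was promoted before the file was copied.
 */
static bool
backup_control_file_usable(const char *min_recovery_end)
{
	unsigned int hi;
	unsigned int lo;

	if (!parse_lsn(min_recovery_end, &hi, &lo))
		return false;			/* unparsable value: treat as unusable */

	return !(hi == 0 && lo == 0);
}
```

With this sketch, a value of 0/0 means the copied control file should be rejected and the backup retaken; any other well-formed value suggests the file was copied while the standby was still in recovery.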

* Not yet addressed

* full_page_writes = off
-> If the primary has "full_page_writes = off", archive recovery may not work
correctly. Therefore the standby may need to check from the WAL whether
"full_page_writes = off" is in effect.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_05.patch (application/octet-stream)

diff -rcN postgresql/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql/src/backend/access/transam/xlog.c	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-08-03 19:03:00.000000000 +0900
***************
*** 6053,6059 ****
  		ereport(LOG,
  				(errmsg("database system was interrupted while in recovery at log time %s",
  						str_time(ControlFile->checkPointCopy.time)),
! 				 errhint("If this has occurred more than once some data might be corrupted"
  			  " and you might need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
--- 6053,6059 ----
  		ereport(LOG,
  				(errmsg("database system was interrupted while in recovery at log time %s",
  						str_time(ControlFile->checkPointCopy.time)),
! 				 errhint("If this has occurred more than once some data might be corrupted unless take backup from slave"
  			  " and you might need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
***************
*** 6299,6304 ****
--- 6299,6318 ----
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
+ 		/* 
+ 		 * get minRecoveryPoint before the value is changed for 
+ 		 * getting backup on standby server.
+ 		 * check the result, set to pg_control.
+ 		 */
+ 		XLogRecPtr prevminRecoveryPoint = ControlFile->minRecoveryPoint;
+ 		if (!ControlFile->backupserver)
+ 		{
+ 			if (prevminRecoveryPoint.xlogid == 0 && prevminRecoveryPoint.xrecoff == 0)
+ 				ControlFile->backupserver = BACKUPSERVER_MASTER;
+ 			else
+ 				ControlFile->backupserver = BACKUPSERVER_SLAVE;
+ 		}
+ 
  		/*
  		 * Update pg_control to show that we are recovering and to show the
  		 * selected checkpoint as the place we are starting from. We also mark
***************
*** 6607,6612 ****
--- 6621,6646 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				/* 
+ 				 * Check whether to reach minRecoveryPoint when getting backup 
+ 				 * on standby server.
+ 				 */
+ 				if (ControlFile->backupserver == BACKUPSERVER_SLAVE)
+ 				{
+ 					if (XLByteLE(prevminRecoveryPoint, EndRecPtr))
+ 					{
+ 						elog(DEBUG1, "end of backup reached in the control file");
+ 
+ 						LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 						MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 						ControlFile->backupserver = BACKUPSERVER_NONE;
+ 						UpdateControlFile();
+ 
+ 						LWLockRelease(ControlFileLock);
+ 					}
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 8402,8408 ****
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8436,8443 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			ControlFile->backupserver == BACKUPSERVER_MASTER)
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8531,8536 ****
--- 8566,8572 ----
  			if (XLByteLT(ControlFile->minRecoveryPoint, lsn))
  				ControlFile->minRecoveryPoint = lsn;
  			MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 			ControlFile->backupserver = BACKUPSERVER_NONE;
  			UpdateControlFile();
  
  			LWLockRelease(ControlFileLock);
***************
*** 8864,8869 ****
--- 8900,8906 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		duringrecovery = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8881,8892 ****
  		   errmsg("must be superuser or replication role to run a backup")));
  
  	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 8918,8926 ----
  		   errmsg("must be superuser or replication role to run a backup")));
  
  	if (RecoveryInProgress())
! 		duringrecovery = true;
  
! 	if (!XLogIsNeeded() && !duringrecovery)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8909,8915 ****
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 8943,8950 ----
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	if (!duringrecovery)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8943,8949 ****
  	}
  	else
  		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
--- 8978,8985 ----
  	}
  	else
  		XLogCtl->Insert.nonExclusiveBackups++;
! 	if (!duringrecovery)
! 		XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
***************
*** 8994,9000 ****
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
! 		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
--- 9030,9036 ----
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
! 		} while (!gotUniqueStartpoint && !duringrecovery);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
***************
*** 9087,9093 ****
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9123,9130 ----
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 		!RecoveryInProgress())
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9131,9136 ****
--- 9168,9174 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		duringrecovery = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9157,9168 ****
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
  	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9195,9203 ----
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
  	if (RecoveryInProgress())
! 		duringrecovery = true;
  
! 	if (!XLogIsNeeded() && !duringrecovery)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9187,9193 ****
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9222,9229 ----
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 	    !duringrecovery)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9241,9246 ****
--- 9277,9289 ----
  	}
  
  	/*
+ 	 * When pg_stop_backup is executed on hot standby server, the result 
+ 	 * is minRecoveryPoint in the control file.
+ 	 */
+ 	if (duringrecovery)
+ 		return ControlFile->minRecoveryPoint;
+ 
+ 	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
***************
*** 9398,9404 ****
  	XLogCtl->Insert.nonExclusiveBackups--;
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9441,9448 ----
  	XLogCtl->Insert.nonExclusiveBackups--;
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 		!RecoveryInProgress())
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
diff -rcN postgresql/src/backend/postmaster/postmaster.c postgresql_with_patch/src/backend/postmaster/postmaster.c
*** postgresql/src/backend/postmaster/postmaster.c	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/postmaster.c	2011-08-03 10:10:06.000000000 +0900
***************
*** 286,293 ****
  
  static PMState pmState = PM_INIT;
  
- static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
- 
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
  
--- 286,291 ----
***************
*** 2313,2319 ****
  			 * Startup succeeded, commence normal operations
  			 */
  			FatalError = false;
- 			ReachedNormalRunning = true;
  			pmState = PM_RUN;
  
  			/*
--- 2311,2316 ----
***************
*** 2853,2859 ****
  		 * backends to die off, but that doesn't work at present because
  		 * killing the startup process doesn't release its locks.
  		 */
! 		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
  		{
  			if (StartupPID != 0)
  				signal_child(StartupPID, SIGTERM);
--- 2850,2856 ----
  		 * backends to die off, but that doesn't work at present because
  		 * killing the startup process doesn't release its locks.
  		 */
! 		if (CountChildren(BACKEND_TYPE_NORMAL) == 0 && !BackupInProgress())
  		{
  			if (StartupPID != 0)
  				signal_child(StartupPID, SIGTERM);
***************
*** 3008,3022 ****
  		{
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
! 				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
  			ExitPostmaster(0);
--- 3005,3013 ----
  		{
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown.  
  			 */
! 			CancelBackup();
  
  			/* Normal exit from the postmaster is here */
  			ExitPostmaster(0);
***************
*** 4232,4237 ****
--- 4223,4233 ----
  		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
  		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))
  	{
+ 		/*
+ 		 * Terminate backup mode to avoid recovery after a clean promote.  
+ 		 */
+ 		CancelBackup();
+ 
  		/* Tell startup process to finish recovery */
  		signal_child(StartupPID, SIGUSR2);
  	}
diff -rcN postgresql/src/bin/pg_controldata/pg_controldata.c postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c
*** postgresql/src/bin/pg_controldata/pg_controldata.c	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c	2011-08-03 10:10:06.000000000 +0900
***************
*** 86,91 ****
--- 86,105 ----
  	return _("unrecognized wal_level");
  }
  
+ static const char *
+ backupserver_str(BackupServer backupserver)
+ {
+ 	switch (backupserver)
+ 	{
+ 		case BACKUPSERVER_NONE:
+ 			return "none";
+ 		case BACKUPSERVER_MASTER:
+ 			return "master";
+ 		case BACKUPSERVER_SLAVE:
+ 			return "slave";
+ 	}
+ 	return _("unrecognized backupserver");
+ }
  
  int
  main(int argc, char *argv[])
***************
*** 232,237 ****
--- 246,253 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup from:                          %s\n"),
+ 		   backupserver_str(ControlFile.backupserver));
  	printf(_("Current wal_level setting:            %s\n"),
  		   wal_level_str(ControlFile.wal_level));
  	printf(_("Current max_connections setting:      %d\n"),
diff -rcN postgresql/src/bin/pg_ctl/pg_ctl.c postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c
*** postgresql/src/bin/pg_ctl/pg_ctl.c	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c	2011-08-03 10:10:06.000000000 +0900
***************
*** 882,894 ****
  	{
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
! 		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
  			print_msg(_("WARNING: online backup mode is active\n"
  						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
--- 882,891 ----
  	{
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
! 		 * that smart shutdown will wait for it to finish. 
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
  			print_msg(_("WARNING: online backup mode is active\n"
  						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
diff -rcN postgresql/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql/src/include/access/xlog.h	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-08-03 10:10:06.000000000 +0900
***************
*** 209,214 ****
--- 209,222 ----
  } WalLevel;
  extern int	wal_level;
  
+ /* where to get backup */
+ typedef enum BackupServer
+ {
+ 	BACKUPSERVER_NONE = 0,
+ 	BACKUPSERVER_MASTER,
+ 	BACKUPSERVER_SLAVE,
+ } BackupServer;
+ 
  #define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
  #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
  
diff -rcN postgresql/src/include/catalog/pg_control.h postgresql_with_patch/src/include/catalog/pg_control.h
*** postgresql/src/include/catalog/pg_control.h	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/src/include/catalog/pg_control.h	2011-08-03 10:10:06.000000000 +0900
***************
*** 142,147 ****
--- 142,156 ----
  	XLogRecPtr	backupStartPoint;
  
  	/*
+ 	 * backupserver is used while postgresql is in recovery mode to
+ 	 * store the location of where the backup comes from.
+ 	 * When Postgres starts recovery operations
+ 	 * it is set to "none". During recovery it is updated to either "master", or "slave".
+ 	 * When recovery operations finish it is updated back to "none".
+ 	 */
+ 	int			backupserver;
+ 
+ 	/*
  	 * Parameter settings that determine if the WAL can be used for archival
  	 * or hot standby.
  	 */

standby_online_backup_doc.patch (application/octet-stream)

diff -rcN postgresql/doc/src/sgml/backup.sgml postgresql_with_patch/doc/src/sgml/backup.sgml
*** postgresql/doc/src/sgml/backup.sgml	2011-08-03 05:18:32.000000000 +0900
--- postgresql_with_patch/doc/src/sgml/backup.sgml	2011-08-05 05:48:23.000000000 +0900
***************
*** 724,734 ****
     <title>Making a Base Backup</title>
  
     <para>
!     The procedure for making a base backup is relatively simple:
    <orderedlist>
     <listitem>
      <para>
!      Ensure that WAL archiving is enabled and working.
      </para>
     </listitem>
     <listitem>
--- 724,736 ----
     <title>Making a Base Backup</title>
  
     <para>
!     The procedure for making a base backup is relatively simple. It can
!     also be run on the hot standby, where the procedure is slightly different:
    <orderedlist>
     <listitem>
      <para>
!      Ensure that WAL archiving is enabled and working. This is not needed on
!      the hot standby.
      </para>
     </listitem>
     <listitem>
***************
*** 787,793 ****
       This terminates the backup mode and performs an automatic switch to
       the next WAL segment.  The reason for the switch is to arrange for
       the last WAL segment file written during the backup interval to be
!      ready to archive.
      </para>
     </listitem>
     <listitem>
--- 789,811 ----
       This terminates the backup mode and performs an automatic switch to
       the next WAL segment.  The reason for the switch is to arrange for
       the last WAL segment file written during the backup interval to be
!      ready to archive. On the hot standby, this terminates the backup mode
!      only.
!     </para>
!    </listitem>
!    <listitem>
!     <para>
!      Copy the <filename>pg_control</> file to the backup taken by the above
!      procedure. This is needed on the hot standby.
!     </para>
!    </listitem>
!    <listitem>
!     <para>
!      Use <application>pg_controldata</application> to check that the backup
!      still shows the hot standby status. This is needed on the hot standby.
!      If it does not (meaning that the hot standby was promoted between
!      <function>pg_stop_backup</> and copying the <filename>pg_control</> file),
!      the backup cannot be recovered.
      </para>
     </listitem>
     <listitem>
***************
*** 807,813 ****
       until the archive succeeds and the backup is complete.
       If you wish to place a time limit on the execution of
       <function>pg_stop_backup</>, set an appropriate
!      <varname>statement_timeout</varname> value.
      </para>
     </listitem>
    </orderedlist>
--- 825,832 ----
       until the archive succeeds and the backup is complete.
       If you wish to place a time limit on the execution of
       <function>pg_stop_backup</>, set an appropriate
!      <varname>statement_timeout</varname> value. On the hot standby, this
!      behaves much like a primary with no <varname>archive_command</> configured.
      </para>
     </listitem>
    </orderedlist>

Cédric Villemain

cedric.villemain.debian@gmail.com

over 14 years ago

In reply to: Jun Ishiduka (#1)

Re: Online base backup from the hot-standby

2011/8/5 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

I will provide a patch which can execute pg_start/stop_backup
including to solve above comment and conditions in next stage.
Then please review.

done.

great !

* Procedure

1. Call pg_start_backup('x') on the standby.
2. Take a backup of the data dir.
3. Call pg_stop_backup() on the standby.
4. Copy the control file on the standby to the backup.
5. Check whether the control file is status during hot standby with pg_controldata.
-> If the standby promote between 3. and 4., the backup can not recovery.
-> pg_control is that "Minimum recovery ending location" is equals 0/0.
-> backup-end record is not written.

* Not correspond yet

* full_page_write = off
-> If the primary is "full_page_write = off", archive recovery may not act
normally. Therefore the standby may need to check whether "full_page_write
= off" to WAL.

Isn't having a standby make the full_page_write = on in all case
(bypass configuration) ?

--
Cédric Villemain +33 (0)6 20 30 22 52
http://2ndQuadrant.fr/
PostgreSQL: Support 24x7 - Développement, Expertise et Formation

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Cédric Villemain (#2)

Re: Online base backup from the hot-standby

* Not correspond yet

* full_page_write = off
-> If the primary is "full_page_write = off", archive recovery may not act
normally. Therefore the standby may need to check whether "full_page_write
= off" to WAL.

Isn't having a standby make the full_page_write = on in all case
(bypass configuration) ?

What do you mean?

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Jun Ishiduka (#3)

Re: Online base backup from the hot-standby

2011/8/15 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

* Not correspond yet

* full_page_write = off
-> If the primary is "full_page_write = off", archive recovery may not act
normally. Therefore the standby may need to check whether "full_page_write
= off" to WAL.

Isn't having a standby make the full_page_write = on in all case
(bypass configuration) ?

what's the meaning?

Yeah. full_page_writes is a WAL generation parameter. Standbys don't
generate WAL. I think you just have to insist that the master has it
on.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Robert Haas (#4)

Re: Online base backup from the hot-standby

* Not correspond yet

* full_page_write = off
-> If the primary is "full_page_write = off", archive recovery may not act
normally. Therefore the standby may need to check whether "full_page_write
= off" to WAL.

Isn't having a standby make the full_page_write = on in all case
(bypass configuration) ?

what's the meaning?

Thanks.

This raises the following two problems.
* pg_start_backup() would have to set full_page_writes to 'on' on the master,
which is the server actually writing the WAL, not on the standby.
* The standby is not necessarily connected to the master that is actually
writing the WAL.
(Ex. Standby2 in cascade replication: Master - Standby1 - Standby2)

I'm not sure how I should solve these problems.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Robert Haas (#4)

Re: Online base backup from the hot-standby

* Not correspond yet

* full_page_write = off
-> If the primary is "full_page_write = off", archive recovery may not act
normally. Therefore the standby may need to check whether "full_page_write
= off" to WAL.

Isn't having a standby make the full_page_write = on in all case
(bypass configuration) ?

what's the meaning?

Yeah. full_page_writes is a WAL generation parameter. Standbys don't
generate WAL. I think you just have to insist that the master has it
on.

Thanks.

I'm not sure how I should solve these problems.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Steve Singer

ssinger_pg@sympatico.ca

over 14 years ago

In reply to: Jun Ishiduka (#5)

Re: Online base backup from the hot-standby

On 11-08-16 02:09 AM, Jun Ishiduka wrote:

Thanks.

This has the following two problems.
* pg_start_backup() must set 'on' to full_page_writes of the master that
is actual writing of the WAL, but not the standby.

Is there any way to tell from the WAL segments if they contain the full
page data? If so could you verify this on the second slave when it is
brought up? Or can you track this on the first slave and produce an
error in either pg_start_backup() or pg_stop_backup()?

I see in xlog.h XLR_BKP_REMOVABLE, the comment above it says that this
flag is used to indicate that the archiver can compress the full page
blocks to non-full page blocks. I am not familiar with where in the code
this actually happens but will this cause issues if the first standby is
processing WAL files from the archive?

* The standby doesn't need to connect to the master that's actual writing
WAL.
(Ex. Standby2 in Cascade Replication: Master - Standby1 - Standby2)

I'm worried how I should clear these problems.

Regards.

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Steve Singer (#7)

Re: Online base backup from the hot-standby

Is there any way to tell from the WAL segments if they contain the full
page data? If so could you verify this on the second slave when it is
brought up? Or can you track this on the first slave and produce an
error in either pg_start_backup or pg_stop_backup()

Sure.
I will make a patch that provides a way to tell from the WAL segments
whether they contain full-page data.

I see in xlog.h XLR_BKP_REMOVABLE, the comment above it says that this
flag is used to indicate that the archiver can compress the full page
blocks to non-full page blocks. I am not familiar with where in the code
this actually happens but will this cause issues if the first standby is
processing WAL files from the archive?

I checked the flag in xlog.c, and it seems to be set only in
XLogInsert(). I will consider whether it can be used.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Jun Ishiduka (#8)

Re: Online base backup from the hot-standby

2011/8/17 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

I see in xlog.h XLR_BKP_REMOVABLE, the comment above it says that this
flag is used to indicate that the archiver can compress the full page
blocks to non-full page blocks. I am not familiar with where in the code
this actually happens but will this cause issues if the first standby is
processing WAL files from the archive?

I confirmed the flag in xlog.c, so I seemed to only insert it in
XLogInsert(). I consider whether it is available.

That flag cannot be used to check whether full-page writes were skipped,
because it is set only on records carrying full-page data, not on records
without it.

The straightforward approach to address the problem you raised is to log
the change of full_page_writes on the master. Since such a WAL record is also
replicated to the standby, the standby can know whether full_page_writes is
enabled or not in the master, from the WAL record. If it's disabled,
pg_start_backup() in the standby should emit an error and refuse standby-only
backup. If the WAL record indicating that full_page_writes was disabled
on the master arrives during standby-only backup, the standby should cancel
the backup.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#10

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Fujii Masao (#9)

Re: Online base backup from the hot-standby

On Wed, Aug 17, 2011 at 6:19 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

2011/8/17 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

I see in xlog.h XLR_BKP_REMOVABLE, the comment above it says that this
flag is used to indicate that the archiver can compress the full page
blocks to non-full page blocks. I am not familiar with where in the code
this actually happens but will this cause issues if the first standby is
processing WAL files from the archive?

I confirmed the flag in xlog.c, so I seemed to only insert it in
XLogInsert(). I consider whether it is available.

That flag is not available to check whether full-page writing was
skipped or not.
Because it's in full-page data, not non-full-page one.

The straightforward approach to address the problem you raised is to log
the change of full_page_writes on the master. Since such a WAL record is also
replicated to the standby, the standby can know whether full_page_writes is
enabled or not in the master, from the WAL record. If it's disabled,
pg_start_backup() in the standby should emit an error and refuse standby-only
backup. If the WAL record indicating that full_page_writes was disabled
on the master arrives during standby-only backup, the standby should cancel
the backup.

Seems like something we could add to XLOG_PARAMETER_CHANGE fairly easily.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Robert Haas (#10)

Re: Online base backup from the hot-standby

On Wed, Aug 17, 2011 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 17, 2011 at 6:19 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

The straightforward approach to address the problem you raised is to log
the change of full_page_writes on the master. Since such a WAL record is also
replicated to the standby, the standby can know whether full_page_writes is
enabled or not in the master, from the WAL record. If it's disabled,
pg_start_backup() in the standby should emit an error and refuse standby-only
backup. If the WAL record indicating that full_page_writes was disabled
on the master arrives during standby-only backup, the standby should cancel
the backup.

Seems like something we could add to XLOG_PARAMETER_CHANGE fairly easily.

I'm afraid it's not so easy. Since fpw can be changed by SIGHUP, it's not
easy to ensure that the change of fpw is logged before the behavior actually
changes. Probably we need to make the backend that first detects the change
of fpw log it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#12

Robert Haas

robertmhaas@gmail.com

over 14 years ago

In reply to: Fujii Masao (#11)

Re: Online base backup from the hot-standby

On Wed, Aug 17, 2011 at 9:53 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Aug 17, 2011 at 9:40 PM, Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Aug 17, 2011 at 6:19 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

The straightforward approach to address the problem you raised is to log
the change of full_page_writes on the master. Since such a WAL record is also
replicated to the standby, the standby can know whether full_page_writes is
enabled or not in the master, from the WAL record. If it's disabled,
pg_start_backup() in the standby should emit an error and refuse standby-only
backup. If the WAL record indicating that full_page_writes was disabled
on the master arrives during standby-only backup, the standby should cancel
the backup.

Seems like something we could add to XLOG_PARAMETER_CHANGE fairly easily.

I'm afraid it's not so easy. Since fpw can be changed by SIGHUP, it's not
easy to ensure that the change of fpw is logged before the behavior
actually changes. Probably we need to make the backend which first detects
the change of fpw log it.

Ugh, you're right. But then you might have problems if the state
changes again before all backends have picked up the previous change.
What I've thought about before is making one backend (say, bgwriter)
store its latest value in shared memory, protected by some lock that
would already be held at the time the value is needed. Everyone else
uses the shared memory copy instead of relying on their local value.
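
A rough single-threaded model of that idea, with assumptions stated up front: the "lock" is only a flag standing in for whatever lock is already held during WAL insertion, and SharedFpwState, publish_fpw and fpw_for_insert are hypothetical names, not PostgreSQL functions.

```c
#include <stdbool.h>

/* Hypothetical shared-memory state; one authoritative copy of the setting. */
typedef struct SharedFpwState
{
	bool		lock_held;			/* stand-in for the already-held lock */
	bool		full_page_writes;	/* the value every backend acts on */
	int			changes_logged;		/* number of change records "written" */
} SharedFpwState;

static void
model_lock_acquire(SharedFpwState *s)
{
	s->lock_held = true;
}

static void
model_lock_release(SharedFpwState *s)
{
	s->lock_held = false;
}

/* Called only by the one designated process after it reloads its config. */
void
publish_fpw(SharedFpwState *s, bool new_value)
{
	model_lock_acquire(s);
	if (s->full_page_writes != new_value)
	{
		s->changes_logged++;				/* log the change first ... */
		s->full_page_writes = new_value;	/* ... then change behavior */
	}
	model_lock_release(s);
}

/* Other backends ignore their local setting and read the shared copy. */
bool
fpw_for_insert(const SharedFpwState *s, bool local_value)
{
	(void) local_value;			/* deliberately unused */
	return s->full_page_writes;
}
```

Because only one process ever publishes, a second SIGHUP cannot be observed by some backends before the first; every backend sees the shared value in the same order in which the changes were logged.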

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Robert Haas (#12)

Re: Online base backup from the hot-standby

On Thu, Aug 18, 2011 at 12:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Ugh, you're right. But then you might have problems if the state
changes again before all backends have picked up the previous change.

Right.

What I've thought about before is making one backend (say, bgwriter)
store its latest value in shared memory, protected by some lock that
would already be held at the time the value is needed. Everyone else
uses the shared memory copy instead of relying on their local value.

Sounds reasonable.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#14

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Jun Ishiduka (#1)

Re: Online base backup from the hot-standby

2011/8/5 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

* Procedure

1. Call pg_start_backup('x') on the standby.
2. Take a backup of the data dir.
3. Call pg_stop_backup() on the standby.
4. Copy the control file on the standby to the backup.
5. Check whether the control file is status during hot standby with pg_controldata.
-> If the standby promote between 3. and 4., the backup can not recovery.
-> pg_control is that "Minimum recovery ending location" is equals 0/0.
-> backup-end record is not written.

What if we do #4 before #3? The backup gets corrupted? My guess is
that the backup is still valid even if we copy pg_control before executing
pg_stop_backup(). Which would not require #5 because if the standby
promotion happens before pg_stop_backup(), pg_stop_backup() can
detect that status change and cancel the backup.

#5 looks fragile. If we can get rid of it, the procedure becomes more
robust, I think.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#15

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Fujii Masao (#14)

Re: Online base backup from the hot-standby

* Procedure

1. Call pg_start_backup('x') on the standby.
2. Take a backup of the data dir.
3. Call pg_stop_backup() on the standby.
4. Copy the control file on the standby to the backup.
5. Check whether the control file is status during hot standby with pg_controldata.
-> If the standby promote between 3. and 4., the backup can not recovery.
-> pg_control is that "Minimum recovery ending location" is equals 0/0.
-> backup-end record is not written.

What if we do #4 before #3? The backup gets corrupted? My guess is
that the backup is still valid even if we copy pg_control before executing
pg_stop_backup(). Which would not require #5 because if the standby
promotion happens before pg_stop_backup(), pg_stop_backup() can
detect that status change and cancel the backup.

#5 looks fragile. If we can get rid of it, the procedure becomes more
robust, I think.

Sure, you're right.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#16

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Jun Ishiduka (#15)

1 attachment(s)

Re: Online base backup from the hot-standby

Hi, I created a patch in response to the comments.

* Procedure
1. Call pg_start_backup('x') on hot standby.
2. Take a backup of the data dir.
3. Copy the control file on hot standby to the backup.
4. Call pg_stop_backup() on hot standby.

* Behavior
(take backup)
When pg_start_backup() is executed on a hot standby, it performs a
restartpoint, writes the string "FROM: slave" into backup_label and enters
backup mode, but it does not forcibly turn full_page_writes "on".
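
As an illustration of reading that backup_label line back, here is a small bounded parser. It is only a sketch: the value "master" for backups taken on the master is an assumption (the mail only shows "FROM: slave"), and parse_backup_from and FROM_BUF_LEN are names made up for this example.

```c
#include <stdio.h>
#include <string.h>

#define FROM_BUF_LEN 10			/* room for "master" or "slave" plus NUL */

/*
 * Parse one backup_label line of the form "FROM: <origin>".
 * Returns 1 when a known origin was found, 0 otherwise. On failure the
 * output defaults to "master", as for a label file without a FROM line.
 */
int
parse_backup_from(const char *line, char out[FROM_BUF_LEN])
{
	char		value[FROM_BUF_LEN];

	strcpy(out, "master");

	/* %9s limits the read so that value[] cannot overflow */
	if (sscanf(line, "FROM: %9s", value) != 1)
		return 0;

	if (strcmp(value, "master") != 0 && strcmp(value, "slave") != 0)
		return 0;

	strcpy(out, value);
	return 1;
}
```

The field width in the format string is the point of the sketch: an over-long or unknown value is rejected instead of being copied into the small buffer.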

When pg_stop_backup() is executed on a hot standby, it renames backup_label
and leaves backup mode, but it neither writes a backup end record or a
backup history file nor waits for WAL archiving to complete.
pg_stop_backup() returns the current MinRecoveryPoint as its result.

If pg_stop_backup() is executed on a server that has since been promoted,
it reads backup_label and reports an error.
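
The intent of that check can be sketched as a comparison between the origin recorded in backup_label and the server's current state. This is a simplified illustration and not the exact condition used in the attached patch; backup_origin_matches is a hypothetical name and "master" is an assumed label value.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Returns true when the server is still in the state it was in when
 * pg_start_backup() wrote the label: "slave" while in recovery,
 * "master" (assumed value) otherwise.
 */
bool
backup_origin_matches(const char *from_label, bool in_recovery)
{
	const char *expected = in_recovery ? "slave" : "master";

	return strcmp(from_label, expected) == 0;
}
```

A standby that was promoted between pg_start_backup() and pg_stop_backup() has a label that says "slave" but is no longer in recovery, so the function returns false and the backup would be rejected.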

(recovery)
When we recover from a backup taken on a hot standby, MinRecoveryPoint in
the control file copied in step 3 of the procedure above is used instead of
a backup end record.

When recovery starts for the first time, BackupEndPoint in the control file
is set to the same value as MinRecoveryPoint. This is to remember the
original value of MinRecoveryPoint during recovery.
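
A simplified model of how recovery could detect the end of such a backup without a backup end record. WalPos, ControlModel and the function names are invented for this sketch; they only mirror the roles of the backup start point, BackupEndPoint and MinRecoveryPoint described above, and 0/0 is treated as "invalid".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-part WAL position; 0/0 means "not set". */
typedef struct WalPos
{
	uint32_t	xlogid;
	uint32_t	xrecoff;
} WalPos;

/* Hypothetical subset of the control file. */
typedef struct ControlModel
{
	WalPos		backup_start;
	WalPos		backup_end;
	WalPos		min_recovery;
} ControlModel;

static bool
walpos_is_invalid(WalPos p)
{
	return p.xlogid == 0 && p.xrecoff == 0;
}

static bool
walpos_le(WalPos a, WalPos b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/* First recovery from the backup: remember where the backup ended. */
void
recovery_begin(ControlModel *c)
{
	if (walpos_is_invalid(c->backup_start) &&
		walpos_is_invalid(c->backup_end) &&
		!walpos_is_invalid(c->min_recovery))
		c->backup_end = c->min_recovery;
}

/* After replaying a record; returns true once the backup end is reached. */
bool
after_replay(ControlModel *c, WalPos end_of_record)
{
	if (walpos_is_invalid(c->backup_end))
		return false;
	if (!walpos_le(c->backup_end, end_of_record))
		return false;

	/* Backup end reached: clear both markers, as for a backup end record. */
	c->backup_start = (WalPos) {0, 0};
	c->backup_end = (WalPos) {0, 0};
	return true;
}
```

The remembered end point plays the role of the missing backup end record: replay has to pass it before the backup can be considered consistent.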

The HINT message ("If this has ...") is always output when we recover from
a backup taken on a hot standby.

* Problem
The full_page_writes problem.

There are the following two problems.
* pg_start_backup() must set full_page_writes to 'on' on the master that
is actually writing the WAL, not on the standby.
* The standby is not necessarily connected directly to the master that is
actually writing the WAL.
(Ex. Standby2 in Cascade Replication: Master - Standby1 - Standby2)

I'm not sure how I should solve these problems.

Status: Considering
(Latest: http://archives.postgresql.org/pgsql-hackers/2011-08/msg00880.php)

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_06.patch (application/octet-stream)

diff -rcN postgresql/doc/src/sgml/backup.sgml postgresql_with_patch/doc/src/sgml/backup.sgml
*** postgresql/doc/src/sgml/backup.sgml	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/doc/src/sgml/backup.sgml	2011-09-12 05:24:42.000000000 +0900
***************
*** 724,734 ****
     <title>Making a Base Backup</title>
  
     <para>
!     The procedure for making a base backup is relatively simple:
    <orderedlist>
     <listitem>
      <para>
!      Ensure that WAL archiving is enabled and working.
      </para>
     </listitem>
     <listitem>
--- 724,736 ----
     <title>Making a Base Backup</title>
  
     <para>
!     The procedure for making a base backup is relatively simple. It can
!     also be run on a hot standby, where the procedure is slightly different:
    <orderedlist>
     <listitem>
      <para>
!      Ensure that WAL archiving is enabled and working. On a hot standby
!      this check is not needed, since the standby does not archive WAL itself.
      </para>
     </listitem>
     <listitem>
***************
*** 780,785 ****
--- 782,795 ----
     </listitem>
     <listitem>
      <para>
+      Copy the <filename>pg_control</> file to the backup taken by the steps above.
+      This is needed only on a hot standby. It lets recovery use the Minimum
+      recovery ending location in <filename>pg_control</> instead of a backup end
+      record.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
       Again connect to the database as a superuser, and issue the command:
  <programlisting>
  SELECT pg_stop_backup();
***************
*** 788,793 ****
--- 798,805 ----
       the next WAL segment.  The reason for the switch is to arrange for
       the last WAL segment file written during the backup interval to be
       ready to archive.
+      On a hot standby this terminates backup mode but does not perform
+      an automatic switch.
      </para>
     </listitem>
     <listitem>
***************
*** 808,813 ****
--- 820,827 ----
       If you wish to place a time limit on the execution of
       <function>pg_stop_backup</>, set an appropriate
       <varname>statement_timeout</varname> value.
+      On a hot standby this wait is not performed. If WAL archiving is used,
+      make sure archiving has completed up to the <function>pg_stop_backup</> result.
      </para>
     </listitem>
    </orderedlist>
***************
*** 933,938 ****
--- 947,964 ----
      backup dump is which and how far back the associated WAL files go.
      It is generally better to follow the continuous archiving procedure above.
     </para>
+ 
+    <para>
+     The <function>pg_stop_backup</> result on a hot standby may be inexact, but
+     if so it is greater than the correct value. If this value is used in
+     recovery, the WAL replayed will not fall short of what the backup needs.
+    </para>
+ 
+    <para>
+     If you run <function>pg_start_backup</> on a hot standby and the server has
+     been promoted to master by the time you run <function>pg_stop_backup</>,
+     <function>pg_stop_backup</> will fail. Retake the backup in that case.
+    </para>
    </sect2>
  
    <sect2 id="backup-pitr-recovery">
diff -rcN postgresql/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql/src/backend/access/transam/xlog.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 664,670 ****
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 664,670 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, char *backupfromstr);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 6023,6028 ****
--- 6023,6029 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	char		backupfromstr[10];
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6061,6067 ****
  				(errmsg("database system was interrupted while in recovery at log time %s",
  						str_time(ControlFile->checkPointCopy.time)),
  				 errhint("If this has occurred more than once some data might be corrupted"
! 			  " and you might need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  			  (errmsg("database system was interrupted; last known up at %s",
--- 6062,6069 ----
  				(errmsg("database system was interrupted while in recovery at log time %s",
  						str_time(ControlFile->checkPointCopy.time)),
  				 errhint("If this has occurred more than once some data might be corrupted"
! 			  " and does not take online backup from hot standby"
! 			  " then you might need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  			  (errmsg("database system was interrupted; last known up at %s",
***************
*** 6156,6162 ****
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6158,6164 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired, backupfromstr))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6307,6312 ****
--- 6309,6328 ----
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/*
+ 		 * Set backupEndPoint at the start if we took the online backup from
+ 		 * a hot standby. backupEndPoint is used to distinguish whether the
+ 		 * backup was taken from the master or a hot standby. If backupStartPoint
+ 		 * and backupEndPoint are both invalid, this is the first
+ 		 * recovery.
+ 		 */
+ 		if (XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 		    XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
+ 		{
+ 			if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint))
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 		}
+ 
+ 		/*
  		 * Update pg_control to show that we are recovering and to show the
  		 * selected checkpoint as the place we are starting from. We also mark
  		 * pg_control with any minimum recovery stop point obtained from a
***************
*** 6618,6623 ****
--- 6634,6660 ----
  				error_context_stack = errcontext.previous;
  
  				/*
+ 				 * Check whether redo reaches minRecoveryPoint if we take online
+ 				 * backup from hot standby. Because we can not write backup end
+ 				 * record when we execute pg_stop_backup under the situation.
+ 				 */
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
+ 				{
+ 					if (XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 					{
+ 						elog(DEBUG1, "end of backup reached in the control file");
+ 
+ 						LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 						MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 						MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 						UpdateControlFile();
+ 
+ 						LWLockRelease(ControlFileLock);
+ 					}
+ 				}
+ 
+ 				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
  				 */
***************
*** 8414,8423 ****
  		/*
  		 * If we see a shutdown checkpoint while waiting for an end-of-backup
  		 * record, the backup was canceled and the end-of-backup record will
! 		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8451,8463 ----
  		/*
  		 * If we see a shutdown checkpoint while waiting for an end-of-backup
  		 * record, the backup was canceled and the end-of-backup record will
! 		 * never arrive. If the backup is taken from hot standby then this
! 		 * error is not output because there is a case of shutdown on master
! 		 * during taking online backup from hot standby.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8880,8885 ****
--- 8920,8926 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		duringrecovery = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8890,8908 ****
  	struct stat stat_buf;
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
  	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 8931,8956 ----
  	struct stat stat_buf;
  	FILE	   *fp;
  	StringInfoData labelfbuf;
+ 	char	   *backupfromstr = BACKUP_FROM_MASTER;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
+ 	/*
+ 	 * check whether during recovery, and determine a string on backup_label.
+ 	 * If duringrecovery is true here then the subsequent process on WAL (check
+ 	 * wal_level, RequestXLogSwitch, forcePageWrites and gotUniqueStartpoint
+ 	 * by RequestCheckpoint) is skipped because hot standby can not write a wal.
+ 	 */
  	if (RecoveryInProgress())
! 	{
! 		duringrecovery = true;
! 		backupfromstr = BACKUP_FROM_SLAVE;
! 	}
  
! 	if (!XLogIsNeeded() && !duringrecovery)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8925,8931 ****
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 8973,8980 ----
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	if (!duringrecovery)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8959,8965 ****
  	}
  	else
  		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
--- 9008,9015 ----
  	}
  	else
  		XLogCtl->Insert.nonExclusiveBackups++;
! 	if (!duringrecovery)
! 		XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
***************
*** 9010,9016 ****
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
! 		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
--- 9060,9066 ----
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
! 		} while (!gotUniqueStartpoint && !duringrecovery);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
***************
*** 9033,9038 ****
--- 9083,9089 ----
  						 exclusive ? "pg_start_backup" : "streamed");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+ 		appendStringInfo(&labelfbuf, "FROM: %s\n", backupfromstr);
  
  		/*
  		 * Okay, write the file, or return its contents to caller.
***************
*** 9105,9111 ****
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9156,9163 ----
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 		!RecoveryInProgress())
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9149,9154 ****
--- 9201,9207 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		duringrecovery = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9168,9192 ****
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
  	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
--- 9221,9273 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	XLogRecPtr	checkPointLoc;
+ 	bool		backupEndRequired = false;
+ 	char		backupfromstr[10];
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
+ 	/*
+ 	 * check whether during recovery. If duringrecovery is true here then the
+ 	 * subsequent process on WAL (check wal_level, forcePageWrites, XLogInsert,
+ 	 * RequestXLogSwitch, write the backup history file and XLogArchivingActive)
+ 	 * is skipped because hot standby can not write a wal.
+ 	 */
  	if (RecoveryInProgress())
! 		duringrecovery = true;
  
! 	if (!XLogIsNeeded() && !duringrecovery)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
+ 	 * backupfromstr is taken from backup_label, which is where we
+ 	 * execute pg_start_backup on. check whether this state is equals it.
+ 	 * If read_backup_label function returns error, this function is
+ 	 * failed by error handling after this.
+ 	 */
+ 	if (read_backup_label(&checkPointLoc, &backupEndRequired, backupfromstr))
+ 	{
+ 		if (duringrecovery == false &&
+ 			strcmp(backupfromstr, BACKUP_FROM_MASTER) != 0 &&
+ 			ControlFile->state != DB_IN_PRODUCTION)
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("different state than when pg_start_backup")));
+ 		if (duringrecovery == true &&
+ 			strcmp(backupfromstr, BACKUP_FROM_SLAVE) != 0 &&
+ 			ControlFile->state != DB_IN_ARCHIVE_RECOVERY)
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("different state than when pg_start_backup")));
+ 	}
+ 
+ 	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
***************
*** 9205,9211 ****
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9286,9293 ----
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 	    !duringrecovery)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9259,9264 ****
--- 9341,9353 ----
  	}
  
  	/*
+ 	 * When pg_stop_backup is executed on a hot standby server, the result
+ 	 * is minRecoveryPoint in the control file.
+ 	 */
+ 	if (duringrecovery)
+ 		return ControlFile->minRecoveryPoint;
+ 
+ 	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
***************
*** 9416,9422 ****
  	XLogCtl->Insert.nonExclusiveBackups--;
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9505,9512 ----
  	XLogCtl->Insert.nonExclusiveBackups--;
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 		!RecoveryInProgress())
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9790,9802 ****
   * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
--- 9880,9893 ----
   * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired, char *backupfromstr)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		strbuff[256];
  
  	*backupEndRequired = false;
  
***************
*** 9842,9847 ****
--- 9933,9952 ----
  			*backupEndRequired = true;
  	}
  
+ 	fgets(strbuff, sizeof(strbuff), lfp);  /* skip one line */
+ 	fgets(strbuff, sizeof(strbuff), lfp);  /* skip one line */
+ 
+ 	/*
+ 	 * Read and parse the FROM line. If not read then output WARNING message
+ 	 * and set BACKUP_FROM_MASTER.
+ 	 *
+ 	 */
+ 	strcpy(backupfromstr, BACKUP_FROM_MASTER);
+ 	if (fscanf(lfp, "FROM: %s%c", backupfromstr, &ch) != 2 || ch != '\n')
+ 		ereport(WARNING,
+ 				(errmsg("loaded old file \"%s\", set backup from \"%s\"",
+ 						BACKUP_LABEL_FILE, BACKUP_FROM_MASTER)));
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
diff -rcN postgresql/src/backend/postmaster/postmaster.c postgresql_with_patch/src/backend/postmaster/postmaster.c
*** postgresql/src/backend/postmaster/postmaster.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/postmaster.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 287,292 ****
--- 287,295 ----
  static PMState pmState = PM_INIT;
  
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ static bool ReachedHotStandbyRunning = false;	/* T if we've reached PM_HOT_STANDBY */
+ static bool WaitBackupForHotStandby = false;	/* T if we've moved from PM_WAIT_READONLY
+ 												 * to PM_WAIT_BACKUP */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
***************
*** 2825,2831 ****
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 			pmState = PM_WAIT_BACKENDS;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
--- 2828,2845 ----
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 		{
! 			/*
! 			 * WaitBackupForHotStandby is set if a smart shutdown was
! 			 * requested while taking an online backup from hot standby.
! 			 * If the flag is true then we need to reach PM_WAIT_BACKENDS
! 			 * by way of PM_WAIT_READONLY.
! 			 */
! 			if (!WaitBackupForHotStandby)
! 				pmState = PM_WAIT_BACKENDS;
! 			else
! 				pmState = PM_WAIT_READONLY;
! 		}
  	}
  
  	if (pmState == PM_WAIT_READONLY)
***************
*** 2840,2850 ****
  		 */
  		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
  		{
! 			if (StartupPID != 0)
! 				signal_child(StartupPID, SIGTERM);
! 			if (WalReceiverPID != 0)
! 				signal_child(WalReceiverPID, SIGTERM);
! 			pmState = PM_WAIT_BACKENDS;
  		}
  	}
  
--- 2854,2878 ----
  		 */
  		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
  		{
! 			if (!BackupInProgress())
! 			{
! 				if (StartupPID != 0)
! 					signal_child(StartupPID, SIGTERM);
! 				if (WalReceiverPID != 0)
! 					signal_child(WalReceiverPID, SIGTERM);
! 				pmState = PM_WAIT_BACKENDS;
! 			}
! 			else
! 			{
! 				/*
! 				 * This is meaning that we execute smart shutdown during
! 				 * online backup from hot standby. we need to allow the
! 				 * connection to the backend by changing PM_WAIT_BACKUP
! 				 * to execute pg_stop_backup.
! 				 */
! 				WaitBackupForHotStandby = true;
! 				pmState = PM_WAIT_BACKUP;
! 			}
  		}
  	}
  
***************
*** 2993,3006 ****
  		{
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
--- 3021,3029 ----
  		{
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown if we reached normal running or hot standby.
  			 */
! 			if (ReachedNormalRunning || ReachedHotStandbyRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
***************
*** 4157,4162 ****
--- 4180,4186 ----
  		ereport(LOG,
  		(errmsg("database system is ready to accept read only connections")));
  
+ 		ReachedHotStandbyRunning = true;
  		pmState = PM_HOT_STANDBY;
  	}
  
diff -rcN postgresql/src/backend/postmaster/postmaster.c.orig postgresql_with_patch/src/backend/postmaster/postmaster.c.orig
*** postgresql/src/backend/postmaster/postmaster.c.orig	1970-01-01 09:00:00.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/postmaster.c.orig	2011-09-12 05:23:29.000000000 +0900
***************
*** 0 ****
--- 1,5055 ----
+ /*-------------------------------------------------------------------------
+  *
+  * postmaster.c
+  *	  This program acts as a clearing house for requests to the
+  *	  POSTGRES system.	Frontend programs send a startup message
+  *	  to the Postmaster and the postmaster uses the info in the
+  *	  message to setup a backend process.
+  *
+  *	  The postmaster also manages system-wide operations such as
+  *	  startup and shutdown. The postmaster itself doesn't do those
+  *	  operations, mind you --- it just forks off a subprocess to do them
+  *	  at the right times.  It also takes care of resetting the system
+  *	  if a backend crashes.
+  *
+  *	  The postmaster process creates the shared memory and semaphore
+  *	  pools during startup, but as a rule does not touch them itself.
+  *	  In particular, it is not a member of the PGPROC array of backends
+  *	  and so it cannot participate in lock-manager operations.	Keeping
+  *	  the postmaster away from shared memory operations makes it simpler
+  *	  and more reliable.  The postmaster is almost always able to recover
+  *	  from crashes of individual backends by resetting shared memory;
+  *	  if it did much with shared memory then it would be prone to crashing
+  *	  along with the backends.
+  *
+  *	  When a request message is received, we now fork() immediately.
+  *	  The child process performs authentication of the request, and
+  *	  then becomes a backend if successful.  This allows the auth code
+  *	  to be written in a simple single-threaded style (as opposed to the
+  *	  crufty "poor man's multitasking" code that used to be needed).
+  *	  More importantly, it ensures that blockages in non-multithreaded
+  *	  libraries like SSL or PAM cannot cause denial of service to other
+  *	  clients.
+  *
+  *
+  * Portions Copyright (c) 1996-2011, PostgreSQL Global Development Group
+  * Portions Copyright (c) 1994, Regents of the University of California
+  *
+  *
+  * IDENTIFICATION
+  *	  src/backend/postmaster/postmaster.c
+  *
+  * NOTES
+  *
+  * Initialization:
+  *		The Postmaster sets up shared memory data structures
+  *		for the backends.
+  *
+  * Synchronization:
+  *		The Postmaster shares memory with the backends but should avoid
+  *		touching shared memory, so as not to become stuck if a crashing
+  *		backend screws up locks or shared memory.  Likewise, the Postmaster
+  *		should never block on messages from frontend clients.
+  *
+  * Garbage Collection:
+  *		The Postmaster cleans up after backends if they have an emergency
+  *		exit and/or core dump.
+  *
+  * Error Reporting:
+  *		Use write_stderr() only for reporting "interactive" errors
+  *		(essentially, bogus arguments on the command line).  Once the
+  *		postmaster is launched, use ereport().
+  *
+  *-------------------------------------------------------------------------
+  */
+ 
+ #include "postgres.h"
+ 
+ #include <unistd.h>
+ #include <signal.h>
+ #include <time.h>
+ #include <sys/wait.h>
+ #include <ctype.h>
+ #include <sys/stat.h>
+ #include <sys/socket.h>
+ #include <fcntl.h>
+ #include <sys/param.h>
+ #include <netinet/in.h>
+ #include <arpa/inet.h>
+ #include <netdb.h>
+ #include <limits.h>
+ 
+ #ifdef HAVE_SYS_SELECT_H
+ #include <sys/select.h>
+ #endif
+ 
+ #ifdef HAVE_GETOPT_H
+ #include <getopt.h>
+ #endif
+ 
+ #ifdef USE_BONJOUR
+ #include <dns_sd.h>
+ #endif
+ 
+ #include "access/transam.h"
+ #include "access/xlog.h"
+ #include "bootstrap/bootstrap.h"
+ #include "catalog/pg_control.h"
+ #include "lib/dllist.h"
+ #include "libpq/auth.h"
+ #include "libpq/ip.h"
+ #include "libpq/libpq.h"
+ #include "libpq/pqsignal.h"
+ #include "miscadmin.h"
+ #include "pgstat.h"
+ #include "postmaster/autovacuum.h"
+ #include "postmaster/fork_process.h"
+ #include "postmaster/pgarch.h"
+ #include "postmaster/postmaster.h"
+ #include "postmaster/syslogger.h"
+ #include "replication/walsender.h"
+ #include "storage/fd.h"
+ #include "storage/ipc.h"
+ #include "storage/pg_shmem.h"
+ #include "storage/pmsignal.h"
+ #include "storage/proc.h"
+ #include "tcop/tcopprot.h"
+ #include "utils/builtins.h"
+ #include "utils/datetime.h"
+ #include "utils/memutils.h"
+ #include "utils/ps_status.h"
+ 
+ #ifdef EXEC_BACKEND
+ #include "storage/spin.h"
+ #endif
+ 
+ 
+ /*
+  * List of active backends (or child processes anyway; we don't actually
+  * know whether a given child has become a backend or is still in the
+  * authorization phase).  This is used mainly to keep track of how many
+  * children we have and send them appropriate signals when necessary.
+  *
+  * "Special" children such as the startup, bgwriter and autovacuum launcher
+  * tasks are not in this list.	Autovacuum worker and walsender processes are
+  * in it. Also, "dead_end" children are in it: these are children launched just
+  * for the purpose of sending a friendly rejection message to a would-be
+  * client.	We must track them because they are attached to shared memory,
+  * but we know they will never become live backends.  dead_end children are
+  * not assigned a PMChildSlot.
+  */
+ typedef struct bkend
+ {
+ 	pid_t		pid;			/* process id of backend */
+ 	long		cancel_key;		/* cancel key for cancels for this backend */
+ 	int			child_slot;		/* PMChildSlot for this backend, if any */
+ 	bool		is_autovacuum;	/* is it an autovacuum process? */
+ 	bool		dead_end;		/* is it going to send an error and quit? */
+ 	Dlelem		elem;			/* list link in BackendList */
+ } Backend;
+ 
+ static Dllist *BackendList;
+ 
+ #ifdef EXEC_BACKEND
+ static Backend *ShmemBackendArray;
+ #endif
+ 
+ /* The socket number we are listening for connections on */
+ int			PostPortNumber;
+ char	   *UnixSocketDir;
+ char	   *ListenAddresses;
+ 
+ /*
+  * ReservedBackends is the number of backends reserved for superuser use.
+  * This number is taken out of the pool size given by MaxBackends, so the
+  * number of backend slots available to non-superusers is
+  * (MaxBackends - ReservedBackends).  Note what this really means is
+  * "if there are <= ReservedBackends connections available, only superusers
+  * can make new connections" --- pre-existing superuser connections don't
+  * count against the limit.
+  */
+ int			ReservedBackends;
+ 
+ /* The socket(s) we're listening to. */
+ #define MAXLISTEN	64
+ static pgsocket ListenSocket[MAXLISTEN];
+ 
+ /*
+  * Set by the -o option
+  */
+ static char ExtraOptions[MAXPGPATH];
+ 
+ /*
+  * These globals control the behavior of the postmaster in case some
+  * backend dumps core.	Normally, it kills all peers of the dead backend
+  * and reinitializes shared memory.  By specifying -T or -n, we can have
+  * the postmaster stop (rather than kill) peers and not reinitialize
+  * shared data structures.	(Reinit is currently dead code, though.)
+  */
+ static bool Reinit = true;
+ static int	SendStop = false;
+ 
+ /* still more option variables */
+ bool		EnableSSL = false;
+ 
+ int			PreAuthDelay = 0;
+ int			AuthenticationTimeout = 60;
+ 
+ bool		log_hostname;		/* for ps display and logging */
+ bool		Log_connections = false;
+ bool		Db_user_namespace = false;
+ 
+ bool		enable_bonjour = false;
+ char	   *bonjour_name;
+ bool		restart_after_crash = true;
+ 
+ /* PIDs of special child processes; 0 when not running */
+ static pid_t StartupPID = 0,
+ 			BgWriterPID = 0,
+ 			WalWriterPID = 0,
+ 			WalReceiverPID = 0,
+ 			AutoVacPID = 0,
+ 			PgArchPID = 0,
+ 			PgStatPID = 0,
+ 			SysLoggerPID = 0;
+ 
+ /* Startup/shutdown state */
+ #define			NoShutdown		0
+ #define			SmartShutdown	1
+ #define			FastShutdown	2
+ 
+ static int	Shutdown = NoShutdown;
+ 
+ static bool FatalError = false; /* T if recovering from backend crash */
+ static bool RecoveryError = false;		/* T if WAL recovery failed */
+ 
+ /*
+  * We use a simple state machine to control startup, shutdown, and
+  * crash recovery (which is rather like shutdown followed by startup).
+  *
+  * After doing all the postmaster initialization work, we enter PM_STARTUP
+  * state and the startup process is launched. The startup process begins by
+  * reading the control file and other preliminary initialization steps.
+  * In a normal startup, or after crash recovery, the startup process exits
+  * with exit code 0 and we switch to PM_RUN state.	However, archive recovery
+  * is handled specially since it takes much longer and we would like to support
+  * hot standby during archive recovery.
+  *
+  * When the startup process is ready to start archive recovery, it signals the
+  * postmaster, and we switch to PM_RECOVERY state. The background writer is
+  * launched, while the startup process continues applying WAL.	If Hot Standby
+  * is enabled, then, after reaching a consistent point in WAL redo, startup
+  * process signals us again, and we switch to PM_HOT_STANDBY state and
+  * begin accepting connections to perform read-only queries.  When archive
+  * recovery is finished, the startup process exits with exit code 0 and we
+  * switch to PM_RUN state.
+  *
+  * Normal child backends can only be launched when we are in PM_RUN or
+  * PM_HOT_STANDBY state.  (We also allow launch of normal
+  * child backends in PM_WAIT_BACKUP state, but only for superusers.)
+  * In other states we handle connection requests by launching "dead_end"
+  * child processes, which will simply send the client an error message and
+  * quit.  (We track these in the BackendList so that we can know when they
+  * are all gone; this is important because they're still connected to shared
+  * memory, and would interfere with an attempt to destroy the shmem segment,
+  * possibly leading to SHMALL failure when we try to make a new one.)
+  * In PM_WAIT_DEAD_END state we are waiting for all the dead_end children
+  * to drain out of the system, and therefore stop accepting connection
+  * requests at all until the last existing child has quit (which hopefully
+  * will not be very long).
+  *
+  * Notice that this state variable does not distinguish *why* we entered
+  * states later than PM_RUN --- Shutdown and FatalError must be consulted
+  * to find that out.  FatalError is never true in PM_RECOVERY, PM_HOT_STANDBY
+  * or PM_RUN states, nor in PM_SHUTDOWN states (because we don't enter those
+  * states when trying to recover from a crash).  It can be true in PM_STARTUP
+  * state, because we don't clear it until we've successfully started WAL redo.
+  * Similarly, RecoveryError means that we have crashed during recovery, and
+  * should not try to restart.
+  */
+ typedef enum
+ {
+ 	PM_INIT,					/* postmaster starting */
+ 	PM_STARTUP,					/* waiting for startup subprocess */
+ 	PM_RECOVERY,				/* in archive recovery mode */
+ 	PM_HOT_STANDBY,				/* in hot standby mode */
+ 	PM_RUN,						/* normal "database is alive" state */
+ 	PM_WAIT_BACKUP,				/* waiting for online backup mode to end */
+ 	PM_WAIT_READONLY,			/* waiting for read only backends to exit */
+ 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
+ 	PM_SHUTDOWN,				/* waiting for bgwriter to do shutdown ckpt */
+ 	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
+ 								 * finish */
+ 	PM_WAIT_DEAD_END,			/* waiting for dead_end children to exit */
+ 	PM_NO_CHILDREN				/* all important children have exited */
+ } PMState;
+ 
+ static PMState pmState = PM_INIT;
+ 
+ static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ 
+ bool		ClientAuthInProgress = false;		/* T during new-client
+ 												 * authentication */
+ 
+ bool		redirection_done = false;	/* stderr redirected for syslogger? */
+ 
+ /* received START_AUTOVAC_LAUNCHER signal */
+ static volatile sig_atomic_t start_autovac_launcher = false;
+ 
+ /* the launcher needs to be signalled to communicate some condition */
+ static volatile bool avlauncher_needs_signal = false;
+ 
+ /*
+  * State for assigning random salts and cancel keys.
+  * Also, the global MyCancelKey passes the cancel key assigned to a given
+  * backend from the postmaster to that backend (via fork).
+  */
+ static unsigned int random_seed = 0;
+ static struct timeval random_start_time;
+ 
+ extern char *optarg;
+ extern int	optind,
+ 			opterr;
+ 
+ #ifdef HAVE_INT_OPTRESET
+ extern int	optreset;			/* might not be declared by system headers */
+ #endif
+ 
+ #ifdef USE_BONJOUR
+ static DNSServiceRef bonjour_sdref = NULL;
+ #endif
+ 
+ /*
+  * postmaster.c - function prototypes
+  */
+ static void getInstallationPaths(const char *argv0);
+ static void checkDataDir(void);
+ static Port *ConnCreate(int serverFd);
+ static void ConnFree(Port *port);
+ static void reset_shared(int port);
+ static void SIGHUP_handler(SIGNAL_ARGS);
+ static void pmdie(SIGNAL_ARGS);
+ static void reaper(SIGNAL_ARGS);
+ static void sigusr1_handler(SIGNAL_ARGS);
+ static void startup_die(SIGNAL_ARGS);
+ static void dummy_handler(SIGNAL_ARGS);
+ static void CleanupBackend(int pid, int exitstatus);
+ static void HandleChildCrash(int pid, int exitstatus, const char *procname);
+ static void LogChildExit(int lev, const char *procname,
+ 			 int pid, int exitstatus);
+ static void PostmasterStateMachine(void);
+ static void BackendInitialize(Port *port);
+ static int	BackendRun(Port *port);
+ static void ExitPostmaster(int status);
+ static int	ServerLoop(void);
+ static int	BackendStartup(Port *port);
+ static int	ProcessStartupPacket(Port *port, bool SSLdone);
+ static void processCancelRequest(Port *port, void *pkt);
+ static int	initMasks(fd_set *rmask);
+ static void report_fork_failure_to_client(Port *port, int errnum);
+ static CAC_state canAcceptConnections(void);
+ static long PostmasterRandom(void);
+ static void RandomSalt(char *md5Salt);
+ static void signal_child(pid_t pid, int signal);
+ static bool SignalSomeChildren(int signal, int targets);
+ 
+ #define SignalChildren(sig)			   SignalSomeChildren(sig, BACKEND_TYPE_ALL)
+ 
+ /*
+  * Possible types of a backend. These are OR-able request flag bits
+  * for SignalSomeChildren() and CountChildren().
+  */
+ #define BACKEND_TYPE_NORMAL		0x0001	/* normal backend */
+ #define BACKEND_TYPE_AUTOVAC	0x0002	/* autovacuum worker process */
+ #define BACKEND_TYPE_WALSND		0x0004	/* walsender process */
+ #define BACKEND_TYPE_ALL		0x0007	/* OR of all the above */
+ 
+ static int	CountChildren(int target);
+ static bool CreateOptsFile(int argc, char *argv[], char *fullprogname);
+ static pid_t StartChildProcess(AuxProcType type);
+ static void StartAutovacuumWorker(void);
+ static void InitPostmasterDeathWatchHandle(void);
+ 
+ #ifdef EXEC_BACKEND
+ 
+ #ifdef WIN32
+ static pid_t win32_waitpid(int *exitstatus);
+ static void WINAPI pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired);
+ 
+ static HANDLE win32ChildQueue;
+ 
+ typedef struct
+ {
+ 	HANDLE		waitHandle;
+ 	HANDLE		procHandle;
+ 	DWORD		procId;
+ } win32_deadchild_waitinfo;
+ #endif
+ 
+ static pid_t backend_forkexec(Port *port);
+ static pid_t internal_forkexec(int argc, char *argv[], Port *port);
+ 
+ /* Type for a socket that can be inherited to a client process */
+ #ifdef WIN32
+ typedef struct
+ {
+ 	SOCKET		origsocket;		/* Original socket value, or PGINVALID_SOCKET
+ 								 * if not a socket */
+ 	WSAPROTOCOL_INFO wsainfo;
+ } InheritableSocket;
+ #else
+ typedef int InheritableSocket;
+ #endif
+ 
+ typedef struct LWLock LWLock;	/* ugly kluge */
+ 
+ /*
+  * Structure containing all variables passed to exec'ed backends
+  */
+ typedef struct
+ {
+ 	Port		port;
+ 	InheritableSocket portsocket;
+ 	char		DataDir[MAXPGPATH];
+ 	pgsocket	ListenSocket[MAXLISTEN];
+ 	long		MyCancelKey;
+ 	int			MyPMChildSlot;
+ #ifndef WIN32
+ 	unsigned long UsedShmemSegID;
+ #else
+ 	HANDLE		UsedShmemSegID;
+ #endif
+ 	void	   *UsedShmemSegAddr;
+ 	slock_t    *ShmemLock;
+ 	VariableCache ShmemVariableCache;
+ 	Backend    *ShmemBackendArray;
+ 	LWLock	   *LWLockArray;
+ 	slock_t    *ProcStructLock;
+ 	PROC_HDR   *ProcGlobal;
+ 	PGPROC	   *AuxiliaryProcs;
+ 	PMSignalData *PMSignalState;
+ 	InheritableSocket pgStatSock;
+ 	pid_t		PostmasterPid;
+ 	TimestampTz PgStartTime;
+ 	TimestampTz PgReloadTime;
+ 	bool		redirection_done;
+ #ifdef WIN32
+ 	HANDLE		PostmasterHandle;
+ 	HANDLE		initial_signal_pipe;
+ 	HANDLE		syslogPipe[2];
+ #else
+ 	int			postmaster_alive_fds[2];
+ 	int			syslogPipe[2];
+ #endif
+ 	char		my_exec_path[MAXPGPATH];
+ 	char		pkglib_path[MAXPGPATH];
+ 	char		ExtraOptions[MAXPGPATH];
+ } BackendParameters;
+ 
+ static void read_backend_variables(char *id, Port *port);
+ static void restore_backend_variables(BackendParameters *param, Port *port);
+ 
+ #ifndef WIN32
+ static bool save_backend_variables(BackendParameters *param, Port *port);
+ #else
+ static bool save_backend_variables(BackendParameters *param, Port *port,
+ 					   HANDLE childProcess, pid_t childPid);
+ #endif
+ 
+ static void ShmemBackendArrayAdd(Backend *bn);
+ static void ShmemBackendArrayRemove(Backend *bn);
+ #endif   /* EXEC_BACKEND */
+ 
+ #define StartupDataBase()		StartChildProcess(StartupProcess)
+ #define StartBackgroundWriter() StartChildProcess(BgWriterProcess)
+ #define StartWalWriter()		StartChildProcess(WalWriterProcess)
+ #define StartWalReceiver()		StartChildProcess(WalReceiverProcess)
+ 
+ /* Macros to check exit status of a child process */
+ #define EXIT_STATUS_0(st)  ((st) == 0)
+ #define EXIT_STATUS_1(st)  (WIFEXITED(st) && WEXITSTATUS(st) == 1)
+ 
+ #ifndef WIN32
+ /*
+  * File descriptors for pipe used to monitor if postmaster is alive.
+  * First is POSTMASTER_FD_WATCH, second is POSTMASTER_FD_OWN.
+  */
+ int postmaster_alive_fds[2] = { -1, -1 };
+ #else
+ /* Process handle of postmaster used for the same purpose on Windows */
+ HANDLE		PostmasterHandle;
+ #endif
+ 
+ /*
+  * Postmaster main entry point
+  */
+ int
+ PostmasterMain(int argc, char *argv[])
+ {
+ 	int			opt;
+ 	int			status;
+ 	char	   *userDoption = NULL;
+ 	bool		listen_addr_saved = false;
+ 	int			i;
+ 
+ 	MyProcPid = PostmasterPid = getpid();
+ 
+ 	MyStartTime = time(NULL);
+ 
+ 	IsPostmasterEnvironment = true;
+ 
+ 	/*
+ 	 * for security, no dir or file created can be group or other accessible
+ 	 */
+ 	umask(S_IRWXG | S_IRWXO);
+ 
+ 	/*
+ 	 * Fire up essential subsystems: memory management
+ 	 */
+ 	MemoryContextInit();
+ 
+ 	/*
+ 	 * By default, palloc() requests in the postmaster will be allocated in
+ 	 * the PostmasterContext, which is space that can be recycled by backends.
+ 	 * Allocated data that needs to be available to backends should be
+ 	 * allocated in TopMemoryContext.
+ 	 */
+ 	PostmasterContext = AllocSetContextCreate(TopMemoryContext,
+ 											  "Postmaster",
+ 											  ALLOCSET_DEFAULT_MINSIZE,
+ 											  ALLOCSET_DEFAULT_INITSIZE,
+ 											  ALLOCSET_DEFAULT_MAXSIZE);
+ 	MemoryContextSwitchTo(PostmasterContext);
+ 
+ 	/* Initialize paths to installation files */
+ 	getInstallationPaths(argv[0]);
+ 
+ 	/*
+ 	 * Options setup
+ 	 */
+ 	InitializeGUCOptions();
+ 
+ 	opterr = 1;
+ 
+ 	/*
+ 	 * Parse command-line options.	CAUTION: keep this in sync with
+ 	 * tcop/postgres.c (the option sets should not conflict) and with the
+ 	 * common help() function in main/main.c.
+ 	 */
+ 	while ((opt = getopt(argc, argv, "A:B:bc:D:d:EeFf:h:ijk:lN:nOo:Pp:r:S:sTt:W:-:")) != -1)
+ 	{
+ 		switch (opt)
+ 		{
+ 			case 'A':
+ 				SetConfigOption("debug_assertions", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'B':
+ 				SetConfigOption("shared_buffers", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'b':
+ 				/* Undocumented flag used for binary upgrades */
+ 				IsBinaryUpgrade = true;
+ 				break;
+ 
+ 			case 'D':
+ 				userDoption = optarg;
+ 				break;
+ 
+ 			case 'd':
+ 				set_debug_options(atoi(optarg), PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'E':
+ 				SetConfigOption("log_statement", "all", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'e':
+ 				SetConfigOption("datestyle", "euro", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'F':
+ 				SetConfigOption("fsync", "false", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'f':
+ 				if (!set_plan_disabling_options(optarg, PGC_POSTMASTER, PGC_S_ARGV))
+ 				{
+ 					write_stderr("%s: invalid argument for option -f: \"%s\"\n",
+ 								 progname, optarg);
+ 					ExitPostmaster(1);
+ 				}
+ 				break;
+ 
+ 			case 'h':
+ 				SetConfigOption("listen_addresses", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'i':
+ 				SetConfigOption("listen_addresses", "*", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'j':
+ 				/* only used by interactive backend */
+ 				break;
+ 
+ 			case 'k':
+ 				SetConfigOption("unix_socket_directory", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'l':
+ 				SetConfigOption("ssl", "true", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'N':
+ 				SetConfigOption("max_connections", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'n':
+ 				/* Don't reinit shared mem after abnormal exit */
+ 				Reinit = false;
+ 				break;
+ 
+ 			case 'O':
+ 				SetConfigOption("allow_system_table_mods", "true", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'o':
+ 				/* Other options to pass to the backend on the command line */
+ 				snprintf(ExtraOptions + strlen(ExtraOptions),
+ 						 sizeof(ExtraOptions) - strlen(ExtraOptions),
+ 						 " %s", optarg);
+ 				break;
+ 
+ 			case 'P':
+ 				SetConfigOption("ignore_system_indexes", "true", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'p':
+ 				SetConfigOption("port", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'r':
+ 				/* only used by single-user backend */
+ 				break;
+ 
+ 			case 'S':
+ 				SetConfigOption("work_mem", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 's':
+ 				SetConfigOption("log_statement_stats", "true", PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'T':
+ 
+ 				/*
+ 				 * In the event that some backend dumps core, send SIGSTOP,
+ 				 * rather than SIGQUIT, to all its peers.  This lets the wily
+ 				 * post_hacker collect core dumps from everyone.
+ 				 */
+ 				SendStop = true;
+ 				break;
+ 
+ 			case 't':
+ 				{
+ 					const char *tmp = get_stats_option_name(optarg);
+ 
+ 					if (tmp)
+ 					{
+ 						SetConfigOption(tmp, "true", PGC_POSTMASTER, PGC_S_ARGV);
+ 					}
+ 					else
+ 					{
+ 						write_stderr("%s: invalid argument for option -t: \"%s\"\n",
+ 									 progname, optarg);
+ 						ExitPostmaster(1);
+ 					}
+ 					break;
+ 				}
+ 
+ 			case 'W':
+ 				SetConfigOption("post_auth_delay", optarg, PGC_POSTMASTER, PGC_S_ARGV);
+ 				break;
+ 
+ 			case 'c':
+ 			case '-':
+ 				{
+ 					char	   *name,
+ 							   *value;
+ 
+ 					ParseLongOption(optarg, &name, &value);
+ 					if (!value)
+ 					{
+ 						if (opt == '-')
+ 							ereport(ERROR,
+ 									(errcode(ERRCODE_SYNTAX_ERROR),
+ 									 errmsg("--%s requires a value",
+ 											optarg)));
+ 						else
+ 							ereport(ERROR,
+ 									(errcode(ERRCODE_SYNTAX_ERROR),
+ 									 errmsg("-c %s requires a value",
+ 											optarg)));
+ 					}
+ 
+ 					SetConfigOption(name, value, PGC_POSTMASTER, PGC_S_ARGV);
+ 					free(name);
+ 					if (value)
+ 						free(value);
+ 					break;
+ 				}
+ 
+ 			default:
+ 				write_stderr("Try \"%s --help\" for more information.\n",
+ 							 progname);
+ 				ExitPostmaster(1);
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Postmaster accepts no non-option switch arguments.
+ 	 */
+ 	if (optind < argc)
+ 	{
+ 		write_stderr("%s: invalid argument: \"%s\"\n",
+ 					 progname, argv[optind]);
+ 		write_stderr("Try \"%s --help\" for more information.\n",
+ 					 progname);
+ 		ExitPostmaster(1);
+ 	}
+ 
+ 	/*
+ 	 * Locate the proper configuration files and data directory, and read
+ 	 * postgresql.conf for the first time.
+ 	 */
+ 	if (!SelectConfigFiles(userDoption, progname))
+ 		ExitPostmaster(2);
+ 
+ 	/* Verify that DataDir looks reasonable */
+ 	checkDataDir();
+ 
+ 	/* And switch working directory into it */
+ 	ChangeToDataDir();
+ 
+ 	/*
+ 	 * Check for invalid combinations of GUC settings.
+ 	 */
+ 	if (ReservedBackends >= MaxBackends)
+ 	{
+ 		write_stderr("%s: superuser_reserved_connections must be less than max_connections\n", progname);
+ 		ExitPostmaster(1);
+ 	}
+ 	if (XLogArchiveMode && wal_level == WAL_LEVEL_MINIMAL)
+ 		ereport(ERROR,
+ 				(errmsg("WAL archival (archive_mode=on) requires wal_level \"archive\" or \"hot_standby\"")));
+ 	if (max_wal_senders > 0 && wal_level == WAL_LEVEL_MINIMAL)
+ 		ereport(ERROR,
+ 				(errmsg("WAL streaming (max_wal_senders > 0) requires wal_level \"archive\" or \"hot_standby\"")));
+ 
+ 	/*
+ 	 * Other one-time internal sanity checks can go here, if they are fast.
+ 	 * (Put any slow processing further down, after postmaster.pid creation.)
+ 	 */
+ 	if (!CheckDateTokenTables())
+ 	{
+ 		write_stderr("%s: invalid datetoken tables, please fix\n", progname);
+ 		ExitPostmaster(1);
+ 	}
+ 
+ 	/*
+ 	 * Now that we are done processing the postmaster arguments, reset
+ 	 * getopt(3) library so that it will work correctly in subprocesses.
+ 	 */
+ 	optind = 1;
+ #ifdef HAVE_INT_OPTRESET
+ 	optreset = 1;				/* some systems need this too */
+ #endif
+ 
+ 	/* For debugging: display postmaster environment */
+ 	{
+ 		extern char **environ;
+ 		char	  **p;
+ 
+ 		ereport(DEBUG3,
+ 				(errmsg_internal("%s: PostmasterMain: initial environment dump:",
+ 								 progname)));
+ 		ereport(DEBUG3,
+ 			 (errmsg_internal("-----------------------------------------")));
+ 		for (p = environ; *p; ++p)
+ 			ereport(DEBUG3,
+ 					(errmsg_internal("\t%s", *p)));
+ 		ereport(DEBUG3,
+ 			 (errmsg_internal("-----------------------------------------")));
+ 	}
+ 
+ 	/*
+ 	 * Create lockfile for data directory.
+ 	 *
+ 	 * We want to do this before we try to grab the input sockets, because the
+ 	 * data directory interlock is more reliable than the socket-file
+ 	 * interlock (thanks to whoever decided to put socket files in /tmp :-().
+ 	 * For the same reason, it's best to grab the TCP socket(s) before the
+ 	 * Unix socket.
+ 	 */
+ 	CreateDataDirLockFile(true);
+ 
+ 	/*
+ 	 * Initialize SSL library, if specified.
+ 	 */
+ #ifdef USE_SSL
+ 	if (EnableSSL)
+ 		secure_initialize();
+ #endif
+ 
+ 	/*
+ 	 * process any libraries that should be preloaded at postmaster start
+ 	 */
+ 	process_shared_preload_libraries();
+ 
+ 	/*
+ 	 * Remove old temporary files.	At this point there can be no other
+ 	 * Postgres processes running in this directory, so this should be safe.
+ 	 */
+ 	RemovePgTempFiles();
+ 
+ 	/*
+ 	 * Establish input sockets.
+ 	 */
+ 	for (i = 0; i < MAXLISTEN; i++)
+ 		ListenSocket[i] = PGINVALID_SOCKET;
+ 
+ 	if (ListenAddresses)
+ 	{
+ 		char	   *rawstring;
+ 		List	   *elemlist;
+ 		ListCell   *l;
+ 		int			success = 0;
+ 
+ 		/* Need a modifiable copy of ListenAddresses */
+ 		rawstring = pstrdup(ListenAddresses);
+ 
+ 		/* Parse string into list of identifiers */
+ 		if (!SplitIdentifierString(rawstring, ',', &elemlist))
+ 		{
+ 			/* syntax error in list */
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 					 errmsg("invalid list syntax for \"listen_addresses\"")));
+ 		}
+ 
+ 		foreach(l, elemlist)
+ 		{
+ 			char	   *curhost = (char *) lfirst(l);
+ 
+ 			if (strcmp(curhost, "*") == 0)
+ 				status = StreamServerPort(AF_UNSPEC, NULL,
+ 										  (unsigned short) PostPortNumber,
+ 										  UnixSocketDir,
+ 										  ListenSocket, MAXLISTEN);
+ 			else
+ 				status = StreamServerPort(AF_UNSPEC, curhost,
+ 										  (unsigned short) PostPortNumber,
+ 										  UnixSocketDir,
+ 										  ListenSocket, MAXLISTEN);
+ 
+ 			if (status == STATUS_OK)
+ 			{
+ 				success++;
+ 				/* record the first successful host addr in lockfile */
+ 				if (!listen_addr_saved)
+ 				{
+ 					AddToDataDirLockFile(LOCK_FILE_LINE_LISTEN_ADDR, curhost);
+ 					listen_addr_saved = true;
+ 				}
+ 			}
+ 			else
+ 				ereport(WARNING,
+ 						(errmsg("could not create listen socket for \"%s\"",
+ 								curhost)));
+ 		}
+ 
+ 		if (!success && list_length(elemlist))
+ 			ereport(FATAL,
+ 					(errmsg("could not create any TCP/IP sockets")));
+ 
+ 		list_free(elemlist);
+ 		pfree(rawstring);
+ 	}
+ 
+ #ifdef USE_BONJOUR
+ 	/* Register for Bonjour only if we opened TCP socket(s) */
+ 	if (enable_bonjour && ListenSocket[0] != PGINVALID_SOCKET)
+ 	{
+ 		DNSServiceErrorType err;
+ 
+ 		/*
+ 		 * We pass 0 for interface_index, which will result in registering on
+ 		 * all "applicable" interfaces.  It's not entirely clear from the
+ 		 * DNS-SD docs whether this would be appropriate if we have bound to
+ 		 * just a subset of the available network interfaces.
+ 		 */
+ 		err = DNSServiceRegister(&bonjour_sdref,
+ 								 0,
+ 								 0,
+ 								 bonjour_name,
+ 								 "_postgresql._tcp.",
+ 								 NULL,
+ 								 NULL,
+ 								 htons(PostPortNumber),
+ 								 0,
+ 								 NULL,
+ 								 NULL,
+ 								 NULL);
+ 		if (err != kDNSServiceErr_NoError)
+ 			elog(LOG, "DNSServiceRegister() failed: error code %ld",
+ 				 (long) err);
+ 
+ 		/*
+ 		 * We don't bother to read the mDNS daemon's reply, and we expect that
+ 		 * it will automatically terminate our registration when the socket is
+ 		 * closed at postmaster termination.  So there's nothing more to be
+ 		 * done here.  However, the bonjour_sdref is kept around so that
+ 		 * forked children can close their copies of the socket.
+ 		 */
+ 	}
+ #endif
+ 
+ #ifdef HAVE_UNIX_SOCKETS
+ 	status = StreamServerPort(AF_UNIX, NULL,
+ 							  (unsigned short) PostPortNumber,
+ 							  UnixSocketDir,
+ 							  ListenSocket, MAXLISTEN);
+ 	if (status != STATUS_OK)
+ 		ereport(WARNING,
+ 				(errmsg("could not create Unix-domain socket")));
+ #endif
+ 
+ 	/*
+ 	 * check that we have some socket to listen on
+ 	 */
+ 	if (ListenSocket[0] == PGINVALID_SOCKET)
+ 		ereport(FATAL,
+ 				(errmsg("no socket created for listening")));
+ 
+ 	/*
+ 	 * If no valid TCP ports, write an empty line for listen address,
+ 	 * indicating the Unix socket must be used.  Note that this line is not
+ 	 * added to the lock file until there is a socket backing it.
+ 	 */
+ 	if (!listen_addr_saved)
+ 		AddToDataDirLockFile(LOCK_FILE_LINE_LISTEN_ADDR, "");
+ 
+ 	/*
+ 	 * Set up shared memory and semaphores.
+ 	 */
+ 	reset_shared(PostPortNumber);
+ 
+ 	/*
+ 	 * Estimate number of openable files.  This must happen after setting up
+ 	 * semaphores, because on some platforms semaphores count as open files.
+ 	 */
+ 	set_max_safe_fds();
+ 
+ 	/*
+ 	 * Initialize the list of active backends.
+ 	 */
+ 	BackendList = DLNewList();
+ 
+ 	/*
+ 	 * Initialize pipe (or process handle on Windows) that allows children to
+ 	 * wake up from sleep on postmaster death.
+ 	 */
+ 	InitPostmasterDeathWatchHandle();
+ 
+ #ifdef WIN32
+ 	/*
+ 	 * Initialize I/O completion port used to deliver list of dead children.
+ 	 */
+ 	win32ChildQueue = CreateIoCompletionPort(INVALID_HANDLE_VALUE, NULL, 0, 1);
+ 	if (win32ChildQueue == NULL)
+ 		ereport(FATAL,
+ 		   (errmsg("could not create I/O completion port for child queue")));
+ #endif
+ 
+ 	/*
+ 	 * Record postmaster options.  We delay this till now to avoid recording
+ 	 * bogus options (eg, NBuffers too high for available memory).
+ 	 */
+ 	if (!CreateOptsFile(argc, argv, my_exec_path))
+ 		ExitPostmaster(1);
+ 
+ #ifdef EXEC_BACKEND
+ 	/* Write out nondefault GUC settings for child processes to use */
+ 	write_nondefault_variables(PGC_POSTMASTER);
+ #endif
+ 
+ 	/*
+ 	 * Write the external PID file if requested
+ 	 */
+ 	if (external_pid_file)
+ 	{
+ 		FILE	   *fpidfile = fopen(external_pid_file, "w");
+ 
+ 		if (fpidfile)
+ 		{
+ 			fprintf(fpidfile, "%d\n", MyProcPid);
+ 			fclose(fpidfile);
+ 			/* Should we remove the pid file on postmaster exit? */
+ 
+ 			/* Make PID file world readable */
+ 			if (chmod(external_pid_file, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH) != 0)
+ 				write_stderr("%s: could not change permissions of external PID file \"%s\": %s\n",
+ 							 progname, external_pid_file, strerror(errno));
+ 		}
+ 		else
+ 			write_stderr("%s: could not write external PID file \"%s\": %s\n",
+ 						 progname, external_pid_file, strerror(errno));
+ 	}
+ 
+ 	/*
+ 	 * Set up signal handlers for the postmaster process.
+ 	 *
+ 	 * CAUTION: when changing this list, check for side-effects on the signal
+ 	 * handling setup of child processes.  See tcop/postgres.c,
+ 	 * bootstrap/bootstrap.c, postmaster/bgwriter.c, postmaster/walwriter.c,
+ 	 * postmaster/autovacuum.c, postmaster/pgarch.c, postmaster/pgstat.c, and
+ 	 * postmaster/syslogger.c.
+ 	 */
+ 	pqinitmask();
+ 	PG_SETMASK(&BlockSig);
+ 
+ 	pqsignal(SIGHUP, SIGHUP_handler);	/* reread config file and have
+ 										 * children do same */
+ 	pqsignal(SIGINT, pmdie);	/* send SIGTERM and shut down */
+ 	pqsignal(SIGQUIT, pmdie);	/* send SIGQUIT and die */
+ 	pqsignal(SIGTERM, pmdie);	/* wait for children and shut down */
+ 	pqsignal(SIGALRM, SIG_IGN); /* ignored */
+ 	pqsignal(SIGPIPE, SIG_IGN); /* ignored */
+ 	pqsignal(SIGUSR1, sigusr1_handler); /* message from child process */
+ 	pqsignal(SIGUSR2, dummy_handler);	/* unused, reserve for children */
+ 	pqsignal(SIGCHLD, reaper);	/* handle child termination */
+ 	pqsignal(SIGTTIN, SIG_IGN); /* ignored */
+ 	pqsignal(SIGTTOU, SIG_IGN); /* ignored */
+ 	/* ignore SIGXFSZ, so that ulimit violations work like disk full */
+ #ifdef SIGXFSZ
+ 	pqsignal(SIGXFSZ, SIG_IGN); /* ignored */
+ #endif
+ 
+ 	/*
+ 	 * If enabled, start up syslogger collection subprocess
+ 	 */
+ 	SysLoggerPID = SysLogger_Start();
+ 
+ 	/*
+ 	 * Reset whereToSendOutput from DestDebug (its starting state) to
+ 	 * DestNone. This stops ereport from sending log messages to stderr unless
+ 	 * Log_destination permits.  We don't do this until the postmaster is
+ 	 * fully launched, since startup failures may as well be reported to
+ 	 * stderr.
+ 	 */
+ 	whereToSendOutput = DestNone;
+ 
+ 	/*
+ 	 * Initialize stats collection subsystem (this does NOT start the
+ 	 * collector process!)
+ 	 */
+ 	pgstat_init();
+ 
+ 	/*
+ 	 * Initialize the autovacuum subsystem (again, no process start yet)
+ 	 */
+ 	autovac_init();
+ 
+ 	/*
+ 	 * Load configuration files for client authentication.
+ 	 */
+ 	if (!load_hba())
+ 	{
+ 		/*
+ 		 * It makes no sense to continue if we fail to load the HBA file,
+ 		 * since there is no way to connect to the database in this case.
+ 		 */
+ 		ereport(FATAL,
+ 				(errmsg("could not load pg_hba.conf")));
+ 	}
+ 	load_ident();
+ 
+ 	/*
+ 	 * Remember postmaster startup time
+ 	 */
+ 	PgStartTime = GetCurrentTimestamp();
+ 	/* PostmasterRandom wants its own copy */
+ 	gettimeofday(&random_start_time, NULL);
+ 
+ 	/*
+ 	 * We're ready to rock and roll...
+ 	 */
+ 	StartupPID = StartupDataBase();
+ 	Assert(StartupPID != 0);
+ 	pmState = PM_STARTUP;
+ 
+ 	status = ServerLoop();
+ 
+ 	/*
+ 	 * ServerLoop probably shouldn't ever return, but if it does, close down.
+ 	 */
+ 	ExitPostmaster(status != STATUS_OK);
+ 
+ 	return 0;					/* not reached */
+ }
+ 
+ 
+ /*
+  * Compute and check the directory paths to files that are part of the
+  * installation (as deduced from the postgres executable's own location)
+  */
+ static void
+ getInstallationPaths(const char *argv0)
+ {
+ 	DIR		   *pdir;
+ 
+ 	/* Locate the postgres executable itself */
+ 	if (find_my_exec(argv0, my_exec_path) < 0)
+ 		elog(FATAL, "%s: could not locate my own executable path", argv0);
+ 
+ #ifdef EXEC_BACKEND
+ 	/* Locate executable backend before we change working directory */
+ 	if (find_other_exec(argv0, "postgres", PG_BACKEND_VERSIONSTR,
+ 						postgres_exec_path) < 0)
+ 		ereport(FATAL,
+ 				(errmsg("%s: could not locate matching postgres executable",
+ 						argv0)));
+ #endif
+ 
+ 	/*
+ 	 * Locate the pkglib directory --- this has to be set early in case we try
+ 	 * to load any modules from it in response to postgresql.conf entries.
+ 	 */
+ 	get_pkglib_path(my_exec_path, pkglib_path);
+ 
+ 	/*
+ 	 * Verify that there's a readable directory there; otherwise the Postgres
+ 	 * installation is incomplete or corrupt.  (A typical cause of this
+ 	 * failure is that the postgres executable has been moved or hardlinked to
+ 	 * some directory that's not a sibling of the installation lib/
+ 	 * directory.)
+ 	 */
+ 	pdir = AllocateDir(pkglib_path);
+ 	if (pdir == NULL)
+ 		ereport(ERROR,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not open directory \"%s\": %m",
+ 						pkglib_path),
+ 				 errhint("This may indicate an incomplete PostgreSQL installation, or that the file \"%s\" has been moved away from its proper location.",
+ 						 my_exec_path)));
+ 	FreeDir(pdir);
+ 
+ 	/*
+ 	 * XXX is it worth similarly checking the share/ directory?  If the lib/
+ 	 * directory is there, then share/ probably is too.
+ 	 */
+ }
+ 
+ 
+ /*
+  * Validate the proposed data directory
+  */
+ static void
+ checkDataDir(void)
+ {
+ 	char		path[MAXPGPATH];
+ 	FILE	   *fp;
+ 	struct stat stat_buf;
+ 
+ 	Assert(DataDir);
+ 
+ 	if (stat(DataDir, &stat_buf) != 0)
+ 	{
+ 		if (errno == ENOENT)
+ 			ereport(FATAL,
+ 					(errcode_for_file_access(),
+ 					 errmsg("data directory \"%s\" does not exist",
+ 							DataDir)));
+ 		else
+ 			ereport(FATAL,
+ 					(errcode_for_file_access(),
+ 				 errmsg("could not read permissions of directory \"%s\": %m",
+ 						DataDir)));
+ 	}
+ 
+ 	/* eventual chdir would fail anyway, but let's test ... */
+ 	if (!S_ISDIR(stat_buf.st_mode))
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("specified data directory \"%s\" is not a directory",
+ 						DataDir)));
+ 
+ 	/*
+ 	 * Check that the directory belongs to my userid; if not, reject.
+ 	 *
+ 	 * This check is an essential part of the interlock that prevents two
+ 	 * postmasters from starting in the same directory (see CreateLockFile()).
+ 	 * Do not remove or weaken it.
+ 	 *
+ 	 * XXX can we safely enable this check on Windows?
+ 	 */
+ #if !defined(WIN32) && !defined(__CYGWIN__)
+ 	if (stat_buf.st_uid != geteuid())
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("data directory \"%s\" has wrong ownership",
+ 						DataDir),
+ 				 errhint("The server must be started by the user that owns the data directory.")));
+ #endif
+ 
+ 	/*
+ 	 * Check if the directory has group or world access.  If so, reject.
+ 	 *
+ 	 * It would be possible to allow weaker constraints (for example, allow
+ 	 * group access) but we cannot make a general assumption that that is
+ 	 * okay; for example there are platforms where nearly all users
+ 	 * customarily belong to the same group.  Perhaps this test should be
+ 	 * configurable.
+ 	 *
+ 	 * XXX temporarily suppress check when on Windows, because there may not
+ 	 * be proper support for Unix-y file permissions.  Need to think of a
+ 	 * reasonable check to apply on Windows.
+ 	 */
+ #if !defined(WIN32) && !defined(__CYGWIN__)
+ 	if (stat_buf.st_mode & (S_IRWXG | S_IRWXO))
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("data directory \"%s\" has group or world access",
+ 						DataDir),
+ 				 errdetail("Permissions should be u=rwx (0700).")));
+ #endif
+ 
+ 	/* Look for PG_VERSION before looking for pg_control */
+ 	ValidatePgVersion(DataDir);
+ 
+ 	snprintf(path, sizeof(path), "%s/global/pg_control", DataDir);
+ 
+ 	fp = AllocateFile(path, PG_BINARY_R);
+ 	if (fp == NULL)
+ 	{
+ 		write_stderr("%s: could not find the database system\n"
+ 					 "Expected to find it in the directory \"%s\",\n"
+ 					 "but could not open file \"%s\": %s\n",
+ 					 progname, DataDir, path, strerror(errno));
+ 		ExitPostmaster(2);
+ 	}
+ 	FreeFile(fp);
+ }
+ 
+ /*
+  * Main idle loop of postmaster
+  */
+ static int
+ ServerLoop(void)
+ {
+ 	fd_set		readmask;
+ 	int			nSockets;
+ 	time_t		now,
+ 				last_touch_time;
+ 
+ 	last_touch_time = time(NULL);
+ 
+ 	nSockets = initMasks(&readmask);
+ 
+ 	for (;;)
+ 	{
+ 		fd_set		rmask;
+ 		int			selres;
+ 
+ 		/*
+ 		 * Wait for a connection request to arrive.
+ 		 *
+ 		 * We wait at most one minute, to ensure that the other background
+ 		 * tasks handled below get done even when no requests are arriving.
+ 		 *
+ 		 * If we are in PM_WAIT_DEAD_END state, then we don't want to accept
+ 		 * any new connections, so we don't call select() at all; just sleep
+ 		 * for a little bit with signals unblocked.
+ 		 */
+ 		memcpy((char *) &rmask, (char *) &readmask, sizeof(fd_set));
+ 
+ 		PG_SETMASK(&UnBlockSig);
+ 
+ 		if (pmState == PM_WAIT_DEAD_END)
+ 		{
+ 			pg_usleep(100000L); /* 100 msec seems reasonable */
+ 			selres = 0;
+ 		}
+ 		else
+ 		{
+ 			/* must set timeout each time; some OSes change it! */
+ 			struct timeval timeout;
+ 
+ 			timeout.tv_sec = 60;
+ 			timeout.tv_usec = 0;
+ 
+ 			selres = select(nSockets, &rmask, NULL, NULL, &timeout);
+ 		}
+ 
+ 		/*
+ 		 * Block all signals until we wait again.  (This makes it safe for our
+ 		 * signal handlers to do nontrivial work.)
+ 		 */
+ 		PG_SETMASK(&BlockSig);
+ 
+ 		/* Now check the select() result */
+ 		if (selres < 0)
+ 		{
+ 			if (errno != EINTR && errno != EWOULDBLOCK)
+ 			{
+ 				ereport(LOG,
+ 						(errcode_for_socket_access(),
+ 						 errmsg("select() failed in postmaster: %m")));
+ 				return STATUS_ERROR;
+ 			}
+ 		}
+ 
+ 		/*
+ 		 * New connection pending on any of our sockets? If so, fork a child
+ 		 * process to deal with it.
+ 		 */
+ 		if (selres > 0)
+ 		{
+ 			int			i;
+ 
+ 			for (i = 0; i < MAXLISTEN; i++)
+ 			{
+ 				if (ListenSocket[i] == PGINVALID_SOCKET)
+ 					break;
+ 				if (FD_ISSET(ListenSocket[i], &rmask))
+ 				{
+ 					Port	   *port;
+ 
+ 					port = ConnCreate(ListenSocket[i]);
+ 					if (port)
+ 					{
+ 						BackendStartup(port);
+ 
+ 						/*
+ 						 * We no longer need the open socket or port structure
+ 						 * in this process
+ 						 */
+ 						StreamClose(port->sock);
+ 						ConnFree(port);
+ 					}
+ 				}
+ 			}
+ 		}
+ 
+ 		/* If we have lost the log collector, try to start a new one */
+ 		if (SysLoggerPID == 0 && Logging_collector)
+ 			SysLoggerPID = SysLogger_Start();
+ 
+ 		/*
+ 		 * If no background writer process is running, and we are not in a
+ 		 * state that prevents it, start one.  It doesn't matter if this
+ 		 * fails, we'll just try again later.
+ 		 */
+ 		if (BgWriterPID == 0 &&
+ 			(pmState == PM_RUN || pmState == PM_RECOVERY ||
+ 			 pmState == PM_HOT_STANDBY))
+ 			BgWriterPID = StartBackgroundWriter();
+ 
+ 		/*
+ 		 * Likewise, if we have lost the walwriter process, try to start a new
+ 		 * one.
+ 		 */
+ 		if (WalWriterPID == 0 && pmState == PM_RUN)
+ 			WalWriterPID = StartWalWriter();
+ 
+ 		/*
+ 		 * If we have lost the autovacuum launcher, try to start a new one. We
+ 		 * don't want autovacuum to run in binary upgrade mode because
+ 		 * autovacuum might update relfrozenxid for empty tables before the
+ 		 * physical files are put in place.
+ 		 */
+ 		if (!IsBinaryUpgrade && AutoVacPID == 0 &&
+ 			(AutoVacuumingActive() || start_autovac_launcher) &&
+ 			pmState == PM_RUN)
+ 		{
+ 			AutoVacPID = StartAutoVacLauncher();
+ 			if (AutoVacPID != 0)
+ 				start_autovac_launcher = false; /* signal processed */
+ 		}
+ 
+ 		/* If we have lost the archiver, try to start a new one */
+ 		if (XLogArchivingActive() && PgArchPID == 0 && pmState == PM_RUN)
+ 			PgArchPID = pgarch_start();
+ 
+ 		/* If we have lost the stats collector, try to start a new one */
+ 		if (PgStatPID == 0 && pmState == PM_RUN)
+ 			PgStatPID = pgstat_start();
+ 
+ 		/* If we need to signal the autovacuum launcher, do so now */
+ 		if (avlauncher_needs_signal)
+ 		{
+ 			avlauncher_needs_signal = false;
+ 			if (AutoVacPID != 0)
+ 				kill(AutoVacPID, SIGUSR2);
+ 		}
+ 
+ 		/*
+ 		 * Touch the socket and lock file every 58 minutes, to ensure that
+ 		 * they are not removed by overzealous /tmp-cleaning tasks.  We assume
+ 		 * no one runs cleaners with cutoff times of less than an hour ...
+ 		 */
+ 		now = time(NULL);
+ 		if (now - last_touch_time >= 58 * SECS_PER_MINUTE)
+ 		{
+ 			TouchSocketFile();
+ 			TouchSocketLockFile();
+ 			last_touch_time = now;
+ 		}
+ 	}
+ }
+ 
+ 
+ /*
+  * Initialise the masks for select() for the ports we are listening on.
+  * Return the number of sockets to listen on.
+  */
+ static int
+ initMasks(fd_set *rmask)
+ {
+ 	int			maxsock = -1;
+ 	int			i;
+ 
+ 	FD_ZERO(rmask);
+ 
+ 	for (i = 0; i < MAXLISTEN; i++)
+ 	{
+ 		int			fd = ListenSocket[i];
+ 
+ 		if (fd == PGINVALID_SOCKET)
+ 			break;
+ 		FD_SET(fd, rmask);
+ 
+ 		if (fd > maxsock)
+ 			maxsock = fd;
+ 	}
+ 
+ 	return maxsock + 1;
+ }
+ 
+ 
+ /*
+  * Read a client's startup packet and do something according to it.
+  *
+  * Returns STATUS_OK or STATUS_ERROR, or might call ereport(FATAL) and
+  * not return at all.
+  *
+  * (Note that ereport(FATAL) stuff is sent to the client, so only use it
+  * if that's what you want.  Return STATUS_ERROR if you don't want to
+  * send anything to the client, which would typically be appropriate
+  * if we detect a communications failure.)
+  */
+ static int
+ ProcessStartupPacket(Port *port, bool SSLdone)
+ {
+ 	int32		len;
+ 	void	   *buf;
+ 	ProtocolVersion proto;
+ 	MemoryContext oldcontext;
+ 
+ 	if (pq_getbytes((char *) &len, 4) == EOF)
+ 	{
+ 		/*
+ 		 * EOF after SSLdone probably means the client didn't like our
+ 		 * response to NEGOTIATE_SSL_CODE.	That's not an error condition, so
+ 		 * don't clutter the log with a complaint.
+ 		 */
+ 		if (!SSLdone)
+ 			ereport(COMMERROR,
+ 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+ 					 errmsg("incomplete startup packet")));
+ 		return STATUS_ERROR;
+ 	}
+ 
+ 	len = ntohl(len);
+ 	len -= 4;
+ 
+ 	if (len < (int32) sizeof(ProtocolVersion) ||
+ 		len > MAX_STARTUP_PACKET_LENGTH)
+ 	{
+ 		ereport(COMMERROR,
+ 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+ 				 errmsg("invalid length of startup packet")));
+ 		return STATUS_ERROR;
+ 	}
+ 
+ 	/*
+ 	 * Allocate at least the size of an old-style startup packet, plus one
+ 	 * extra byte, and make sure all are zeroes.  This ensures we will have
+ 	 * null termination of all strings, in both fixed- and variable-length
+ 	 * packet layouts.
+ 	 */
+ 	if (len <= (int32) sizeof(StartupPacket))
+ 		buf = palloc0(sizeof(StartupPacket) + 1);
+ 	else
+ 		buf = palloc0(len + 1);
+ 
+ 	if (pq_getbytes(buf, len) == EOF)
+ 	{
+ 		ereport(COMMERROR,
+ 				(errcode(ERRCODE_PROTOCOL_VIOLATION),
+ 				 errmsg("incomplete startup packet")));
+ 		return STATUS_ERROR;
+ 	}
+ 
+ 	/*
+ 	 * The first field is either a protocol version number or a special
+ 	 * request code.
+ 	 */
+ 	port->proto = proto = ntohl(*((ProtocolVersion *) buf));
+ 
+ 	if (proto == CANCEL_REQUEST_CODE)
+ 	{
+ 		processCancelRequest(port, buf);
+ 		/* Not really an error, but we don't want to proceed further */
+ 		return STATUS_ERROR;
+ 	}
+ 
+ 	if (proto == NEGOTIATE_SSL_CODE && !SSLdone)
+ 	{
+ 		char		SSLok;
+ 
+ #ifdef USE_SSL
+ 		/* No SSL when disabled or on Unix sockets */
+ 		if (!EnableSSL || IS_AF_UNIX(port->laddr.addr.ss_family))
+ 			SSLok = 'N';
+ 		else
+ 			SSLok = 'S';		/* Support for SSL */
+ #else
+ 		SSLok = 'N';			/* No support for SSL */
+ #endif
+ 
+ retry1:
+ 		if (send(port->sock, &SSLok, 1, 0) != 1)
+ 		{
+ 			if (errno == EINTR)
+ 				goto retry1;	/* if interrupted, just retry */
+ 			ereport(COMMERROR,
+ 					(errcode_for_socket_access(),
+ 					 errmsg("failed to send SSL negotiation response: %m")));
+ 			return STATUS_ERROR;	/* close the connection */
+ 		}
+ 
+ #ifdef USE_SSL
+ 		if (SSLok == 'S' && secure_open_server(port) == -1)
+ 			return STATUS_ERROR;
+ #endif
+ 		/* regular startup packet, cancel, etc packet should follow... */
+ 		/* but not another SSL negotiation request */
+ 		return ProcessStartupPacket(port, true);
+ 	}
+ 
+ 	/* Could add additional special packet types here */
+ 
+ 	/*
+ 	 * Set FrontendProtocol now so that ereport() knows what format to send if
+ 	 * we fail during startup.
+ 	 */
+ 	FrontendProtocol = proto;
+ 
+ 	/* Check we can handle the protocol the frontend is using. */
+ 
+ 	if (PG_PROTOCOL_MAJOR(proto) < PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST) ||
+ 		PG_PROTOCOL_MAJOR(proto) > PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) ||
+ 		(PG_PROTOCOL_MAJOR(proto) == PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST) &&
+ 		 PG_PROTOCOL_MINOR(proto) > PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST)))
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ 				 errmsg("unsupported frontend protocol %u.%u: server supports %u.0 to %u.%u",
+ 						PG_PROTOCOL_MAJOR(proto), PG_PROTOCOL_MINOR(proto),
+ 						PG_PROTOCOL_MAJOR(PG_PROTOCOL_EARLIEST),
+ 						PG_PROTOCOL_MAJOR(PG_PROTOCOL_LATEST),
+ 						PG_PROTOCOL_MINOR(PG_PROTOCOL_LATEST))));
+ 
+ 	/*
+ 	 * Now fetch parameters out of startup packet and save them into the Port
+ 	 * structure.  All data structures attached to the Port struct must be
+ 	 * allocated in TopMemoryContext so that they will remain available in a
+ 	 * running backend (even after PostmasterContext is destroyed).  We need
+ 	 * not worry about leaking this storage on failure, since we aren't in the
+ 	 * postmaster process anymore.
+ 	 */
+ 	oldcontext = MemoryContextSwitchTo(TopMemoryContext);
+ 
+ 	if (PG_PROTOCOL_MAJOR(proto) >= 3)
+ 	{
+ 		int32		offset = sizeof(ProtocolVersion);
+ 
+ 		/*
+ 		 * Scan packet body for name/option pairs.	We can assume any string
+ 		 * beginning within the packet body is null-terminated, thanks to
+ 		 * zeroing extra byte above.
+ 		 */
+ 		port->guc_options = NIL;
+ 
+ 		while (offset < len)
+ 		{
+ 			char	   *nameptr = ((char *) buf) + offset;
+ 			int32		valoffset;
+ 			char	   *valptr;
+ 
+ 			if (*nameptr == '\0')
+ 				break;			/* found packet terminator */
+ 			valoffset = offset + strlen(nameptr) + 1;
+ 			if (valoffset >= len)
+ 				break;			/* missing value, will complain below */
+ 			valptr = ((char *) buf) + valoffset;
+ 
+ 			if (strcmp(nameptr, "database") == 0)
+ 				port->database_name = pstrdup(valptr);
+ 			else if (strcmp(nameptr, "user") == 0)
+ 				port->user_name = pstrdup(valptr);
+ 			else if (strcmp(nameptr, "options") == 0)
+ 				port->cmdline_options = pstrdup(valptr);
+ 			else if (strcmp(nameptr, "replication") == 0)
+ 			{
+ 				if (!parse_bool(valptr, &am_walsender))
+ 					ereport(FATAL,
+ 							(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 							 errmsg("invalid value for boolean option \"replication\"")));
+ 			}
+ 			else
+ 			{
+ 				/* Assume it's a generic GUC option */
+ 				port->guc_options = lappend(port->guc_options,
+ 											pstrdup(nameptr));
+ 				port->guc_options = lappend(port->guc_options,
+ 											pstrdup(valptr));
+ 			}
+ 			offset = valoffset + strlen(valptr) + 1;
+ 		}
+ 
+ 		/*
+ 		 * If we didn't find a packet terminator exactly at the end of the
+ 		 * given packet length, complain.
+ 		 */
+ 		if (offset != len - 1)
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_PROTOCOL_VIOLATION),
+ 					 errmsg("invalid startup packet layout: expected terminator as last byte")));
+ 	}
+ 	else
+ 	{
+ 		/*
+ 		 * Get the parameters from the old-style, fixed-width-fields startup
+ 		 * packet as C strings.  The packet destination was cleared first so a
+ 		 * short packet has zeros silently added.  We have to be prepared to
+ 		 * truncate the pstrdup result for oversize fields, though.
+ 		 */
+ 		StartupPacket *packet = (StartupPacket *) buf;
+ 
+ 		port->database_name = pstrdup(packet->database);
+ 		if (strlen(port->database_name) > sizeof(packet->database))
+ 			port->database_name[sizeof(packet->database)] = '\0';
+ 		port->user_name = pstrdup(packet->user);
+ 		if (strlen(port->user_name) > sizeof(packet->user))
+ 			port->user_name[sizeof(packet->user)] = '\0';
+ 		port->cmdline_options = pstrdup(packet->options);
+ 		if (strlen(port->cmdline_options) > sizeof(packet->options))
+ 			port->cmdline_options[sizeof(packet->options)] = '\0';
+ 		port->guc_options = NIL;
+ 	}
+ 
+ 	/* Check a user name was given. */
+ 	if (port->user_name == NULL || port->user_name[0] == '\0')
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_INVALID_AUTHORIZATION_SPECIFICATION),
+ 			 errmsg("no PostgreSQL user name specified in startup packet")));
+ 
+ 	/* The database defaults to the user name. */
+ 	if (port->database_name == NULL || port->database_name[0] == '\0')
+ 		port->database_name = pstrdup(port->user_name);
+ 
+ 	if (Db_user_namespace)
+ 	{
+ 		/*
+ 		 * If the name is "user@", it is a global user: remove the '@'. We only
+ 		 * do this when the sole '@' is the last character of the name;
+ 		 * otherwise a user could pose as a local user of another database
+ 		 * while attaching to this database.
+ 		 */
+ 		if (strchr(port->user_name, '@') ==
+ 			port->user_name + strlen(port->user_name) - 1)
+ 			*strchr(port->user_name, '@') = '\0';
+ 		else
+ 		{
+ 			/* Append '@' and dbname */
+ 			char	   *db_user;
+ 
+ 			db_user = palloc(strlen(port->user_name) +
+ 							 strlen(port->database_name) + 2);
+ 			sprintf(db_user, "%s@%s", port->user_name, port->database_name);
+ 			port->user_name = db_user;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * Truncate given database and user names to length of a Postgres name.
+ 	 * This avoids lookup failures when overlength names are given.
+ 	 */
+ 	if (strlen(port->database_name) >= NAMEDATALEN)
+ 		port->database_name[NAMEDATALEN - 1] = '\0';
+ 	if (strlen(port->user_name) >= NAMEDATALEN)
+ 		port->user_name[NAMEDATALEN - 1] = '\0';
+ 
+ 	/* Walsender is not related to a particular database */
+ 	if (am_walsender)
+ 		port->database_name[0] = '\0';
+ 
+ 	/*
+ 	 * Done putting stuff in TopMemoryContext.
+ 	 */
+ 	MemoryContextSwitchTo(oldcontext);
+ 
+ 	/*
+ 	 * If we're going to reject the connection due to database state, say so
+ 	 * now instead of wasting cycles on an authentication exchange. (This also
+ 	 * allows a pg_ping utility to be written.)
+ 	 */
+ 	switch (port->canAcceptConnections)
+ 	{
+ 		case CAC_STARTUP:
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ 					 errmsg("the database system is starting up")));
+ 			break;
+ 		case CAC_SHUTDOWN:
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ 					 errmsg("the database system is shutting down")));
+ 			break;
+ 		case CAC_RECOVERY:
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+ 					 errmsg("the database system is in recovery mode")));
+ 			break;
+ 		case CAC_TOOMANY:
+ 			ereport(FATAL,
+ 					(errcode(ERRCODE_TOO_MANY_CONNECTIONS),
+ 					 errmsg("sorry, too many clients already")));
+ 			break;
+ 		case CAC_WAITBACKUP:
+ 			/* OK for now, will check in InitPostgres */
+ 			break;
+ 		case CAC_OK:
+ 			break;
+ 	}
+ 
+ 	return STATUS_OK;
+ }
+ 
+ 
+ /*
+  * The client has sent a cancel request packet, not a normal
+  * start-a-new-connection packet.  Perform the necessary processing.
+  * Nothing is sent back to the client.
+  */
+ static void
+ processCancelRequest(Port *port, void *pkt)
+ {
+ 	CancelRequestPacket *canc = (CancelRequestPacket *) pkt;
+ 	int			backendPID;
+ 	long		cancelAuthCode;
+ 	Backend    *bp;
+ 
+ #ifndef EXEC_BACKEND
+ 	Dlelem	   *curr;
+ #else
+ 	int			i;
+ #endif
+ 
+ 	backendPID = (int) ntohl(canc->backendPID);
+ 	cancelAuthCode = (long) ntohl(canc->cancelAuthCode);
+ 
+ 	/*
+ 	 * See if we have a matching backend.  In the EXEC_BACKEND case, we can no
+ 	 * longer access the postmaster's own backend list, and must rely on the
+ 	 * duplicate array in shared memory.
+ 	 */
+ #ifndef EXEC_BACKEND
+ 	for (curr = DLGetHead(BackendList); curr; curr = DLGetSucc(curr))
+ 	{
+ 		bp = (Backend *) DLE_VAL(curr);
+ #else
+ 	for (i = MaxLivePostmasterChildren() - 1; i >= 0; i--)
+ 	{
+ 		bp = (Backend *) &ShmemBackendArray[i];
+ #endif
+ 		if (bp->pid == backendPID)
+ 		{
+ 			if (bp->cancel_key == cancelAuthCode)
+ 			{
+ 				/* Found a match; signal that backend to cancel current op */
+ 				ereport(DEBUG2,
+ 						(errmsg_internal("processing cancel request: sending SIGINT to process %d",
+ 										 backendPID)));
+ 				signal_child(bp->pid, SIGINT);
+ 			}
+ 			else
+ 				/* Right PID, wrong key: no way, Jose */
+ 				ereport(LOG,
+ 						(errmsg("wrong key in cancel request for process %d",
+ 								backendPID)));
+ 			return;
+ 		}
+ 	}
+ 
+ 	/* No matching backend */
+ 	ereport(LOG,
+ 			(errmsg("PID %d in cancel request did not match any process",
+ 					backendPID)));
+ }
+ 
+ /*
+  * canAcceptConnections --- check to see if database state allows connections.
+  */
+ static CAC_state
+ canAcceptConnections(void)
+ {
+ 	CAC_state	result = CAC_OK;
+ 
+ 	/*
+ 	 * Can't start backends when in startup/shutdown/inconsistent recovery
+ 	 * state.
+ 	 *
+ 	 * In state PM_WAIT_BACKUP only superusers can connect (this must be
+ 	 * allowed so that a superuser can end online backup mode); we return
+ 	 * CAC_WAITBACKUP code to indicate that this must be checked later. Note
+ 	 * that neither CAC_OK nor CAC_WAITBACKUP can safely be returned until we
+ 	 * have checked for too many children.
+ 	 */
+ 	if (pmState != PM_RUN)
+ 	{
+ 		if (pmState == PM_WAIT_BACKUP)
+ 			result = CAC_WAITBACKUP;	/* allow superusers only */
+ 		else if (Shutdown > NoShutdown)
+ 			return CAC_SHUTDOWN;	/* shutdown is pending */
+ 		else if (!FatalError &&
+ 				 (pmState == PM_STARTUP ||
+ 				  pmState == PM_RECOVERY))
+ 			return CAC_STARTUP; /* normal startup */
+ 		else if (!FatalError &&
+ 				 pmState == PM_HOT_STANDBY)
+ 			result = CAC_OK;	/* connection OK during hot standby */
+ 		else
+ 			return CAC_RECOVERY;	/* else must be crash recovery */
+ 	}
+ 
+ 	/*
+ 	 * Don't start too many children.
+ 	 *
+ 	 * We allow more connections than we can have backends here because some
+ 	 * might still be authenticating; they might fail auth, or some existing
+ 	 * backend might exit before the auth cycle is completed. The exact
+ 	 * MaxBackends limit is enforced when a new backend tries to join the
+ 	 * shared-inval backend array.
+ 	 *
+ 	 * The limit here must match the sizes of the per-child-process arrays;
+ 	 * see comments for MaxLivePostmasterChildren().
+ 	 */
+ 	if (CountChildren(BACKEND_TYPE_ALL) >= MaxLivePostmasterChildren())
+ 		result = CAC_TOOMANY;
+ 
+ 	return result;
+ }
+ 
+ 
+ /*
+  * ConnCreate -- create a local connection data structure
+  *
+  * Returns NULL on failure, other than out-of-memory which is fatal.
+  */
+ static Port *
+ ConnCreate(int serverFd)
+ {
+ 	Port	   *port;
+ 
+ 	if (!(port = (Port *) calloc(1, sizeof(Port))))
+ 	{
+ 		ereport(LOG,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ 		ExitPostmaster(1);
+ 	}
+ 
+ 	if (StreamConnection(serverFd, port) != STATUS_OK)
+ 	{
+ 		if (port->sock >= 0)
+ 			StreamClose(port->sock);
+ 		ConnFree(port);
+ 		return NULL;
+ 	}
+ 
+ 	/*
+ 	 * Precompute password salt values to use for this connection. It's
+ 	 * slightly annoying to do this long in advance of knowing whether we'll
+ 	 * need 'em or not, but we must do the random() calls before we fork, not
+ 	 * after.  Else the postmaster's random sequence won't get advanced, and
+ 	 * all backends would end up using the same salt...
+ 	 */
+ 	RandomSalt(port->md5Salt);
+ 
+ 	/*
+ 	 * Allocate GSSAPI specific state struct
+ 	 */
+ #ifndef EXEC_BACKEND
+ #if defined(ENABLE_GSS) || defined(ENABLE_SSPI)
+ 	port->gss = (pg_gssinfo *) calloc(1, sizeof(pg_gssinfo));
+ 	if (!port->gss)
+ 	{
+ 		ereport(LOG,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ 		ExitPostmaster(1);
+ 	}
+ #endif
+ #endif
+ 
+ 	return port;
+ }
+ 
+ 
+ /*
+  * ConnFree -- free a local connection data structure
+  */
+ static void
+ ConnFree(Port *conn)
+ {
+ #ifdef USE_SSL
+ 	secure_close(conn);
+ #endif
+ 	if (conn->gss)
+ 		free(conn->gss);
+ 	free(conn);
+ }
+ 
+ 
+ /*
+  * ClosePostmasterPorts -- close all the postmaster's open sockets
+  *
+  * This is called during child process startup to release file descriptors
+  * that are not needed by that child process.  The postmaster still has
+  * them open, of course.
+  *
+  * Note: we pass am_syslogger as a boolean because we don't want to set
+  * the global variable yet when this is called.
+  */
+ void
+ ClosePostmasterPorts(bool am_syslogger)
+ {
+ 	int			i;
+ 
+ #ifndef WIN32
+ 	/*
+ 	 * Close the write end of postmaster death watch pipe. It's important to
+ 	 * do this as early as possible, so that if postmaster dies, others won't
+ 	 * think that it's still running because we're holding the pipe open.
+ 	 */
+ 	if (close(postmaster_alive_fds[POSTMASTER_FD_OWN]))
+ 		ereport(FATAL,
+ 			(errcode_for_file_access(),
+ 			 errmsg_internal("could not close postmaster death monitoring pipe in child process: %m")));
+ 	postmaster_alive_fds[POSTMASTER_FD_OWN] = -1;
+ #endif
+ 
+ 	/* Close the listen sockets */
+ 	for (i = 0; i < MAXLISTEN; i++)
+ 	{
+ 		if (ListenSocket[i] != PGINVALID_SOCKET)
+ 		{
+ 			StreamClose(ListenSocket[i]);
+ 			ListenSocket[i] = PGINVALID_SOCKET;
+ 		}
+ 	}
+ 
+ 	/* If using syslogger, close the read side of the pipe */
+ 	if (!am_syslogger)
+ 	{
+ #ifndef WIN32
+ 		if (syslogPipe[0] >= 0)
+ 			close(syslogPipe[0]);
+ 		syslogPipe[0] = -1;
+ #else
+ 		if (syslogPipe[0])
+ 			CloseHandle(syslogPipe[0]);
+ 		syslogPipe[0] = 0;
+ #endif
+ 	}
+ 
+ #ifdef USE_BONJOUR
+ 	/* If using Bonjour, close the connection to the mDNS daemon */
+ 	if (bonjour_sdref)
+ 		close(DNSServiceRefSockFD(bonjour_sdref));
+ #endif
+ }
+ 
+ 
+ /*
+  * reset_shared -- reset shared memory and semaphores
+  */
+ static void
+ reset_shared(int port)
+ {
+ 	/*
+ 	 * Create or re-create shared memory and semaphores.
+ 	 *
+ 	 * Note: in each "cycle of life" we will normally assign the same IPC keys
+ 	 * (if using SysV shmem and/or semas), since the port number is used to
+ 	 * determine IPC keys.	This helps ensure that we will clean up dead IPC
+ 	 * objects if the postmaster crashes and is restarted.
+ 	 */
+ 	CreateSharedMemoryAndSemaphores(false, port);
+ }
+ 
+ 
+ /*
+  * SIGHUP -- reread config files, and tell children to do same
+  */
+ static void
+ SIGHUP_handler(SIGNAL_ARGS)
+ {
+ 	int			save_errno = errno;
+ 
+ 	PG_SETMASK(&BlockSig);
+ 
+ 	if (Shutdown <= SmartShutdown)
+ 	{
+ 		ereport(LOG,
+ 				(errmsg("received SIGHUP, reloading configuration files")));
+ 		ProcessConfigFile(PGC_SIGHUP);
+ 		SignalChildren(SIGHUP);
+ 		if (StartupPID != 0)
+ 			signal_child(StartupPID, SIGHUP);
+ 		if (BgWriterPID != 0)
+ 			signal_child(BgWriterPID, SIGHUP);
+ 		if (WalWriterPID != 0)
+ 			signal_child(WalWriterPID, SIGHUP);
+ 		if (WalReceiverPID != 0)
+ 			signal_child(WalReceiverPID, SIGHUP);
+ 		if (AutoVacPID != 0)
+ 			signal_child(AutoVacPID, SIGHUP);
+ 		if (PgArchPID != 0)
+ 			signal_child(PgArchPID, SIGHUP);
+ 		if (SysLoggerPID != 0)
+ 			signal_child(SysLoggerPID, SIGHUP);
+ 		if (PgStatPID != 0)
+ 			signal_child(PgStatPID, SIGHUP);
+ 
+ 		/* Reload authentication config files too */
+ 		if (!load_hba())
+ 			ereport(WARNING,
+ 					(errmsg("pg_hba.conf not reloaded")));
+ 
+ 		load_ident();
+ 
+ #ifdef EXEC_BACKEND
+ 		/* Update the starting-point file for future children */
+ 		write_nondefault_variables(PGC_SIGHUP);
+ #endif
+ 	}
+ 
+ 	PG_SETMASK(&UnBlockSig);
+ 
+ 	errno = save_errno;
+ }
+ 
+ 
+ /*
+  * pmdie -- signal handler for processing various postmaster signals.
+  */
+ static void
+ pmdie(SIGNAL_ARGS)
+ {
+ 	int			save_errno = errno;
+ 
+ 	PG_SETMASK(&BlockSig);
+ 
+ 	ereport(DEBUG2,
+ 			(errmsg_internal("postmaster received signal %d",
+ 							 postgres_signal_arg)));
+ 
+ 	switch (postgres_signal_arg)
+ 	{
+ 		case SIGTERM:
+ 
+ 			/*
+ 			 * Smart Shutdown:
+ 			 *
+ 			 * Wait for children to end their work, then shut down.
+ 			 */
+ 			if (Shutdown >= SmartShutdown)
+ 				break;
+ 			Shutdown = SmartShutdown;
+ 			ereport(LOG,
+ 					(errmsg("received smart shutdown request")));
+ 
+ 			if (pmState == PM_RUN || pmState == PM_RECOVERY ||
+ 				pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
+ 			{
+ 				/* autovacuum workers are told to shut down immediately */
+ 				SignalSomeChildren(SIGTERM, BACKEND_TYPE_AUTOVAC);
+ 				/* and the autovac launcher too */
+ 				if (AutoVacPID != 0)
+ 					signal_child(AutoVacPID, SIGTERM);
+ 				/* and the walwriter too */
+ 				if (WalWriterPID != 0)
+ 					signal_child(WalWriterPID, SIGTERM);
+ 
+ 				/*
+ 				 * If we're in recovery, we can't kill the startup process
+ 				 * right away, because at present doing so does not release
+ 				 * its locks.  We might want to change this in a future
+ 				 * release.  For the time being, the PM_WAIT_READONLY state
+ 				 * indicates that we're waiting for the regular (read only)
+ 				 * backends to die off; once they do, we'll kill the startup
+ 				 * and walreceiver processes.
+ 				 */
+ 				pmState = (pmState == PM_RUN) ?
+ 					PM_WAIT_BACKUP : PM_WAIT_READONLY;
+ 			}
+ 
+ 			/*
+ 			 * Now wait for online backup mode to end and backends to exit. If
+ 			 * that is already the case, PostmasterStateMachine will take the
+ 			 * next step.
+ 			 */
+ 			PostmasterStateMachine();
+ 			break;
+ 
+ 		case SIGINT:
+ 
+ 			/*
+ 			 * Fast Shutdown:
+ 			 *
+ 			 * Abort all children with SIGTERM (rollback active transactions
+ 			 * and exit) and shut down when they are gone.
+ 			 */
+ 			if (Shutdown >= FastShutdown)
+ 				break;
+ 			Shutdown = FastShutdown;
+ 			ereport(LOG,
+ 					(errmsg("received fast shutdown request")));
+ 
+ 			if (StartupPID != 0)
+ 				signal_child(StartupPID, SIGTERM);
+ 			if (WalReceiverPID != 0)
+ 				signal_child(WalReceiverPID, SIGTERM);
+ 			if (pmState == PM_RECOVERY)
+ 			{
+ 				/* only bgwriter is active in this state */
+ 				pmState = PM_WAIT_BACKENDS;
+ 			}
+ 			else if (pmState == PM_RUN ||
+ 					 pmState == PM_WAIT_BACKUP ||
+ 					 pmState == PM_WAIT_READONLY ||
+ 					 pmState == PM_WAIT_BACKENDS ||
+ 					 pmState == PM_HOT_STANDBY)
+ 			{
+ 				ereport(LOG,
+ 						(errmsg("aborting any active transactions")));
+ 				/* shut down all backends and autovac workers */
+ 				SignalSomeChildren(SIGTERM,
+ 								 BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC);
+ 				/* and the autovac launcher too */
+ 				if (AutoVacPID != 0)
+ 					signal_child(AutoVacPID, SIGTERM);
+ 				/* and the walwriter too */
+ 				if (WalWriterPID != 0)
+ 					signal_child(WalWriterPID, SIGTERM);
+ 				pmState = PM_WAIT_BACKENDS;
+ 			}
+ 
+ 			/*
+ 			 * Now wait for backends to exit.  If there are none,
+ 			 * PostmasterStateMachine will take the next step.
+ 			 */
+ 			PostmasterStateMachine();
+ 			break;
+ 
+ 		case SIGQUIT:
+ 
+ 			/*
+ 			 * Immediate Shutdown:
+ 			 *
+ 			 * Abort all children with SIGQUIT and exit without attempting to
+ 			 * properly shut down the database system.
+ 			 */
+ 			ereport(LOG,
+ 					(errmsg("received immediate shutdown request")));
+ 			SignalChildren(SIGQUIT);
+ 			if (StartupPID != 0)
+ 				signal_child(StartupPID, SIGQUIT);
+ 			if (BgWriterPID != 0)
+ 				signal_child(BgWriterPID, SIGQUIT);
+ 			if (WalWriterPID != 0)
+ 				signal_child(WalWriterPID, SIGQUIT);
+ 			if (WalReceiverPID != 0)
+ 				signal_child(WalReceiverPID, SIGQUIT);
+ 			if (AutoVacPID != 0)
+ 				signal_child(AutoVacPID, SIGQUIT);
+ 			if (PgArchPID != 0)
+ 				signal_child(PgArchPID, SIGQUIT);
+ 			if (PgStatPID != 0)
+ 				signal_child(PgStatPID, SIGQUIT);
+ 			ExitPostmaster(0);
+ 			break;
+ 	}
+ 
+ 	PG_SETMASK(&UnBlockSig);
+ 
+ 	errno = save_errno;
+ }
+ 
+ /*
+  * Reaper -- signal handler to clean up after a child process dies.
+  */
+ static void
+ reaper(SIGNAL_ARGS)
+ {
+ 	int			save_errno = errno;
+ 	int			pid;			/* process id of dead child process */
+ 	int			exitstatus;		/* its exit status */
+ 
+ 	/* These macros hide platform variations in getting child status */
+ #ifdef HAVE_WAITPID
+ 	int			status;			/* child exit status */
+ 
+ #define LOOPTEST()		((pid = waitpid(-1, &status, WNOHANG)) > 0)
+ #define LOOPHEADER()	(exitstatus = status)
+ #else							/* !HAVE_WAITPID */
+ #ifndef WIN32
+ 	union wait	status;			/* child exit status */
+ 
+ #define LOOPTEST()		((pid = wait3(&status, WNOHANG, NULL)) > 0)
+ #define LOOPHEADER()	(exitstatus = status.w_status)
+ #else							/* WIN32 */
+ #define LOOPTEST()		((pid = win32_waitpid(&exitstatus)) > 0)
+ #define LOOPHEADER()
+ #endif   /* WIN32 */
+ #endif   /* HAVE_WAITPID */
+ 
+ 	PG_SETMASK(&BlockSig);
+ 
+ 	ereport(DEBUG4,
+ 			(errmsg_internal("reaping dead processes")));
+ 
+ 	while (LOOPTEST())
+ 	{
+ 		LOOPHEADER();
+ 
+ 		/*
+ 		 * Check if this child was a startup process.
+ 		 */
+ 		if (pid == StartupPID)
+ 		{
+ 			StartupPID = 0;
+ 
+ 			/*
+ 			 * Unexpected exit of startup process (including FATAL exit)
+ 			 * during PM_STARTUP is treated as catastrophic. There are no
+ 			 * other processes running yet, so we can just exit.
+ 			 */
+ 			if (pmState == PM_STARTUP && !EXIT_STATUS_0(exitstatus))
+ 			{
+ 				LogChildExit(LOG, _("startup process"),
+ 							 pid, exitstatus);
+ 				ereport(LOG,
+ 				(errmsg("aborting startup due to startup process failure")));
+ 				ExitPostmaster(1);
+ 			}
+ 
+ 			/*
+ 			 * Startup process exited in response to a shutdown request (or it
+ 			 * completed normally regardless of the shutdown request).
+ 			 */
+ 			if (Shutdown > NoShutdown &&
+ 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
+ 			{
+ 				pmState = PM_WAIT_BACKENDS;
+ 				/* PostmasterStateMachine logic does the rest */
+ 				continue;
+ 			}
+ 
+ 			/*
+ 			 * Any unexpected exit (including FATAL exit) of the startup
+ 			 * process is treated as a crash, except that we don't want to
+ 			 * reinitialize.
+ 			 */
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 			{
+ 				RecoveryError = true;
+ 				HandleChildCrash(pid, exitstatus,
+ 								 _("startup process"));
+ 				continue;
+ 			}
+ 
+ 			/*
+ 			 * Startup succeeded, commence normal operations
+ 			 */
+ 			FatalError = false;
+ 			ReachedNormalRunning = true;
+ 			pmState = PM_RUN;
+ 
+ 			/*
+ 			 * Kill any walsenders to force the downstream standby(s) to
+ 			 * reread the timeline history file, adjust their timelines and
+ 			 * establish replication connections again. This is required
+ 			 * because the timeline of the cascading standby is not consistent
+ 			 * with that of the cascaded one just after failover. We LOG this
+ 			 * message since we need to leave a record to explain this
+ 			 * disconnection.
+ 			 *
+ 			 * XXX should avoid the need for disconnection. When we do,
+ 			 * am_cascading_walsender should be replaced with RecoveryInProgress()
+ 			 */
+ 			if (max_wal_senders > 0 && CountChildren(BACKEND_TYPE_WALSND) > 0)
+ 			{
+ 				ereport(LOG,
+ 						(errmsg("terminating all walsender processes to force cascaded "
+ 								"standby(s) to update timeline and reconnect")));
+ 				SignalSomeChildren(SIGUSR2, BACKEND_TYPE_WALSND);
+ 			}
+ 
+ 			/*
+ 			 * Crank up the background writer, if we didn't do that already
+ 			 * when we entered consistent recovery state.  It doesn't matter
+ 			 * if this fails, we'll just try again later.
+ 			 */
+ 			if (BgWriterPID == 0)
+ 				BgWriterPID = StartBackgroundWriter();
+ 
+ 			/*
+ 			 * Likewise, start other special children as needed.  In a restart
+ 			 * situation, some of them may be alive already.
+ 			 */
+ 			if (WalWriterPID == 0)
+ 				WalWriterPID = StartWalWriter();
+ 			if (!IsBinaryUpgrade && AutoVacuumingActive() && AutoVacPID == 0)
+ 				AutoVacPID = StartAutoVacLauncher();
+ 			if (XLogArchivingActive() && PgArchPID == 0)
+ 				PgArchPID = pgarch_start();
+ 			if (PgStatPID == 0)
+ 				PgStatPID = pgstat_start();
+ 
+ 			/* at this point we are really open for business */
+ 			ereport(LOG,
+ 				 (errmsg("database system is ready to accept connections")));
+ 
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the bgwriter?
+ 		 */
+ 		if (pid == BgWriterPID)
+ 		{
+ 			BgWriterPID = 0;
+ 			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+ 			{
+ 				/*
+ 				 * OK, we saw normal exit of the bgwriter after it's been told
+ 				 * to shut down.  We expect that it wrote a shutdown
+ 				 * checkpoint.	(If for some reason it didn't, recovery will
+ 				 * occur on next postmaster start.)
+ 				 *
+ 				 * At this point we should have no normal backend children
+ 				 * left (else we'd not be in PM_SHUTDOWN state) but we might
+ 				 * have dead_end children to wait for.
+ 				 *
+ 				 * If we have an archiver subprocess, tell it to do a last
+ 				 * archive cycle and quit. Likewise, if we have walsender
+ 				 * processes, tell them to send any remaining WAL and quit.
+ 				 */
+ 				Assert(Shutdown > NoShutdown);
+ 
+ 				/* Waken archiver for the last time */
+ 				if (PgArchPID != 0)
+ 					signal_child(PgArchPID, SIGUSR2);
+ 
+ 				/*
+ 				 * Waken walsenders for the last time. No regular backends
+ 				 * should be around anymore.
+ 				 */
+ 				SignalChildren(SIGUSR2);
+ 
+ 				pmState = PM_SHUTDOWN_2;
+ 
+ 				/*
+ 				 * We can also shut down the stats collector now; there's
+ 				 * nothing left for it to do.
+ 				 */
+ 				if (PgStatPID != 0)
+ 					signal_child(PgStatPID, SIGQUIT);
+ 			}
+ 			else
+ 			{
+ 				/*
+ 				 * Any unexpected exit of the bgwriter (including FATAL exit)
+ 				 * is treated as a crash.
+ 				 */
+ 				HandleChildCrash(pid, exitstatus,
+ 								 _("background writer process"));
+ 			}
+ 
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the wal writer?  Normal exit can be ignored; we'll start a
+ 		 * new one at the next iteration of the postmaster's main loop, if
+ 		 * necessary.  Any other exit condition is treated as a crash.
+ 		 */
+ 		if (pid == WalWriterPID)
+ 		{
+ 			WalWriterPID = 0;
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 				HandleChildCrash(pid, exitstatus,
+ 								 _("WAL writer process"));
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the wal receiver?  If exit status is zero (normal) or one
+ 		 * (FATAL exit), we assume everything is all right just like normal
+ 		 * backends.
+ 		 */
+ 		if (pid == WalReceiverPID)
+ 		{
+ 			WalReceiverPID = 0;
+ 			if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ 				HandleChildCrash(pid, exitstatus,
+ 								 _("WAL receiver process"));
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the autovacuum launcher?	Normal exit can be ignored; we'll
+ 		 * start a new one at the next iteration of the postmaster's main
+ 		 * loop, if necessary.	Any other exit condition is treated as a
+ 		 * crash.
+ 		 */
+ 		if (pid == AutoVacPID)
+ 		{
+ 			AutoVacPID = 0;
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 				HandleChildCrash(pid, exitstatus,
+ 								 _("autovacuum launcher process"));
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the archiver?  If so, just try to start a new one; no need
+ 		 * to force reset of the rest of the system.  (If fail, we'll try
+ 		 * again in future cycles of the main loop.)  Unless we were waiting
+ 		 * for it to shut down; don't restart it in that case, and
+ 		 * PostmasterStateMachine() will advance to the next shutdown step.
+ 		 */
+ 		if (pid == PgArchPID)
+ 		{
+ 			PgArchPID = 0;
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 				LogChildExit(LOG, _("archiver process"),
+ 							 pid, exitstatus);
+ 			if (XLogArchivingActive() && pmState == PM_RUN)
+ 				PgArchPID = pgarch_start();
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Was it the statistics collector?  If so, just try to start a new
+ 		 * one; no need to force reset of the rest of the system.  (If fail,
+ 		 * we'll try again in future cycles of the main loop.)
+ 		 */
+ 		if (pid == PgStatPID)
+ 		{
+ 			PgStatPID = 0;
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 				LogChildExit(LOG, _("statistics collector process"),
+ 							 pid, exitstatus);
+ 			if (pmState == PM_RUN)
+ 				PgStatPID = pgstat_start();
+ 			continue;
+ 		}
+ 
+ 		/* Was it the system logger?  If so, try to start a new one */
+ 		if (pid == SysLoggerPID)
+ 		{
+ 			SysLoggerPID = 0;
+ 			/* for safety's sake, launch new logger *first* */
+ 			SysLoggerPID = SysLogger_Start();
+ 			if (!EXIT_STATUS_0(exitstatus))
+ 				LogChildExit(LOG, _("system logger process"),
+ 							 pid, exitstatus);
+ 			continue;
+ 		}
+ 
+ 		/*
+ 		 * Else do standard backend child cleanup.
+ 		 */
+ 		CleanupBackend(pid, exitstatus);
+ 	}							/* loop over pending child-death reports */
+ 
+ 	/*
+ 	 * After cleaning out the SIGCHLD queue, see if we have any state changes
+ 	 * or actions to make.
+ 	 */
+ 	PostmasterStateMachine();
+ 
+ 	/* Done with signal handler */
+ 	PG_SETMASK(&UnBlockSig);
+ 
+ 	errno = save_errno;
+ }
+ 
+ 
+ /*
+  * CleanupBackend -- clean up after terminated backend.
+  *
+  * Remove all local state associated with backend.
+  */
+ static void
+ CleanupBackend(int pid,
+ 			   int exitstatus)	/* child's exit status. */
+ {
+ 	Dlelem	   *curr;
+ 
+ 	LogChildExit(DEBUG2, _("server process"), pid, exitstatus);
+ 
+ 	/*
+ 	 * If a backend dies in an ugly way then we must signal all other backends
+ 	 * to quickdie.  If exit status is zero (normal) or one (FATAL exit), we
+ 	 * assume everything is all right and proceed to remove the backend from
+ 	 * the active backend list.
+ 	 */
+ #ifdef WIN32
+ 
+ 	/*
+ 	 * On win32, also treat ERROR_WAIT_NO_CHILDREN (128) as nonfatal case,
+ 	 * since that sometimes happens under load when the process fails to start
+ 	 * properly (long before it starts using shared memory). Microsoft reports
+ 	 * it is related to mutex failure:
+ 	 * http://archives.postgresql.org/pgsql-hackers/2010-09/msg00790.php
+ 	 */
+ 	if (exitstatus == ERROR_WAIT_NO_CHILDREN)
+ 	{
+ 		LogChildExit(LOG, _("server process"), pid, exitstatus);
+ 		exitstatus = 0;
+ 	}
+ #endif
+ 
+ 	if (!EXIT_STATUS_0(exitstatus) && !EXIT_STATUS_1(exitstatus))
+ 	{
+ 		HandleChildCrash(pid, exitstatus, _("server process"));
+ 		return;
+ 	}
+ 
+ 	for (curr = DLGetHead(BackendList); curr; curr = DLGetSucc(curr))
+ 	{
+ 		Backend    *bp = (Backend *) DLE_VAL(curr);
+ 
+ 		if (bp->pid == pid)
+ 		{
+ 			if (!bp->dead_end)
+ 			{
+ 				if (!ReleasePostmasterChildSlot(bp->child_slot))
+ 				{
+ 					/*
+ 					 * Uh-oh, the child failed to clean itself up.	Treat as a
+ 					 * crash after all.
+ 					 */
+ 					HandleChildCrash(pid, exitstatus, _("server process"));
+ 					return;
+ 				}
+ #ifdef EXEC_BACKEND
+ 				ShmemBackendArrayRemove(bp);
+ #endif
+ 			}
+ 			DLRemove(curr);
+ 			free(bp);
+ 			break;
+ 		}
+ 	}
+ }
+ 
+ /*
+  * HandleChildCrash -- clean up after failed backend, bgwriter, walwriter,
+  * or autovacuum.
+  *
+  * The objectives here are to clean up our local state about the child
+  * process, and to signal all other remaining children to quickdie.
+  */
+ static void
+ HandleChildCrash(int pid, int exitstatus, const char *procname)
+ {
+ 	Dlelem	   *curr,
+ 			   *next;
+ 	Backend    *bp;
+ 
+ 	/*
+ 	 * Make log entry unless there was a previous crash (if so, nonzero exit
+ 	 * status is to be expected in SIGQUIT response; don't clutter log)
+ 	 */
+ 	if (!FatalError)
+ 	{
+ 		LogChildExit(LOG, procname, pid, exitstatus);
+ 		ereport(LOG,
+ 				(errmsg("terminating any other active server processes")));
+ 	}
+ 
+ 	/* Process regular backends */
+ 	for (curr = DLGetHead(BackendList); curr; curr = next)
+ 	{
+ 		next = DLGetSucc(curr);
+ 		bp = (Backend *) DLE_VAL(curr);
+ 		if (bp->pid == pid)
+ 		{
+ 			/*
+ 			 * Found entry for freshly-dead backend, so remove it.
+ 			 */
+ 			if (!bp->dead_end)
+ 			{
+ 				(void) ReleasePostmasterChildSlot(bp->child_slot);
+ #ifdef EXEC_BACKEND
+ 				ShmemBackendArrayRemove(bp);
+ #endif
+ 			}
+ 			DLRemove(curr);
+ 			free(bp);
+ 			/* Keep looping so we can signal remaining backends */
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * This backend is still alive.  Unless we did so already, tell it
+ 			 * to commit hara-kiri.
+ 			 *
+ 			 * SIGQUIT is the special signal that says exit without proc_exit
+ 			 * and let the user know what's going on. But if SendStop is set
+ 			 * (-s on command line), then we send SIGSTOP instead, so that we
+ 			 * can get core dumps from all backends by hand.
+ 			 *
+ 			 * We could exclude dead_end children here, but at least in the
+ 			 * SIGSTOP case it seems better to include them.
+ 			 */
+ 			if (!FatalError)
+ 			{
+ 				ereport(DEBUG2,
+ 						(errmsg_internal("sending %s to process %d",
+ 										 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 										 (int) bp->pid)));
+ 				signal_child(bp->pid, (SendStop ? SIGSTOP : SIGQUIT));
+ 			}
+ 		}
+ 	}
+ 
+ 	/* Take care of the startup process too */
+ 	if (pid == StartupPID)
+ 		StartupPID = 0;
+ 	else if (StartupPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 								 (int) StartupPID)));
+ 		signal_child(StartupPID, (SendStop ? SIGSTOP : SIGQUIT));
+ 	}
+ 
+ 	/* Take care of the bgwriter too */
+ 	if (pid == BgWriterPID)
+ 		BgWriterPID = 0;
+ 	else if (BgWriterPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 								 (int) BgWriterPID)));
+ 		signal_child(BgWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
+ 	}
+ 
+ 	/* Take care of the walwriter too */
+ 	if (pid == WalWriterPID)
+ 		WalWriterPID = 0;
+ 	else if (WalWriterPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 								 (int) WalWriterPID)));
+ 		signal_child(WalWriterPID, (SendStop ? SIGSTOP : SIGQUIT));
+ 	}
+ 
+ 	/* Take care of the walreceiver too */
+ 	if (pid == WalReceiverPID)
+ 		WalReceiverPID = 0;
+ 	else if (WalReceiverPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 								 (int) WalReceiverPID)));
+ 		signal_child(WalReceiverPID, (SendStop ? SIGSTOP : SIGQUIT));
+ 	}
+ 
+ 	/* Take care of the autovacuum launcher too */
+ 	if (pid == AutoVacPID)
+ 		AutoVacPID = 0;
+ 	else if (AutoVacPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 (SendStop ? "SIGSTOP" : "SIGQUIT"),
+ 								 (int) AutoVacPID)));
+ 		signal_child(AutoVacPID, (SendStop ? SIGSTOP : SIGQUIT));
+ 	}
+ 
+ 	/*
+ 	 * Force a power-cycle of the pgarch process too.  (This isn't absolutely
+ 	 * necessary, but it seems like a good idea for robustness, and it
+ 	 * simplifies the state-machine logic in the case where a shutdown request
+ 	 * arrives during crash processing.)
+ 	 */
+ 	if (PgArchPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 "SIGQUIT",
+ 								 (int) PgArchPID)));
+ 		signal_child(PgArchPID, SIGQUIT);
+ 	}
+ 
+ 	/*
+ 	 * Force a power-cycle of the pgstat process too.  (This isn't absolutely
+ 	 * necessary, but it seems like a good idea for robustness, and it
+ 	 * simplifies the state-machine logic in the case where a shutdown request
+ 	 * arrives during crash processing.)
+ 	 */
+ 	if (PgStatPID != 0 && !FatalError)
+ 	{
+ 		ereport(DEBUG2,
+ 				(errmsg_internal("sending %s to process %d",
+ 								 "SIGQUIT",
+ 								 (int) PgStatPID)));
+ 		signal_child(PgStatPID, SIGQUIT);
+ 		allow_immediate_pgstat_restart();
+ 	}
+ 
+ 	/* We do NOT restart the syslogger */
+ 
+ 	FatalError = true;
+ 	/* We now transit into a state of waiting for children to die */
+ 	if (pmState == PM_RECOVERY ||
+ 		pmState == PM_HOT_STANDBY ||
+ 		pmState == PM_RUN ||
+ 		pmState == PM_WAIT_BACKUP ||
+ 		pmState == PM_WAIT_READONLY ||
+ 		pmState == PM_SHUTDOWN)
+ 		pmState = PM_WAIT_BACKENDS;
+ }
+ 
+ /*
+  * Log the death of a child process.
+  */
+ static void
+ LogChildExit(int lev, const char *procname, int pid, int exitstatus)
+ {
+ 	if (WIFEXITED(exitstatus))
+ 		ereport(lev,
+ 
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "server process" */
+ 				(errmsg("%s (PID %d) exited with exit code %d",
+ 						procname, pid, WEXITSTATUS(exitstatus))));
+ 	else if (WIFSIGNALED(exitstatus))
+ #if defined(WIN32)
+ 		ereport(lev,
+ 
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "server process" */
+ 				(errmsg("%s (PID %d) was terminated by exception 0x%X",
+ 						procname, pid, WTERMSIG(exitstatus)),
+ 				 errhint("See C include file \"ntstatus.h\" for a description of the hexadecimal value.")));
+ #elif defined(HAVE_DECL_SYS_SIGLIST) && HAVE_DECL_SYS_SIGLIST
+ 	ereport(lev,
+ 
+ 	/*------
+ 	  translator: %s is a noun phrase describing a child process, such as
+ 	  "server process" */
+ 			(errmsg("%s (PID %d) was terminated by signal %d: %s",
+ 					procname, pid, WTERMSIG(exitstatus),
+ 					WTERMSIG(exitstatus) < NSIG ?
+ 					sys_siglist[WTERMSIG(exitstatus)] : "(unknown)")));
+ #else
+ 		ereport(lev,
+ 
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "server process" */
+ 				(errmsg("%s (PID %d) was terminated by signal %d",
+ 						procname, pid, WTERMSIG(exitstatus))));
+ #endif
+ 	else
+ 		ereport(lev,
+ 
+ 		/*------
+ 		  translator: %s is a noun phrase describing a child process, such as
+ 		  "server process" */
+ 				(errmsg("%s (PID %d) exited with unrecognized status %d",
+ 						procname, pid, exitstatus)));
+ }
+ 
+ /*
+  * Advance the postmaster's state machine and take actions as appropriate
+  *
+  * This is common code for pmdie(), reaper() and sigusr1_handler(), which
+  * receive the signals that might mean we need to change state.
+  */
+ static void
+ PostmasterStateMachine(void)
+ {
+ 	if (pmState == PM_WAIT_BACKUP)
+ 	{
+ 		/*
+ 		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
+ 		 */
+ 		if (!BackupInProgress())
+ 			pmState = PM_WAIT_BACKENDS;
+ 	}
+ 
+ 	if (pmState == PM_WAIT_READONLY)
+ 	{
+ 		/*
+ 		 * PM_WAIT_READONLY state ends when we have no regular backends that
+ 		 * have been started during recovery.  We kill the startup and
+ 		 * walreceiver processes and transition to PM_WAIT_BACKENDS.  Ideally,
+ 		 * we might like to kill these processes first and then wait for
+ 		 * backends to die off, but that doesn't work at present because
+ 		 * killing the startup process doesn't release its locks.
+ 		 */
+ 		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
+ 		{
+ 			if (StartupPID != 0)
+ 				signal_child(StartupPID, SIGTERM);
+ 			if (WalReceiverPID != 0)
+ 				signal_child(WalReceiverPID, SIGTERM);
+ 			pmState = PM_WAIT_BACKENDS;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * If we are in a state-machine state that implies waiting for backends to
+ 	 * exit, see if they're all gone, and change state if so.
+ 	 */
+ 	if (pmState == PM_WAIT_BACKENDS)
+ 	{
+ 		/*
+ 		 * PM_WAIT_BACKENDS state ends when we have no regular backends
+ 		 * (including autovac workers) and no walwriter or autovac launcher.
+ 		 * If we are doing crash recovery then we expect the bgwriter to exit
+ 		 * too, otherwise not.	The archiver, stats, and syslogger processes
+ 		 * are disregarded since they are not connected to shared memory; we
+ 		 * also disregard dead_end children here. Walsenders are also
+ 		 * disregarded; they will be terminated later after writing the
+ 		 * checkpoint record, like the archiver process.
+ 		 */
+ 		if (CountChildren(BACKEND_TYPE_NORMAL | BACKEND_TYPE_AUTOVAC) == 0 &&
+ 			StartupPID == 0 &&
+ 			WalReceiverPID == 0 &&
+ 			(BgWriterPID == 0 || !FatalError) &&
+ 			WalWriterPID == 0 &&
+ 			AutoVacPID == 0)
+ 		{
+ 			if (FatalError)
+ 			{
+ 				/*
+ 				 * Start waiting for dead_end children to die.	This state
+ 				 * change causes ServerLoop to stop creating new ones.
+ 				 */
+ 				pmState = PM_WAIT_DEAD_END;
+ 
+ 				/*
+ 				 * We already SIGQUIT'd the archiver and stats processes, if
+ 				 * any, when we entered FatalError state.
+ 				 */
+ 			}
+ 			else
+ 			{
+ 				/*
+ 				 * If we get here, we are proceeding with normal shutdown. All
+ 				 * the regular children are gone, and it's time to tell the
+ 				 * bgwriter to do a shutdown checkpoint.
+ 				 */
+ 				Assert(Shutdown > NoShutdown);
+ 				/* Start the bgwriter if not running */
+ 				if (BgWriterPID == 0)
+ 					BgWriterPID = StartBackgroundWriter();
+ 				/* And tell it to shut down */
+ 				if (BgWriterPID != 0)
+ 				{
+ 					signal_child(BgWriterPID, SIGUSR2);
+ 					pmState = PM_SHUTDOWN;
+ 				}
+ 				else
+ 				{
+ 					/*
+ 					 * If we failed to fork a bgwriter, just shut down. Any
+ 					 * required cleanup will happen at next restart. We set
+ 					 * FatalError so that an "abnormal shutdown" message gets
+ 					 * logged when we exit.
+ 					 */
+ 					FatalError = true;
+ 					pmState = PM_WAIT_DEAD_END;
+ 
+ 					/* Kill the walsenders, archiver and stats collector too */
+ 					SignalChildren(SIGQUIT);
+ 					if (PgArchPID != 0)
+ 						signal_child(PgArchPID, SIGQUIT);
+ 					if (PgStatPID != 0)
+ 						signal_child(PgStatPID, SIGQUIT);
+ 				}
+ 			}
+ 		}
+ 	}
+ 
+ 	if (pmState == PM_SHUTDOWN_2)
+ 	{
+ 		/*
+ 		 * PM_SHUTDOWN_2 state ends when there are no children other than
+ 		 * dead_end children left. There shouldn't be any regular backends
+ 		 * left by now anyway; what we're really waiting for is walsenders and
+ 		 * archiver.
+ 		 *
+ 		 * Walreceiver should normally be dead by now, but not when a fast
+ 		 * shutdown is performed during recovery.
+ 		 */
+ 		if (PgArchPID == 0 && CountChildren(BACKEND_TYPE_ALL) == 0 &&
+ 			WalReceiverPID == 0)
+ 		{
+ 			pmState = PM_WAIT_DEAD_END;
+ 		}
+ 	}
+ 
+ 	if (pmState == PM_WAIT_DEAD_END)
+ 	{
+ 		/*
+ 		 * PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
+ 		 * (ie, no dead_end children remain), and the archiver and stats
+ 		 * collector are gone too.
+ 		 *
+ 		 * The reason we wait for those two is to protect them against a new
+ 		 * postmaster starting conflicting subprocesses; this isn't an
+ 		 * ironclad protection, but it at least helps in the
+ 		 * shutdown-and-immediately-restart scenario.  Note that they have
+ 		 * already been sent appropriate shutdown signals, either during a
+ 		 * normal state transition leading up to PM_WAIT_DEAD_END, or during
+ 		 * FatalError processing.
+ 		 */
+ 		if (DLGetHead(BackendList) == NULL &&
+ 			PgArchPID == 0 && PgStatPID == 0)
+ 		{
+ 			/* These other guys should be dead already */
+ 			Assert(StartupPID == 0);
+ 			Assert(WalReceiverPID == 0);
+ 			Assert(BgWriterPID == 0);
+ 			Assert(WalWriterPID == 0);
+ 			Assert(AutoVacPID == 0);
+ 			/* syslogger is not considered here */
+ 			pmState = PM_NO_CHILDREN;
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * If we've been told to shut down, we exit as soon as there are no
+ 	 * remaining children.	If there was a crash, cleanup will occur at the
+ 	 * next startup.  (Before PostgreSQL 8.3, we tried to recover from the
+ 	 * crash before exiting, but that seems unwise if we are quitting because
+ 	 * we got SIGTERM from init --- there may well not be time for recovery
+ 	 * before init decides to SIGKILL us.)
+ 	 *
+ 	 * Note that the syslogger continues to run.  It will exit when it sees
+ 	 * EOF on its input pipe, which happens when there are no more upstream
+ 	 * processes.
+ 	 */
+ 	if (Shutdown > NoShutdown && pmState == PM_NO_CHILDREN)
+ 	{
+ 		if (FatalError)
+ 		{
+ 			ereport(LOG, (errmsg("abnormal database system shutdown")));
+ 			ExitPostmaster(1);
+ 		}
+ 		else
+ 		{
+ 			/*
+ 			 * Terminate backup mode to avoid recovery after a clean fast
+ 			 * shutdown.  Since a backup can only be taken during normal
+ 			 * running (and not, for example, while running under Hot Standby)
+ 			 * it only makes sense to do this if we reached normal running. If
+ 			 * we're still in recovery, the backup file is one we're
+ 			 * recovering *from*, and we must keep it around so that recovery
+ 			 * restarts from the right place.
+ 			 */
+ 			if (ReachedNormalRunning)
+ 				CancelBackup();
+ 
+ 			/* Normal exit from the postmaster is here */
+ 			ExitPostmaster(0);
+ 		}
+ 	}
+ 
+ 	/*
+ 	 * If recovery failed, or the user does not want an automatic restart
+ 	 * after backend crashes, wait for all non-syslogger children to exit, and
+ 	 * then exit postmaster. We don't try to reinitialize when recovery fails,
+ 	 * because more than likely it will just fail again and we will keep
+ 	 * trying forever.
+ 	 */
+ 	if (pmState == PM_NO_CHILDREN && (RecoveryError || !restart_after_crash))
+ 		ExitPostmaster(1);
+ 
+ 	/*
+ 	 * If we need to recover from a crash, wait for all non-syslogger children
+ 	 * to exit, then reset shmem and StartupDataBase.
+ 	 */
+ 	if (FatalError && pmState == PM_NO_CHILDREN)
+ 	{
+ 		ereport(LOG,
+ 				(errmsg("all server processes terminated; reinitializing")));
+ 
+ 		shmem_exit(1);
+ 		reset_shared(PostPortNumber);
+ 
+ 		StartupPID = StartupDataBase();
+ 		Assert(StartupPID != 0);
+ 		pmState = PM_STARTUP;
+ 	}
+ }
+ 
+ 
+ /*
+  * Send a signal to a postmaster child process
+  *
+  * On systems that have setsid(), each child process sets itself up as a
+  * process group leader.  For signals that are generally interpreted in the
+  * appropriate fashion, we signal the entire process group not just the
+  * direct child process.  This allows us to, for example, SIGQUIT a blocked
+  * archive_recovery script, or SIGINT a script being run by a backend via
+  * system().
+  *
+  * There is a race condition for recently-forked children: they might not
+  * have executed setsid() yet.	So we signal the child directly as well as
+  * the group.  We assume such a child will handle the signal before trying
+  * to spawn any grandchild processes.  We also assume that signaling the
+  * child twice will not cause any problems.
+  */
+ static void
+ signal_child(pid_t pid, int signal)
+ {
+ 	if (kill(pid, signal) < 0)
+ 		elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) pid, signal);
+ #ifdef HAVE_SETSID
+ 	switch (signal)
+ 	{
+ 		case SIGINT:
+ 		case SIGTERM:
+ 		case SIGQUIT:
+ 		case SIGSTOP:
+ 			if (kill(-pid, signal) < 0)
+ 				elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) (-pid), signal);
+ 			break;
+ 		default:
+ 			break;
+ 	}
+ #endif
+ }
+ 
+ /*
+  * Send a signal to the targeted children (but NOT special children;
+  * dead_end children are never signaled, either).
+  */
+ static bool
+ SignalSomeChildren(int signal, int target)
+ {
+ 	Dlelem	   *curr;
+ 	bool		signaled = false;
+ 
+ 	for (curr = DLGetHead(BackendList); curr; curr = DLGetSucc(curr))
+ 	{
+ 		Backend    *bp = (Backend *) DLE_VAL(curr);
+ 
+ 		if (bp->dead_end)
+ 			continue;
+ 
+ 		/*
+ 		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
+ 		 * it first and avoid touching shared memory for every child.
+ 		 */
+ 		if (target != BACKEND_TYPE_ALL)
+ 		{
+ 			int			child;
+ 
+ 			if (bp->is_autovacuum)
+ 				child = BACKEND_TYPE_AUTOVAC;
+ 			else if (IsPostmasterChildWalSender(bp->child_slot))
+ 				child = BACKEND_TYPE_WALSND;
+ 			else
+ 				child = BACKEND_TYPE_NORMAL;
+ 			if (!(target & child))
+ 				continue;
+ 		}
+ 
+ 		ereport(DEBUG4,
+ 				(errmsg_internal("sending signal %d to process %d",
+ 								 signal, (int) bp->pid)));
+ 		signal_child(bp->pid, signal);
+ 		signaled = true;
+ 	}
+ 	return signaled;
+ }
+ 
+ /*
+  * BackendStartup -- start backend process
+  *
+  * returns: STATUS_ERROR if the fork failed, STATUS_OK otherwise.
+  *
+  * Note: if you change this code, also consider StartAutovacuumWorker.
+  */
+ static int
+ BackendStartup(Port *port)
+ {
+ 	Backend    *bn;				/* for backend cleanup */
+ 	pid_t		pid;
+ 
+ 	/*
+ 	 * Create backend data structure.  Better before the fork() so we can
+ 	 * handle failure cleanly.
+ 	 */
+ 	bn = (Backend *) malloc(sizeof(Backend));
+ 	if (!bn)
+ 	{
+ 		ereport(LOG,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ 		return STATUS_ERROR;
+ 	}
+ 
+ 	/*
+ 	 * Compute the cancel key that will be assigned to this backend. The
+ 	 * backend will have its own copy in the forked-off process' value of
+ 	 * MyCancelKey, so that it can transmit the key to the frontend.
+ 	 */
+ 	MyCancelKey = PostmasterRandom();
+ 	bn->cancel_key = MyCancelKey;
+ 
+ 	/* Pass down canAcceptConnections state */
+ 	port->canAcceptConnections = canAcceptConnections();
+ 	bn->dead_end = (port->canAcceptConnections != CAC_OK &&
+ 					port->canAcceptConnections != CAC_WAITBACKUP);
+ 
+ 	/*
+ 	 * Unless it's a dead_end child, assign it a child slot number
+ 	 */
+ 	if (!bn->dead_end)
+ 		bn->child_slot = MyPMChildSlot = AssignPostmasterChildSlot();
+ 	else
+ 		bn->child_slot = 0;
+ 
+ #ifdef EXEC_BACKEND
+ 	pid = backend_forkexec(port);
+ #else							/* !EXEC_BACKEND */
+ 	pid = fork_process();
+ 	if (pid == 0)				/* child */
+ 	{
+ 		free(bn);
+ 
+ 		/*
+ 		 * Let's clean up ourselves as the postmaster child, and close the
+ 		 * postmaster's listen sockets.  (In EXEC_BACKEND case this is all
+ 		 * done in SubPostmasterMain.)
+ 		 */
+ 		IsUnderPostmaster = true;		/* we are a postmaster subprocess now */
+ 
+ 		MyProcPid = getpid();	/* reset MyProcPid */
+ 
+ 		MyStartTime = time(NULL);
+ 
+ 		/* We don't want the postmaster's proc_exit() handlers */
+ 		on_exit_reset();
+ 
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Perform additional initialization and collect startup packet */
+ 		BackendInitialize(port);
+ 
+ 		/* And run the backend */
+ 		proc_exit(BackendRun(port));
+ 	}
+ #endif   /* EXEC_BACKEND */
+ 
+ 	if (pid < 0)
+ 	{
+ 		/* in parent, fork failed */
+ 		int			save_errno = errno;
+ 
+ 		if (!bn->dead_end)
+ 			(void) ReleasePostmasterChildSlot(bn->child_slot);
+ 		free(bn);
+ 		errno = save_errno;
+ 		ereport(LOG,
+ 				(errmsg("could not fork new process for connection: %m")));
+ 		report_fork_failure_to_client(port, save_errno);
+ 		return STATUS_ERROR;
+ 	}
+ 
+ 	/* in parent, successful fork */
+ 	ereport(DEBUG2,
+ 			(errmsg_internal("forked new backend, pid=%d socket=%d",
+ 							 (int) pid, (int) port->sock)));
+ 
+ 	/*
+ 	 * Everything's been successful, it's safe to add this backend to our list
+ 	 * of backends.
+ 	 */
+ 	bn->pid = pid;
+ 	bn->is_autovacuum = false;
+ 	DLInitElem(&bn->elem, bn);
+ 	DLAddHead(BackendList, &bn->elem);
+ #ifdef EXEC_BACKEND
+ 	if (!bn->dead_end)
+ 		ShmemBackendArrayAdd(bn);
+ #endif
+ 
+ 	return STATUS_OK;
+ }
+ 
+ /*
+  * Try to report backend fork() failure to client before we close the
+  * connection.	Since we do not care to risk blocking the postmaster on
+  * this connection, we set the connection to non-blocking and try only once.
+  *
+  * This is grungy special-purpose code; we cannot use backend libpq since
+  * it's not up and running.
+  */
+ static void
+ report_fork_failure_to_client(Port *port, int errnum)
+ {
+ 	char		buffer[1000];
+ 	int			rc;
+ 
+ 	/* Format the error message packet (always V2 protocol) */
+ 	snprintf(buffer, sizeof(buffer), "E%s%s\n",
+ 			 _("could not fork new process for connection: "),
+ 			 strerror(errnum));
+ 
+ 	/* Set port to non-blocking.  Don't do send() if this fails */
+ 	if (!pg_set_noblock(port->sock))
+ 		return;
+ 
+ 	/* We'll retry after EINTR, but ignore all other failures */
+ 	do
+ 	{
+ 		rc = send(port->sock, buffer, strlen(buffer) + 1, 0);
+ 	} while (rc < 0 && errno == EINTR);
+ }
+ 
+ 
+ /*
+  * BackendInitialize -- initialize an interactive (postmaster-child)
+  *				backend process, and collect the client's startup packet.
+  *
+  * returns: nothing.  Will not return at all if there's any failure.
+  *
+  * Note: this code does not depend on having any access to shared memory.
+  * In the EXEC_BACKEND case, we are physically attached to shared memory
+  * but have not yet set up most of our local pointers to shmem structures.
+  */
+ static void
+ BackendInitialize(Port *port)
+ {
+ 	int			status;
+ 	char		remote_host[NI_MAXHOST];
+ 	char		remote_port[NI_MAXSERV];
+ 	char		remote_ps_data[NI_MAXHOST];
+ 
+ 	/* Save port etc. for ps status */
+ 	MyProcPort = port;
+ 
+ 	/*
+ 	 * PreAuthDelay is a debugging aid for investigating problems in the
+ 	 * authentication cycle: it can be set in postgresql.conf to allow time to
+ 	 * attach to the newly-forked backend with a debugger.	(See also
+ 	 * PostAuthDelay, which we allow clients to pass through PGOPTIONS, but it
+ 	 * is not honored until after authentication.)
+ 	 */
+ 	if (PreAuthDelay > 0)
+ 		pg_usleep(PreAuthDelay * 1000000L);
+ 
+ 	/* This flag will remain set until InitPostgres finishes authentication */
+ 	ClientAuthInProgress = true;	/* limit visibility of log messages */
+ 
+ 	/* save process start time */
+ 	port->SessionStartTime = GetCurrentTimestamp();
+ 	MyStartTime = timestamptz_to_time_t(port->SessionStartTime);
+ 
+ 	/* set these to empty in case they are needed before we set them up */
+ 	port->remote_host = "";
+ 	port->remote_port = "";
+ 
+ 	/*
+ 	 * Initialize libpq and enable reporting of ereport errors to the client.
+ 	 * Must do this now because authentication uses libpq to send messages.
+ 	 */
+ 	pq_init();					/* initialize libpq to talk to client */
+ 	whereToSendOutput = DestRemote;		/* now safe to ereport to client */
+ 
+ 	/*
+ 	 * If possible, make this process a group leader, so that the postmaster
+ 	 * can signal any child processes too.	(We do this now on the off chance
+ 	 * that something might spawn a child process during authentication.)
+ 	 */
+ #ifdef HAVE_SETSID
+ 	if (setsid() < 0)
+ 		elog(FATAL, "setsid() failed: %m");
+ #endif
+ 
+ 	/*
+ 	 * We arrange for a simple exit(1) if we receive SIGTERM or SIGQUIT or
+ 	 * timeout while trying to collect the startup packet.	Otherwise the
+ 	 * postmaster cannot shut down the database FAST or IMMED cleanly if a
+ 	 * buggy client fails to send the packet promptly.
+ 	 */
+ 	pqsignal(SIGTERM, startup_die);
+ 	pqsignal(SIGQUIT, startup_die);
+ 	pqsignal(SIGALRM, startup_die);
+ 	PG_SETMASK(&StartupBlockSig);
+ 
+ 	/*
+ 	 * Get the remote host name and port for logging and status display.
+ 	 */
+ 	remote_host[0] = '\0';
+ 	remote_port[0] = '\0';
+ 	if (pg_getnameinfo_all(&port->raddr.addr, port->raddr.salen,
+ 						   remote_host, sizeof(remote_host),
+ 						   remote_port, sizeof(remote_port),
+ 					   (log_hostname ? 0 : NI_NUMERICHOST) | NI_NUMERICSERV) != 0)
+ 	{
+ 		int			ret = pg_getnameinfo_all(&port->raddr.addr, port->raddr.salen,
+ 											 remote_host, sizeof(remote_host),
+ 											 remote_port, sizeof(remote_port),
+ 											 NI_NUMERICHOST | NI_NUMERICSERV);
+ 
+ 		if (ret != 0)
+ 			ereport(WARNING,
+ 					(errmsg_internal("pg_getnameinfo_all() failed: %s",
+ 									 gai_strerror(ret))));
+ 	}
+ 	if (remote_port[0] == '\0')
+ 		snprintf(remote_ps_data, sizeof(remote_ps_data), "%s", remote_host);
+ 	else
+ 		snprintf(remote_ps_data, sizeof(remote_ps_data), "%s(%s)", remote_host, remote_port);
+ 
+ 	if (Log_connections)
+ 	{
+ 		if (remote_port[0])
+ 			ereport(LOG,
+ 					(errmsg("connection received: host=%s port=%s",
+ 							remote_host,
+ 							remote_port)));
+ 		else
+ 			ereport(LOG,
+ 					(errmsg("connection received: host=%s",
+ 							remote_host)));
+ 	}
+ 
+ 	/*
+ 	 * save remote_host and remote_port in port structure
+ 	 */
+ 	port->remote_host = strdup(remote_host);
+ 	port->remote_port = strdup(remote_port);
+ 	if (log_hostname)
+ 		port->remote_hostname = port->remote_host;
+ 
+ 	/*
+ 	 * Ready to begin client interaction.  We will give up and exit(1) after a
+ 	 * time delay, so that a broken client can't hog a connection
+ 	 * indefinitely.  PreAuthDelay and any DNS interactions above don't count
+ 	 * against the time limit.
+ 	 */
+ 	if (!enable_sig_alarm(AuthenticationTimeout * 1000, false))
+ 		elog(FATAL, "could not set timer for startup packet timeout");
+ 
+ 	/*
+ 	 * Receive the startup packet (which might turn out to be a cancel request
+ 	 * packet).
+ 	 */
+ 	status = ProcessStartupPacket(port, false);
+ 
+ 	/*
+ 	 * Stop here if it was bad or a cancel packet.	ProcessStartupPacket
+ 	 * already did any appropriate error reporting.
+ 	 */
+ 	if (status != STATUS_OK)
+ 		proc_exit(0);
+ 
+ 	/*
+ 	 * Now that we have the user and database name, we can set the process
+ 	 * title for ps.  It's good to do this as early as possible in startup.
+ 	 *
+ 	 * For a walsender, the ps display is set in the following form:
+ 	 *
+ 	 * postgres: wal sender process <user> <host> <activity>
+ 	 *
+ 	 * To achieve that, we pass "wal sender process" as username and username
+ 	 * as dbname to init_ps_display(). XXX: should add a new variant of
+ 	 * init_ps_display() to avoid abusing the parameters like this.
+ 	 */
+ 	if (am_walsender)
+ 		init_ps_display("wal sender process", port->user_name, remote_ps_data,
+ 						update_process_title ? "authentication" : "");
+ 	else
+ 		init_ps_display(port->user_name, port->database_name, remote_ps_data,
+ 						update_process_title ? "authentication" : "");
+ 
+ 	/*
+ 	 * Disable the timeout, and prevent SIGTERM/SIGQUIT again.
+ 	 */
+ 	if (!disable_sig_alarm(false))
+ 		elog(FATAL, "could not disable timer for startup packet timeout");
+ 	PG_SETMASK(&BlockSig);
+ }
+ 
+ 
+ /*
+  * BackendRun -- set up the backend's argument list and invoke PostgresMain()
+  *
+  * returns:
+  *		Shouldn't return at all.
+  *		If PostgresMain() fails, return status.
+  */
+ static int
+ BackendRun(Port *port)
+ {
+ 	char	  **av;
+ 	int			maxac;
+ 	int			ac;
+ 	long		secs;
+ 	int			usecs;
+ 	int			i;
+ 
+ 	/*
+ 	 * Don't want backend to be able to see the postmaster random number
+ 	 * generator state.  We have to clobber the static random_seed *and* start
+ 	 * a new random sequence in the random() library function.
+ 	 */
+ 	random_seed = 0;
+ 	random_start_time.tv_usec = 0;
+ 	/* slightly hacky way to get integer microseconds part of timestamptz */
+ 	TimestampDifference(0, port->SessionStartTime, &secs, &usecs);
+ 	srandom((unsigned int) (MyProcPid ^ usecs));
+ 
+ 	/*
+ 	 * Now, build the argv vector that will be given to PostgresMain.
+ 	 *
+ 	 * The maximum possible number of commandline arguments that could come
+ 	 * from ExtraOptions is (strlen(ExtraOptions) + 1) / 2; see
+ 	 * pg_split_opts().
+ 	 */
+ 	maxac = 5;					/* for fixed args supplied below */
+ 	maxac += (strlen(ExtraOptions) + 1) / 2;
+ 
+ 	av = (char **) MemoryContextAlloc(TopMemoryContext,
+ 									  maxac * sizeof(char *));
+ 	ac = 0;
+ 
+ 	av[ac++] = "postgres";
+ 
+ 	/*
+ 	 * Pass any backend switches specified with -o on the postmaster's own
+ 	 * command line.  We assume these are secure.  (It's OK to mangle
+ 	 * ExtraOptions now, since we're safely inside a subprocess.)
+ 	 */
+ 	pg_split_opts(av, &ac, ExtraOptions);
+ 
+ 	/*
+ 	 * Tell the backend which database to use.
+ 	 */
+ 	av[ac++] = port->database_name;
+ 
+ 	av[ac] = NULL;
+ 
+ 	Assert(ac < maxac);
+ 
+ 	/*
+ 	 * Debug: print arguments being passed to backend
+ 	 */
+ 	ereport(DEBUG3,
+ 			(errmsg_internal("%s child[%d]: starting with (",
+ 							 progname, (int) getpid())));
+ 	for (i = 0; i < ac; ++i)
+ 		ereport(DEBUG3,
+ 				(errmsg_internal("\t%s", av[i])));
+ 	ereport(DEBUG3,
+ 			(errmsg_internal(")")));
+ 
+ 	/*
+ 	 * Make sure we aren't in PostmasterContext anymore.  (We can't delete it
+ 	 * just yet, though, because InitPostgres will need the HBA data.)
+ 	 */
+ 	MemoryContextSwitchTo(TopMemoryContext);
+ 
+ 	return (PostgresMain(ac, av, port->user_name));
+ }
+ 
+ 
+ #ifdef EXEC_BACKEND
+ 
+ /*
+  * postmaster_forkexec -- fork and exec a postmaster subprocess
+  *
+  * The caller must have set up the argv array already, except for argv[2]
+  * which will be filled with the name of the temp variable file.
+  *
+  * Returns the child process PID, or -1 on fork failure (a suitable error
+  * message has been logged on failure).
+  *
+  * All uses of this routine will dispatch to SubPostmasterMain in the
+  * child process.
+  */
+ pid_t
+ postmaster_forkexec(int argc, char *argv[])
+ {
+ 	Port		port;
+ 
+ 	/* This entry point passes dummy values for the Port variables */
+ 	memset(&port, 0, sizeof(port));
+ 	return internal_forkexec(argc, argv, &port);
+ }
+ 
+ /*
+  * backend_forkexec -- fork/exec off a backend process
+  *
+  * Some operating systems (WIN32) don't have fork() so we have to simulate
+  * it by storing parameters that need to be passed to the child and
+  * then create a new child process.
+  *
+  * returns the pid of the fork/exec'd process, or -1 on failure
+  */
+ static pid_t
+ backend_forkexec(Port *port)
+ {
+ 	char	   *av[4];
+ 	int			ac = 0;
+ 
+ 	av[ac++] = "postgres";
+ 	av[ac++] = "--forkbackend";
+ 	av[ac++] = NULL;			/* filled in by internal_forkexec */
+ 
+ 	av[ac] = NULL;
+ 	Assert(ac < lengthof(av));
+ 
+ 	return internal_forkexec(ac, av, port);
+ }
+ 
+ #ifndef WIN32
+ 
+ /*
+  * internal_forkexec non-win32 implementation
+  *
+  * - writes out backend variables to the parameter file
+  * - fork():s, and then exec():s the child process
+  */
+ static pid_t
+ internal_forkexec(int argc, char *argv[], Port *port)
+ {
+ 	static unsigned long tmpBackendFileNum = 0;
+ 	pid_t		pid;
+ 	char		tmpfilename[MAXPGPATH];
+ 	BackendParameters param;
+ 	FILE	   *fp;
+ 
+ 	if (!save_backend_variables(&param, port))
+ 		return -1;				/* log made by save_backend_variables */
+ 
+ 	/* Calculate name for temp file */
+ 	snprintf(tmpfilename, MAXPGPATH, "%s/%s.backend_var.%d.%lu",
+ 			 PG_TEMP_FILES_DIR, PG_TEMP_FILE_PREFIX,
+ 			 MyProcPid, ++tmpBackendFileNum);
+ 
+ 	/* Open file */
+ 	fp = AllocateFile(tmpfilename, PG_BINARY_W);
+ 	if (!fp)
+ 	{
+ 		/* As in OpenTemporaryFile, try to make the temp-file directory */
+ 		mkdir(PG_TEMP_FILES_DIR, S_IRWXU);
+ 
+ 		fp = AllocateFile(tmpfilename, PG_BINARY_W);
+ 		if (!fp)
+ 		{
+ 			ereport(LOG,
+ 					(errcode_for_file_access(),
+ 					 errmsg("could not create file \"%s\": %m",
+ 							tmpfilename)));
+ 			return -1;
+ 		}
+ 	}
+ 
+ 	if (fwrite(&param, sizeof(param), 1, fp) != 1)
+ 	{
+ 		ereport(LOG,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", tmpfilename)));
+ 		FreeFile(fp);
+ 		return -1;
+ 	}
+ 
+ 	/* Release file */
+ 	if (FreeFile(fp))
+ 	{
+ 		ereport(LOG,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not write to file \"%s\": %m", tmpfilename)));
+ 		return -1;
+ 	}
+ 
+ 	/* Make sure caller set up argv properly */
+ 	Assert(argc >= 3);
+ 	Assert(argv[argc] == NULL);
+ 	Assert(strncmp(argv[1], "--fork", 6) == 0);
+ 	Assert(argv[2] == NULL);
+ 
+ 	/* Insert temp file name after --fork argument */
+ 	argv[2] = tmpfilename;
+ 
+ 	/* Fire off execv in child */
+ 	if ((pid = fork_process()) == 0)
+ 	{
+ 		if (execv(postgres_exec_path, argv) < 0)
+ 		{
+ 			ereport(LOG,
+ 					(errmsg("could not execute server process \"%s\": %m",
+ 							postgres_exec_path)));
+ 			/* We're already in the child process here, can't return */
+ 			exit(1);
+ 		}
+ 	}
+ 
+ 	return pid;					/* Parent returns pid, or -1 on fork failure */
+ }
+ #else							/* WIN32 */
+ 
+ /*
+  * internal_forkexec win32 implementation
+  *
+  * - starts backend using CreateProcess(), in suspended state
+  * - writes out backend variables to the parameter file
+  *	- during this, duplicates handles and sockets required for
+  *	  inheritance into the new process
+  * - resumes execution of the new process once the backend parameter
+  *	 file is complete.
+  */
+ static pid_t
+ internal_forkexec(int argc, char *argv[], Port *port)
+ {
+ 	STARTUPINFO si;
+ 	PROCESS_INFORMATION pi;
+ 	int			i;
+ 	int			j;
+ 	char		cmdLine[MAXPGPATH * 2];
+ 	HANDLE		paramHandle;
+ 	BackendParameters *param;
+ 	SECURITY_ATTRIBUTES sa;
+ 	char		paramHandleStr[32];
+ 	win32_deadchild_waitinfo *childinfo;
+ 
+ 	/* Make sure caller set up argv properly */
+ 	Assert(argc >= 3);
+ 	Assert(argv[argc] == NULL);
+ 	Assert(strncmp(argv[1], "--fork", 6) == 0);
+ 	Assert(argv[2] == NULL);
+ 
+ 	/* Set up shared memory for parameter passing */
+ 	ZeroMemory(&sa, sizeof(sa));
+ 	sa.nLength = sizeof(sa);
+ 	sa.bInheritHandle = TRUE;
+ 	paramHandle = CreateFileMapping(INVALID_HANDLE_VALUE,
+ 									&sa,
+ 									PAGE_READWRITE,
+ 									0,
+ 									sizeof(BackendParameters),
+ 									NULL);
+ 	if (paramHandle == NULL)	/* CreateFileMapping returns NULL on failure */
+ 	{
+ 		elog(LOG, "could not create backend parameter file mapping: error code %lu",
+ 			 GetLastError());
+ 		return -1;
+ 	}
+ 
+ 	param = MapViewOfFile(paramHandle, FILE_MAP_WRITE, 0, 0, sizeof(BackendParameters));
+ 	if (!param)
+ 	{
+ 		elog(LOG, "could not map backend parameter memory: error code %lu",
+ 			 GetLastError());
+ 		CloseHandle(paramHandle);
+ 		return -1;
+ 	}
+ 
+ 	/* Insert the parameter-file handle after the --fork argument */
+ #ifdef _WIN64
+ 	sprintf(paramHandleStr, "%llu", (LONG_PTR) paramHandle);
+ #else
+ 	sprintf(paramHandleStr, "%lu", (DWORD) paramHandle);
+ #endif
+ 	argv[2] = paramHandleStr;
+ 
+ 	/* Format the cmd line */
+ 	cmdLine[sizeof(cmdLine) - 1] = '\0';
+ 	cmdLine[sizeof(cmdLine) - 2] = '\0';
+ 	snprintf(cmdLine, sizeof(cmdLine) - 1, "\"%s\"", postgres_exec_path);
+ 	i = 0;
+ 	while (argv[++i] != NULL)
+ 	{
+ 		j = strlen(cmdLine);
+ 		snprintf(cmdLine + j, sizeof(cmdLine) - 1 - j, " \"%s\"", argv[i]);
+ 	}
+ 	if (cmdLine[sizeof(cmdLine) - 2] != '\0')
+ 	{
+ 		elog(LOG, "subprocess command line too long");
+ 		return -1;
+ 	}
+ 
+ 	memset(&pi, 0, sizeof(pi));
+ 	memset(&si, 0, sizeof(si));
+ 	si.cb = sizeof(si);
+ 
+ 	/*
+ 	 * Create the subprocess in a suspended state. This will be resumed later,
+ 	 * once we have written out the parameter file.
+ 	 */
+ 	if (!CreateProcess(NULL, cmdLine, NULL, NULL, TRUE, CREATE_SUSPENDED,
+ 					   NULL, NULL, &si, &pi))
+ 	{
+ 		elog(LOG, "CreateProcess call failed: %m (error code %lu)",
+ 			 GetLastError());
+ 		return -1;
+ 	}
+ 
+ 	if (!save_backend_variables(param, port, pi.hProcess, pi.dwProcessId))
+ 	{
+ 		/*
+ 		 * log made by save_backend_variables, but we have to clean up the
+ 		 * mess with the half-started process
+ 		 */
+ 		if (!TerminateProcess(pi.hProcess, 255))
+ 			ereport(LOG,
+ 					(errmsg_internal("could not terminate unstarted process: error code %lu",
+ 									 GetLastError())));
+ 		CloseHandle(pi.hProcess);
+ 		CloseHandle(pi.hThread);
+ 		return -1;				/* log made by save_backend_variables */
+ 	}
+ 
+ 	/* Drop the parameter shared memory that is now inherited by the backend */
+ 	if (!UnmapViewOfFile(param))
+ 		elog(LOG, "could not unmap view of backend parameter file: error code %lu",
+ 			 GetLastError());
+ 	if (!CloseHandle(paramHandle))
+ 		elog(LOG, "could not close handle to backend parameter file: error code %lu",
+ 			 GetLastError());
+ 
+ 	/*
+ 	 * Reserve the memory region used by our main shared memory segment before
+ 	 * we resume the child process.
+ 	 */
+ 	if (!pgwin32_ReserveSharedMemoryRegion(pi.hProcess))
+ 	{
+ 		/*
+ 		 * Failed to reserve the memory, so terminate the newly created
+ 		 * process and give up.
+ 		 */
+ 		if (!TerminateProcess(pi.hProcess, 255))
+ 			ereport(LOG,
+ 					(errmsg_internal("could not terminate process that failed to reserve memory: error code %lu",
+ 									 GetLastError())));
+ 		CloseHandle(pi.hProcess);
+ 		CloseHandle(pi.hThread);
+ 		return -1;				/* logging done by
+ 								 * pgwin32_ReserveSharedMemoryRegion() */
+ 	}
+ 
+ 	/*
+ 	 * Now that the backend variables are written out, we start the child
+ 	 * thread so it can start initializing while we set up the rest of the
+ 	 * parent state.
+ 	 */
+ 	if (ResumeThread(pi.hThread) == -1)
+ 	{
+ 		if (!TerminateProcess(pi.hProcess, 255))
+ 		{
+ 			ereport(LOG,
+ 					(errmsg_internal("could not terminate unstartable process: error code %lu",
+ 									 GetLastError())));
+ 			CloseHandle(pi.hProcess);
+ 			CloseHandle(pi.hThread);
+ 			return -1;
+ 		}
+ 		CloseHandle(pi.hProcess);
+ 		CloseHandle(pi.hThread);
+ 		ereport(LOG,
+ 				(errmsg_internal("could not resume thread of unstarted process: error code %lu",
+ 								 GetLastError())));
+ 		return -1;
+ 	}
+ 
+ 	/*
+ 	 * Queue a waiter to signal when this child dies. The wait will be
+ 	 * handled automatically by an operating system thread pool.
+ 	 *
+ 	 * Note: use malloc instead of palloc, since it needs to be thread-safe.
+ 	 * Struct will be free():d from the callback function that runs on a
+ 	 * different thread.
+ 	 */
+ 	childinfo = malloc(sizeof(win32_deadchild_waitinfo));
+ 	if (!childinfo)
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ 
+ 	childinfo->procHandle = pi.hProcess;
+ 	childinfo->procId = pi.dwProcessId;
+ 
+ 	if (!RegisterWaitForSingleObject(&childinfo->waitHandle,
+ 									 pi.hProcess,
+ 									 pgwin32_deadchild_callback,
+ 									 childinfo,
+ 									 INFINITE,
+ 								WT_EXECUTEONLYONCE | WT_EXECUTEINWAITTHREAD))
+ 		ereport(FATAL,
+ 		(errmsg_internal("could not register process for wait: error code %lu",
+ 						 GetLastError())));
+ 
+ 	/* Don't close pi.hProcess here - the wait thread needs access to it */
+ 
+ 	CloseHandle(pi.hThread);
+ 
+ 	return pi.dwProcessId;
+ }
+ #endif   /* WIN32 */
+ 
+ 
+ /*
+  * SubPostmasterMain -- Get the fork/exec'd process into a state equivalent
+  *			to what it would be if we'd simply forked on Unix, and then
+  *			dispatch to the appropriate place.
+  *
+  * The first two command line arguments are expected to be "--forkFOO"
+  * (where FOO indicates which postmaster child we are to become), and
+  * the name of a variables file that we can read to load data that would
+  * have been inherited by fork() on Unix.  Remaining arguments go to the
+  * subprocess FooMain() routine.
+  */
+ int
+ SubPostmasterMain(int argc, char *argv[])
+ {
+ 	Port		port;
+ 
+ 	/* Do this sooner rather than later... */
+ 	IsUnderPostmaster = true;	/* we are a postmaster subprocess now */
+ 
+ 	MyProcPid = getpid();		/* reset MyProcPid */
+ 
+ 	MyStartTime = time(NULL);
+ 
+ 	/*
+ 	 * make sure stderr is in binary mode before anything can possibly be
+ 	 * written to it, in case it's actually the syslogger pipe, so the pipe
+ 	 * chunking protocol isn't disturbed. Non-logpipe data gets translated on
+ 	 * redirection (e.g. via pg_ctl -l) anyway.
+ 	 */
+ #ifdef WIN32
+ 	_setmode(fileno(stderr), _O_BINARY);
+ #endif
+ 
+ 	/* Lose the postmaster's on-exit routines (really a no-op) */
+ 	on_exit_reset();
+ 
+ 	/* In EXEC_BACKEND case we will not have inherited these settings */
+ 	IsPostmasterEnvironment = true;
+ 	whereToSendOutput = DestNone;
+ 
+ 	/* Setup essential subsystems (to ensure elog() behaves sanely) */
+ 	MemoryContextInit();
+ 	InitializeGUCOptions();
+ 
+ 	/* Read in the variables file */
+ 	memset(&port, 0, sizeof(Port));
+ 	read_backend_variables(argv[2], &port);
+ 
+ 	/*
+ 	 * Set up memory area for GSS information. Mirrors the code in ConnCreate
+ 	 * for the non-exec case.
+ 	 */
+ #if defined(ENABLE_GSS) || defined(ENABLE_SSPI)
+ 	port.gss = (pg_gssinfo *) calloc(1, sizeof(pg_gssinfo));
+ 	if (!port.gss)
+ 		ereport(FATAL,
+ 				(errcode(ERRCODE_OUT_OF_MEMORY),
+ 				 errmsg("out of memory")));
+ #endif
+ 
+ 	/* Check we got appropriate args */
+ 	if (argc < 3)
+ 		elog(FATAL, "invalid subpostmaster invocation");
+ 
+ 	/*
+ 	 * If appropriate, physically re-attach to shared memory segment. We want
+ 	 * to do this before going any further to ensure that we can attach at the
+ 	 * same address the postmaster used.
+ 	 */
+ 	if (strcmp(argv[1], "--forkbackend") == 0 ||
+ 		strcmp(argv[1], "--forkavlauncher") == 0 ||
+ 		strcmp(argv[1], "--forkavworker") == 0 ||
+ 		strcmp(argv[1], "--forkboot") == 0)
+ 		PGSharedMemoryReAttach();
+ 
+ 	/* autovacuum needs this set before calling InitProcess */
+ 	if (strcmp(argv[1], "--forkavlauncher") == 0)
+ 		AutovacuumLauncherIAm();
+ 	if (strcmp(argv[1], "--forkavworker") == 0)
+ 		AutovacuumWorkerIAm();
+ 
+ 	/*
+ 	 * Start our win32 signal implementation. This has to be done after we
+ 	 * read the backend variables, because we need to pick up the signal pipe
+ 	 * from the parent process.
+ 	 */
+ #ifdef WIN32
+ 	pgwin32_signal_initialize();
+ #endif
+ 
+ 	/* In EXEC_BACKEND case we will not have inherited these settings */
+ 	pqinitmask();
+ 	PG_SETMASK(&BlockSig);
+ 
+ 	/* Read in remaining GUC variables */
+ 	read_nondefault_variables();
+ 
+ 	/*
+ 	 * Reload any libraries that were preloaded by the postmaster.	Since we
+ 	 * exec'd this process, those libraries didn't come along with us; but we
+ 	 * should load them into all child processes to be consistent with the
+ 	 * non-EXEC_BACKEND behavior.
+ 	 */
+ 	process_shared_preload_libraries();
+ 
+ 	/* Run backend or appropriate child */
+ 	if (strcmp(argv[1], "--forkbackend") == 0)
+ 	{
+ 		Assert(argc == 3);		/* shouldn't be any more args */
+ 
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/*
+ 		 * Need to reinitialize the SSL library in the backend, since the
+ 		 * context structures contain function pointers and cannot be passed
+ 		 * through the parameter file.
+ 		 *
+ 		 * XXX should we do this in all child processes?  For the moment it's
+ 		 * enough to do it in backend children.
+ 		 */
+ #ifdef USE_SSL
+ 		if (EnableSSL)
+ 			secure_initialize();
+ #endif
+ 
+ 		/*
+ 		 * Perform additional initialization and collect startup packet.
+ 		 *
+ 		 * We want to do this before InitProcess() for a couple of reasons: 1.
+ 		 * so that we aren't eating up a PGPROC slot while waiting on the
+ 		 * client. 2. so that if InitProcess() fails due to being out of
+ 		 * PGPROC slots, we have already initialized libpq and are able to
+ 		 * report the error to the client.
+ 		 */
+ 		BackendInitialize(&port);
+ 
+ 		/* Restore basic shared memory pointers */
+ 		InitShmemAccess(UsedShmemSegAddr);
+ 
+ 		/* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
+ 		InitProcess();
+ 
+ 		/*
+ 		 * Attach process to shared data structures.  If testing EXEC_BACKEND
+ 		 * on Linux, you must run this as root before starting the postmaster:
+ 		 *
+ 		 * echo 0 >/proc/sys/kernel/randomize_va_space
+ 		 *
+ 		 * This prevents a randomized stack base address that causes child
+ 		 * shared memory to be at a different address than the parent, making
+ 		 * it impossible to attach to shared memory.	Return the value to
+ 		 * '1' when finished.
+ 		 */
+ 		CreateSharedMemoryAndSemaphores(false, 0);
+ 
+ 		/* And run the backend */
+ 		proc_exit(BackendRun(&port));
+ 	}
+ 	if (strcmp(argv[1], "--forkboot") == 0)
+ 	{
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Restore basic shared memory pointers */
+ 		InitShmemAccess(UsedShmemSegAddr);
+ 
+ 		/* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
+ 		InitAuxiliaryProcess();
+ 
+ 		/* Attach process to shared data structures */
+ 		CreateSharedMemoryAndSemaphores(false, 0);
+ 
+ 		AuxiliaryProcessMain(argc - 2, argv + 2);
+ 		proc_exit(0);
+ 	}
+ 	if (strcmp(argv[1], "--forkavlauncher") == 0)
+ 	{
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Restore basic shared memory pointers */
+ 		InitShmemAccess(UsedShmemSegAddr);
+ 
+ 		/* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
+ 		InitProcess();
+ 
+ 		/* Attach process to shared data structures */
+ 		CreateSharedMemoryAndSemaphores(false, 0);
+ 
+ 		AutoVacLauncherMain(argc - 2, argv + 2);
+ 		proc_exit(0);
+ 	}
+ 	if (strcmp(argv[1], "--forkavworker") == 0)
+ 	{
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Restore basic shared memory pointers */
+ 		InitShmemAccess(UsedShmemSegAddr);
+ 
+ 		/* Need a PGPROC to run CreateSharedMemoryAndSemaphores */
+ 		InitProcess();
+ 
+ 		/* Attach process to shared data structures */
+ 		CreateSharedMemoryAndSemaphores(false, 0);
+ 
+ 		AutoVacWorkerMain(argc - 2, argv + 2);
+ 		proc_exit(0);
+ 	}
+ 	if (strcmp(argv[1], "--forkarch") == 0)
+ 	{
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Do not want to attach to shared memory */
+ 
+ 		PgArchiverMain(argc, argv);
+ 		proc_exit(0);
+ 	}
+ 	if (strcmp(argv[1], "--forkcol") == 0)
+ 	{
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Do not want to attach to shared memory */
+ 
+ 		PgstatCollectorMain(argc, argv);
+ 		proc_exit(0);
+ 	}
+ 	if (strcmp(argv[1], "--forklog") == 0)
+ 	{
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(true);
+ 
+ 		/* Do not want to attach to shared memory */
+ 
+ 		SysLoggerMain(argc, argv);
+ 		proc_exit(0);
+ 	}
+ 
+ 	return 1;					/* shouldn't get here */
+ }
+ #endif   /* EXEC_BACKEND */
+ 
+ 
+ /*
+  * ExitPostmaster -- cleanup
+  *
+  * Do NOT call exit() directly --- always go through here!
+  */
+ static void
+ ExitPostmaster(int status)
+ {
+ 	/* should cleanup shared memory and kill all backends */
+ 
+ 	/*
+ 	 * Not sure of the semantics here.	When the Postmaster dies, should the
+ 	 * backends all be killed? probably not.
+ 	 *
+ 	 * MUST		-- vadim 05-10-1999
+ 	 */
+ 
+ 	proc_exit(status);
+ }
+ 
+ /*
+  * sigusr1_handler - handle signal conditions from child processes
+  */
+ static void
+ sigusr1_handler(SIGNAL_ARGS)
+ {
+ 	int			save_errno = errno;
+ 
+ 	PG_SETMASK(&BlockSig);
+ 
+ 	/*
+ 	 * RECOVERY_STARTED and BEGIN_HOT_STANDBY signals are ignored in
+ 	 * unexpected states. If the startup process quickly starts up, completes
+ 	 * recovery, and exits, we might process the death of the startup process
+ 	 * first. We don't want to go back to recovery in that case.
+ 	 */
+ 	if (CheckPostmasterSignal(PMSIGNAL_RECOVERY_STARTED) &&
+ 		pmState == PM_STARTUP)
+ 	{
+ 		/* WAL redo has started. We're out of reinitialization. */
+ 		FatalError = false;
+ 
+ 		/*
+ 		 * Crank up the background writer.	It doesn't matter if this fails,
+ 		 * we'll just try again later.
+ 		 */
+ 		Assert(BgWriterPID == 0);
+ 		BgWriterPID = StartBackgroundWriter();
+ 
+ 		pmState = PM_RECOVERY;
+ 	}
+ 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
+ 		pmState == PM_RECOVERY)
+ 	{
+ 		/*
+ 		 * Likewise, start other special children as needed.
+ 		 */
+ 		Assert(PgStatPID == 0);
+ 		PgStatPID = pgstat_start();
+ 
+ 		ereport(LOG,
+ 		(errmsg("database system is ready to accept read only connections")));
+ 
+ 		pmState = PM_HOT_STANDBY;
+ 	}
+ 
+ 	if (CheckPostmasterSignal(PMSIGNAL_WAKEN_ARCHIVER) &&
+ 		PgArchPID != 0)
+ 	{
+ 		/*
+ 		 * Send SIGUSR1 to archiver process, to wake it up and begin archiving
+ 		 * next transaction log file.
+ 		 */
+ 		signal_child(PgArchPID, SIGUSR1);
+ 	}
+ 
+ 	if (CheckPostmasterSignal(PMSIGNAL_ROTATE_LOGFILE) &&
+ 		SysLoggerPID != 0)
+ 	{
+ 		/* Tell syslogger to rotate logfile */
+ 		signal_child(SysLoggerPID, SIGUSR1);
+ 	}
+ 
+ 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER))
+ 	{
+ 		/*
+ 		 * Start one iteration of the autovacuum daemon, even if autovacuuming
+ 		 * is nominally not enabled.  This is so we can have an active defense
+ 		 * against transaction ID wraparound.  We set a flag for the main loop
+ 		 * to do it rather than trying to do it here --- this is because the
+ 		 * autovac process itself may send the signal, and we want to handle
+ 		 * that by launching another iteration as soon as the current one
+ 		 * completes.
+ 		 */
+ 		start_autovac_launcher = true;
+ 	}
+ 
+ 	if (CheckPostmasterSignal(PMSIGNAL_START_AUTOVAC_WORKER))
+ 	{
+ 		/* The autovacuum launcher wants us to start a worker process. */
+ 		StartAutovacuumWorker();
+ 	}
+ 
+ 	if (CheckPostmasterSignal(PMSIGNAL_START_WALRECEIVER) &&
+ 		WalReceiverPID == 0 &&
+ 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+ 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))
+ 	{
+ 		/* Startup Process wants us to start the walreceiver process. */
+ 		WalReceiverPID = StartWalReceiver();
+ 	}
+ 
+ 	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE) &&
+ 		(pmState == PM_WAIT_BACKUP || pmState == PM_WAIT_BACKENDS))
+ 	{
+ 		/* Advance postmaster's state machine */
+ 		PostmasterStateMachine();
+ 	}
+ 
+ 	if (CheckPromoteSignal() && StartupPID != 0 &&
+ 		(pmState == PM_STARTUP || pmState == PM_RECOVERY ||
+ 		 pmState == PM_HOT_STANDBY || pmState == PM_WAIT_READONLY))
+ 	{
+ 		/* Tell startup process to finish recovery */
+ 		signal_child(StartupPID, SIGUSR2);
+ 	}
+ 
+ 	PG_SETMASK(&UnBlockSig);
+ 
+ 	errno = save_errno;
+ }
+ 
+ /*
+  * Timeout or shutdown signal from postmaster while processing startup packet.
+  * Cleanup and exit(1).
+  *
+  * XXX: possible future improvement: try to send a message indicating
+  * why we are disconnecting.  Problem is to be sure we don't block while
+  * doing so, nor mess up SSL initialization.  In practice, if the client
+  * has wedged here, it probably couldn't do anything with the message anyway.
+  */
+ static void
+ startup_die(SIGNAL_ARGS)
+ {
+ 	proc_exit(1);
+ }
+ 
+ /*
+  * Dummy signal handler
+  *
+  * We use this for signals that we don't actually use in the postmaster,
+  * but we do use in backends.  If we were to SIG_IGN such signals in the
+  * postmaster, then a newly started backend might drop a signal that arrives
+  * before it's able to reconfigure its signal processing.  (See notes in
+  * tcop/postgres.c.)
+  */
+ static void
+ dummy_handler(SIGNAL_ARGS)
+ {
+ }
+ 
+ /*
+  * RandomSalt
+  */
+ static void
+ RandomSalt(char *md5Salt)
+ {
+ 	long		rand;
+ 
+ 	/*
+ 	 * We use % 255, sacrificing one possible byte value, so as to ensure that
+ 	 * all bits of the random() value participate in the result. While at it,
+ 	 * add one to avoid generating any null bytes.
+ 	 */
+ 	rand = PostmasterRandom();
+ 	md5Salt[0] = (rand % 255) + 1;
+ 	rand = PostmasterRandom();
+ 	md5Salt[1] = (rand % 255) + 1;
+ 	rand = PostmasterRandom();
+ 	md5Salt[2] = (rand % 255) + 1;
+ 	rand = PostmasterRandom();
+ 	md5Salt[3] = (rand % 255) + 1;
+ }
+ 
+ /*
+  * PostmasterRandom
+  */
+ static long
+ PostmasterRandom(void)
+ {
+ 	/*
+ 	 * Select a random seed at the time of first receiving a request.
+ 	 */
+ 	if (random_seed == 0)
+ 	{
+ 		do
+ 		{
+ 			struct timeval random_stop_time;
+ 
+ 			gettimeofday(&random_stop_time, NULL);
+ 
+ 			/*
+ 			 * We are not sure how much precision is in tv_usec, so we swap
+ 			 * the high and low 16 bits of 'random_stop_time' and XOR them
+ 			 * with 'random_start_time'. On the off chance that the result is
+ 			 * 0, we loop until it isn't.
+ 			 */
+ 			random_seed = random_start_time.tv_usec ^
+ 				((random_stop_time.tv_usec << 16) |
+ 				 ((random_stop_time.tv_usec >> 16) & 0xffff));
+ 		}
+ 		while (random_seed == 0);
+ 
+ 		srandom(random_seed);
+ 	}
+ 
+ 	return random();
+ }
+ 
+ /*
+  * Count up number of child processes of specified types (dead_end children
+  * are always excluded).
+  */
+ static int
+ CountChildren(int target)
+ {
+ 	Dlelem	   *curr;
+ 	int			cnt = 0;
+ 
+ 	for (curr = DLGetHead(BackendList); curr; curr = DLGetSucc(curr))
+ 	{
+ 		Backend    *bp = (Backend *) DLE_VAL(curr);
+ 
+ 		if (bp->dead_end)
+ 			continue;
+ 
+ 		/*
+ 		 * Since target == BACKEND_TYPE_ALL is the most common case, we test
+ 		 * it first and avoid touching shared memory for every child.
+ 		 */
+ 		if (target != BACKEND_TYPE_ALL)
+ 		{
+ 			int			child;
+ 
+ 			if (bp->is_autovacuum)
+ 				child = BACKEND_TYPE_AUTOVAC;
+ 			else if (IsPostmasterChildWalSender(bp->child_slot))
+ 				child = BACKEND_TYPE_WALSND;
+ 			else
+ 				child = BACKEND_TYPE_NORMAL;
+ 			if (!(target & child))
+ 				continue;
+ 		}
+ 
+ 		cnt++;
+ 	}
+ 	return cnt;
+ }
+ 
+ 
+ /*
+  * StartChildProcess -- start an auxiliary process for the postmaster
+  *
+  * xlop determines what kind of child will be started.	All child types
+  * initially go to AuxiliaryProcessMain, which will handle common setup.
+  *
+  * Return value of StartChildProcess is subprocess' PID, or 0 if failed
+  * to start subprocess.
+  */
+ static pid_t
+ StartChildProcess(AuxProcType type)
+ {
+ 	pid_t		pid;
+ 	char	   *av[10];
+ 	int			ac = 0;
+ 	char		typebuf[32];
+ 
+ 	/*
+ 	 * Set up command-line arguments for subprocess
+ 	 */
+ 	av[ac++] = "postgres";
+ 
+ #ifdef EXEC_BACKEND
+ 	av[ac++] = "--forkboot";
+ 	av[ac++] = NULL;			/* filled in by postmaster_forkexec */
+ #endif
+ 
+ 	snprintf(typebuf, sizeof(typebuf), "-x%d", type);
+ 	av[ac++] = typebuf;
+ 
+ 	av[ac] = NULL;
+ 	Assert(ac < lengthof(av));
+ 
+ #ifdef EXEC_BACKEND
+ 	pid = postmaster_forkexec(ac, av);
+ #else							/* !EXEC_BACKEND */
+ 	pid = fork_process();
+ 
+ 	if (pid == 0)				/* child */
+ 	{
+ 		IsUnderPostmaster = true;		/* we are a postmaster subprocess now */
+ 
+ 		/* Close the postmaster's sockets */
+ 		ClosePostmasterPorts(false);
+ 
+ 		/* Lose the postmaster's on-exit routines and port connections */
+ 		on_exit_reset();
+ 
+ 		/* Release postmaster's working memory context */
+ 		MemoryContextSwitchTo(TopMemoryContext);
+ 		MemoryContextDelete(PostmasterContext);
+ 		PostmasterContext = NULL;
+ 
+ 		AuxiliaryProcessMain(ac, av);
+ 		ExitPostmaster(0);
+ 	}
+ #endif   /* EXEC_BACKEND */
+ 
+ 	if (pid < 0)
+ 	{
+ 		/* in parent, fork failed */
+ 		int			save_errno = errno;
+ 
+ 		errno = save_errno;
+ 		switch (type)
+ 		{
+ 			case StartupProcess:
+ 				ereport(LOG,
+ 						(errmsg("could not fork startup process: %m")));
+ 				break;
+ 			case BgWriterProcess:
+ 				ereport(LOG,
+ 				   (errmsg("could not fork background writer process: %m")));
+ 				break;
+ 			case WalWriterProcess:
+ 				ereport(LOG,
+ 						(errmsg("could not fork WAL writer process: %m")));
+ 				break;
+ 			case WalReceiverProcess:
+ 				ereport(LOG,
+ 						(errmsg("could not fork WAL receiver process: %m")));
+ 				break;
+ 			default:
+ 				ereport(LOG,
+ 						(errmsg("could not fork process: %m")));
+ 				break;
+ 		}
+ 
+ 		/*
+ 		 * fork failure is fatal during startup, but there's no need to choke
+ 		 * immediately if starting other child types fails.
+ 		 */
+ 		if (type == StartupProcess)
+ 			ExitPostmaster(1);
+ 		return 0;
+ 	}
+ 
+ 	/*
+ 	 * in parent, successful fork
+ 	 */
+ 	return pid;
+ }
+ 
+ /*
+  * StartAutovacuumWorker
+  *		Start an autovac worker process.
+  *
+  * This function is here because it enters the resulting PID into the
+  * postmaster's private backends list.
+  *
+  * NB -- this code very roughly matches BackendStartup.
+  */
+ static void
+ StartAutovacuumWorker(void)
+ {
+ 	Backend    *bn;
+ 
+ 	/*
+ 	 * If not in condition to run a process, don't try, but handle it like a
+ 	 * fork failure.  This does not normally happen, since the signal is only
+ 	 * supposed to be sent by autovacuum launcher when it's OK to do it, but
+ 	 * we have to check to avoid race-condition problems during DB state
+ 	 * changes.
+ 	 */
+ 	if (canAcceptConnections() == CAC_OK)
+ 	{
+ 		bn = (Backend *) malloc(sizeof(Backend));
+ 		if (bn)
+ 		{
+ 			/*
+ 			 * Compute the cancel key that will be assigned to this session.
+ 			 * We probably don't need cancel keys for autovac workers, but
+ 			 * we'd better have something random in the field to prevent
+ 			 * unfriendly people from sending cancels to them.
+ 			 */
+ 			MyCancelKey = PostmasterRandom();
+ 			bn->cancel_key = MyCancelKey;
+ 
+ 			/* Autovac workers are not dead_end and need a child slot */
+ 			bn->dead_end = false;
+ 			bn->child_slot = MyPMChildSlot = AssignPostmasterChildSlot();
+ 
+ 			bn->pid = StartAutoVacWorker();
+ 			if (bn->pid > 0)
+ 			{
+ 				bn->is_autovacuum = true;
+ 				DLInitElem(&bn->elem, bn);
+ 				DLAddHead(BackendList, &bn->elem);
+ #ifdef EXEC_BACKEND
+ 				ShmemBackendArrayAdd(bn);
+ #endif
+ 				/* all OK */
+ 				return;
+ 			}
+ 
+ 			/*
+ 			 * fork failed, fall through to report -- actual error message was
+ 			 * logged by StartAutoVacWorker
+ 			 */
+ 			(void) ReleasePostmasterChildSlot(bn->child_slot);
+ 			free(bn);
+ 		}
+ 		else
+ 			ereport(LOG,
+ 					(errcode(ERRCODE_OUT_OF_MEMORY),
+ 					 errmsg("out of memory")));
+ 	}
+ 
+ 	/*
+ 	 * Report the failure to the launcher, if it's running.  (If it's not, we
+ 	 * might not even be connected to shared memory, so don't try to call
+ 	 * AutoVacWorkerFailed.)  Note that we also need to signal it so that it
+ 	 * responds to the condition, but we don't do that here, instead waiting
+ 	 * for ServerLoop to do it.  This way we avoid a ping-pong signalling in
+ 	 * quick succession between the autovac launcher and postmaster in case
+ 	 * things get ugly.
+ 	 */
+ 	if (AutoVacPID != 0)
+ 	{
+ 		AutoVacWorkerFailed();
+ 		avlauncher_needs_signal = true;
+ 	}
+ }
+ 
+ /*
+  * Create the opts file
+  */
+ static bool
+ CreateOptsFile(int argc, char *argv[], char *fullprogname)
+ {
+ 	FILE	   *fp;
+ 	int			i;
+ 
+ #define OPTS_FILE	"postmaster.opts"
+ 
+ 	if ((fp = fopen(OPTS_FILE, "w")) == NULL)
+ 	{
+ 		elog(LOG, "could not create file \"%s\": %m", OPTS_FILE);
+ 		return false;
+ 	}
+ 
+ 	fprintf(fp, "%s", fullprogname);
+ 	for (i = 1; i < argc; i++)
+ 		fprintf(fp, " \"%s\"", argv[i]);
+ 	fputs("\n", fp);
+ 
+ 	if (fclose(fp))
+ 	{
+ 		elog(LOG, "could not write file \"%s\": %m", OPTS_FILE);
+ 		return false;
+ 	}
+ 
+ 	return true;
+ }
+ 
+ 
+ /*
+  * MaxLivePostmasterChildren
+  *
+  * This reports the number of entries needed in per-child-process arrays
+  * (the PMChildFlags array, and if EXEC_BACKEND the ShmemBackendArray).
+  * These arrays include regular backends, autovac workers and walsenders,
+  * but not special children nor dead_end children.	This allows the arrays
+  * to have a fixed maximum size, to wit the same too-many-children limit
+  * enforced by canAcceptConnections().	The exact value isn't too critical
+  * as long as it's more than MaxBackends.
+  */
+ int
+ MaxLivePostmasterChildren(void)
+ {
+ 	return 2 * MaxBackends;
+ }
+ 
+ 
+ #ifdef EXEC_BACKEND
+ 
+ /*
+  * The following need to be available to the save/restore_backend_variables
+  * functions
+  */
+ extern slock_t *ShmemLock;
+ extern LWLock *LWLockArray;
+ extern slock_t *ProcStructLock;
+ extern PGPROC *AuxiliaryProcs;
+ extern PMSignalData *PMSignalState;
+ extern pgsocket pgStatSock;
+ 
+ #ifndef WIN32
+ #define write_inheritable_socket(dest, src, childpid) ((*(dest) = (src)), true)
+ #define read_inheritable_socket(dest, src) (*(dest) = *(src))
+ #else
+ static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child);
+ static bool write_inheritable_socket(InheritableSocket *dest, SOCKET src,
+ 						 pid_t childPid);
+ static void read_inheritable_socket(SOCKET *dest, InheritableSocket *src);
+ #endif
+ 
+ 
+ /* Save critical backend variables into the BackendParameters struct */
+ #ifndef WIN32
+ static bool
+ save_backend_variables(BackendParameters *param, Port *port)
+ #else
+ static bool
+ save_backend_variables(BackendParameters *param, Port *port,
+ 					   HANDLE childProcess, pid_t childPid)
+ #endif
+ {
+ 	memcpy(&param->port, port, sizeof(Port));
+ 	if (!write_inheritable_socket(&param->portsocket, port->sock, childPid))
+ 		return false;
+ 
+ 	strlcpy(param->DataDir, DataDir, MAXPGPATH);
+ 
+ 	memcpy(&param->ListenSocket, &ListenSocket, sizeof(ListenSocket));
+ 
+ 	param->MyCancelKey = MyCancelKey;
+ 	param->MyPMChildSlot = MyPMChildSlot;
+ 
+ 	param->UsedShmemSegID = UsedShmemSegID;
+ 	param->UsedShmemSegAddr = UsedShmemSegAddr;
+ 
+ 	param->ShmemLock = ShmemLock;
+ 	param->ShmemVariableCache = ShmemVariableCache;
+ 	param->ShmemBackendArray = ShmemBackendArray;
+ 
+ 	param->LWLockArray = LWLockArray;
+ 	param->ProcStructLock = ProcStructLock;
+ 	param->ProcGlobal = ProcGlobal;
+ 	param->AuxiliaryProcs = AuxiliaryProcs;
+ 	param->PMSignalState = PMSignalState;
+ 	if (!write_inheritable_socket(&param->pgStatSock, pgStatSock, childPid))
+ 		return false;
+ 
+ 	param->PostmasterPid = PostmasterPid;
+ 	param->PgStartTime = PgStartTime;
+ 	param->PgReloadTime = PgReloadTime;
+ 
+ 	param->redirection_done = redirection_done;
+ 
+ #ifdef WIN32
+ 	param->PostmasterHandle = PostmasterHandle;
+ 	if (!write_duplicated_handle(&param->initial_signal_pipe,
+ 								 pgwin32_create_signal_listener(childPid),
+ 								 childProcess))
+ 		return false;
+ #else
+ 	memcpy(&param->postmaster_alive_fds, &postmaster_alive_fds,
+ 		   sizeof(postmaster_alive_fds));
+ #endif
+ 
+ 	memcpy(&param->syslogPipe, &syslogPipe, sizeof(syslogPipe));
+ 
+ 	strlcpy(param->my_exec_path, my_exec_path, MAXPGPATH);
+ 
+ 	strlcpy(param->pkglib_path, pkglib_path, MAXPGPATH);
+ 
+ 	strlcpy(param->ExtraOptions, ExtraOptions, MAXPGPATH);
+ 
+ 	return true;
+ }
+ 
+ 
+ #ifdef WIN32
+ /*
+  * Duplicate a handle for usage in a child process, and write the child
+  * process instance of the handle to the parameter file.
+  */
+ static bool
+ write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
+ {
+ 	HANDLE		hChild = INVALID_HANDLE_VALUE;
+ 
+ 	if (!DuplicateHandle(GetCurrentProcess(),
+ 						 src,
+ 						 childProcess,
+ 						 &hChild,
+ 						 0,
+ 						 TRUE,
+ 						 DUPLICATE_CLOSE_SOURCE | DUPLICATE_SAME_ACCESS))
+ 	{
+ 		ereport(LOG,
+ 				(errmsg_internal("could not duplicate handle to be written to backend parameter file: error code %lu",
+ 								 GetLastError())));
+ 		return false;
+ 	}
+ 
+ 	*dest = hChild;
+ 	return true;
+ }
+ 
+ /*
+  * Duplicate a socket for usage in a child process, and write the resulting
+  * structure to the parameter file.
+  * This is required because a number of LSPs (Layered Service Providers) very
+  * common on Windows (antivirus, firewalls, download managers etc) break
+  * straight socket inheritance.
+  */
+ static bool
+ write_inheritable_socket(InheritableSocket *dest, SOCKET src, pid_t childpid)
+ {
+ 	dest->origsocket = src;
+ 	if (src != 0 && src != PGINVALID_SOCKET)
+ 	{
+ 		/* Actual socket */
+ 		if (WSADuplicateSocket(src, childpid, &dest->wsainfo) != 0)
+ 		{
+ 			ereport(LOG,
+ 					(errmsg("could not duplicate socket %d for use in backend: error code %d",
+ 							(int) src, WSAGetLastError())));
+ 			return false;
+ 		}
+ 	}
+ 	return true;
+ }
+ 
+ /*
+  * Read a duplicate socket structure back, and get the socket descriptor.
+  */
+ static void
+ read_inheritable_socket(SOCKET *dest, InheritableSocket *src)
+ {
+ 	SOCKET		s;
+ 
+ 	if (src->origsocket == PGINVALID_SOCKET || src->origsocket == 0)
+ 	{
+ 		/* Not a real socket! */
+ 		*dest = src->origsocket;
+ 	}
+ 	else
+ 	{
+ 		/* Actual socket, so create from structure */
+ 		s = WSASocket(FROM_PROTOCOL_INFO,
+ 					  FROM_PROTOCOL_INFO,
+ 					  FROM_PROTOCOL_INFO,
+ 					  &src->wsainfo,
+ 					  0,
+ 					  0);
+ 		if (s == INVALID_SOCKET)
+ 		{
+ 			write_stderr("could not create inherited socket: error code %d\n",
+ 						 WSAGetLastError());
+ 			exit(1);
+ 		}
+ 		*dest = s;
+ 
+ 		/*
+ 		 * To make sure we don't get two references to the same socket, close
+ 		 * the original one. (This would happen when inheritance actually
+ 		 * works..
+ 		 */
+ 		closesocket(src->origsocket);
+ 	}
+ }
+ #endif
+ 
+ static void
+ read_backend_variables(char *id, Port *port)
+ {
+ 	BackendParameters param;
+ 
+ #ifndef WIN32
+ 	/* Non-win32 implementation reads from file */
+ 	FILE	   *fp;
+ 
+ 	/* Open file */
+ 	fp = AllocateFile(id, PG_BINARY_R);
+ 	if (!fp)
+ 	{
+ 		write_stderr("could not read from backend variables file \"%s\": %s\n",
+ 					 id, strerror(errno));
+ 		exit(1);
+ 	}
+ 
+ 	if (fread(&param, sizeof(param), 1, fp) != 1)
+ 	{
+ 		write_stderr("could not read from backend variables file \"%s\": %s\n",
+ 					 id, strerror(errno));
+ 		exit(1);
+ 	}
+ 
+ 	/* Release file */
+ 	FreeFile(fp);
+ 	if (unlink(id) != 0)
+ 	{
+ 		write_stderr("could not remove file \"%s\": %s\n",
+ 					 id, strerror(errno));
+ 		exit(1);
+ 	}
+ #else
+ 	/* Win32 version uses mapped file */
+ 	HANDLE		paramHandle;
+ 	BackendParameters *paramp;
+ 
+ #ifdef _WIN64
+ 	paramHandle = (HANDLE) _atoi64(id);
+ #else
+ 	paramHandle = (HANDLE) atol(id);
+ #endif
+ 	paramp = MapViewOfFile(paramHandle, FILE_MAP_READ, 0, 0, 0);
+ 	if (!paramp)
+ 	{
+ 		write_stderr("could not map view of backend variables: error code %lu\n",
+ 					 GetLastError());
+ 		exit(1);
+ 	}
+ 
+ 	memcpy(&param, paramp, sizeof(BackendParameters));
+ 
+ 	if (!UnmapViewOfFile(paramp))
+ 	{
+ 		write_stderr("could not unmap view of backend variables: error code %lu\n",
+ 					 GetLastError());
+ 		exit(1);
+ 	}
+ 
+ 	if (!CloseHandle(paramHandle))
+ 	{
+ 		write_stderr("could not close handle to backend parameter variables: error code %lu\n",
+ 					 GetLastError());
+ 		exit(1);
+ 	}
+ #endif
+ 
+ 	restore_backend_variables(&param, port);
+ }
+ 
+ /* Restore critical backend variables from the BackendParameters struct */
+ static void
+ restore_backend_variables(BackendParameters *param, Port *port)
+ {
+ 	memcpy(port, &param->port, sizeof(Port));
+ 	read_inheritable_socket(&port->sock, &param->portsocket);
+ 
+ 	SetDataDir(param->DataDir);
+ 
+ 	memcpy(&ListenSocket, &param->ListenSocket, sizeof(ListenSocket));
+ 
+ 	MyCancelKey = param->MyCancelKey;
+ 	MyPMChildSlot = param->MyPMChildSlot;
+ 
+ 	UsedShmemSegID = param->UsedShmemSegID;
+ 	UsedShmemSegAddr = param->UsedShmemSegAddr;
+ 
+ 	ShmemLock = param->ShmemLock;
+ 	ShmemVariableCache = param->ShmemVariableCache;
+ 	ShmemBackendArray = param->ShmemBackendArray;
+ 
+ 	LWLockArray = param->LWLockArray;
+ 	ProcStructLock = param->ProcStructLock;
+ 	ProcGlobal = param->ProcGlobal;
+ 	AuxiliaryProcs = param->AuxiliaryProcs;
+ 	PMSignalState = param->PMSignalState;
+ 	read_inheritable_socket(&pgStatSock, &param->pgStatSock);
+ 
+ 	PostmasterPid = param->PostmasterPid;
+ 	PgStartTime = param->PgStartTime;
+ 	PgReloadTime = param->PgReloadTime;
+ 
+ 	redirection_done = param->redirection_done;
+ 
+ #ifdef WIN32
+ 	PostmasterHandle = param->PostmasterHandle;
+ 	pgwin32_initial_signal_pipe = param->initial_signal_pipe;
+ #else
+ 	memcpy(&postmaster_alive_fds, &param->postmaster_alive_fds,
+ 		   sizeof(postmaster_alive_fds));
+ #endif
+ 
+ 	memcpy(&syslogPipe, &param->syslogPipe, sizeof(syslogPipe));
+ 
+ 	strlcpy(my_exec_path, param->my_exec_path, MAXPGPATH);
+ 
+ 	strlcpy(pkglib_path, param->pkglib_path, MAXPGPATH);
+ 
+ 	strlcpy(ExtraOptions, param->ExtraOptions, MAXPGPATH);
+ }
+ 
+ 
+ Size
+ ShmemBackendArraySize(void)
+ {
+ 	return mul_size(MaxLivePostmasterChildren(), sizeof(Backend));
+ }
+ 
+ void
+ ShmemBackendArrayAllocation(void)
+ {
+ 	Size		size = ShmemBackendArraySize();
+ 
+ 	ShmemBackendArray = (Backend *) ShmemAlloc(size);
+ 	/* Mark all slots as empty */
+ 	memset(ShmemBackendArray, 0, size);
+ }
+ 
+ static void
+ ShmemBackendArrayAdd(Backend *bn)
+ {
+ 	/* The array slot corresponding to my PMChildSlot should be free */
+ 	int			i = bn->child_slot - 1;
+ 
+ 	Assert(ShmemBackendArray[i].pid == 0);
+ 	ShmemBackendArray[i] = *bn;
+ }
+ 
+ static void
+ ShmemBackendArrayRemove(Backend *bn)
+ {
+ 	int			i = bn->child_slot - 1;
+ 
+ 	Assert(ShmemBackendArray[i].pid == bn->pid);
+ 	/* Mark the slot as empty */
+ 	ShmemBackendArray[i].pid = 0;
+ }
+ #endif   /* EXEC_BACKEND */
+ 
+ 
+ #ifdef WIN32
+ 
+ static pid_t
+ win32_waitpid(int *exitstatus)
+ {
+ 	DWORD		dwd;
+ 	ULONG_PTR	key;
+ 	OVERLAPPED *ovl;
+ 
+ 	/*
+ 	 * Check if there are any dead children. If there are, return the pid of
+ 	 * the first one that died.
+ 	 */
+ 	if (GetQueuedCompletionStatus(win32ChildQueue, &dwd, &key, &ovl, 0))
+ 	{
+ 		*exitstatus = (int) key;
+ 		return dwd;
+ 	}
+ 
+ 	return -1;
+ }
+ 
+ /*
+  * Note! Code below executes on a thread pool! All operations must
+  * be thread safe! Note that elog() and friends must *not* be used.
+  */
+ static void WINAPI
+ pgwin32_deadchild_callback(PVOID lpParameter, BOOLEAN TimerOrWaitFired)
+ {
+ 	win32_deadchild_waitinfo *childinfo = (win32_deadchild_waitinfo *) lpParameter;
+ 	DWORD		exitcode;
+ 
+ 	if (TimerOrWaitFired)
+ 		return;					/* timeout. Should never happen, since we use
+ 								 * INFINITE as timeout value. */
+ 
+ 	/*
+ 	 * Remove handle from wait - required even though it's set to wait only
+ 	 * once
+ 	 */
+ 	UnregisterWaitEx(childinfo->waitHandle, NULL);
+ 
+ 	if (!GetExitCodeProcess(childinfo->procHandle, &exitcode))
+ 	{
+ 		/*
+ 		 * Should never happen. Inform user and set a fixed exitcode.
+ 		 */
+ 		write_stderr("could not read exit code for process\n");
+ 		exitcode = 255;
+ 	}
+ 
+ 	if (!PostQueuedCompletionStatus(win32ChildQueue, childinfo->procId, (ULONG_PTR) exitcode, NULL))
+ 		write_stderr("could not post child completion status\n");
+ 
+ 	/*
+ 	 * Handle is per-process, so we close it here instead of in the
+ 	 * originating thread
+ 	 */
+ 	CloseHandle(childinfo->procHandle);
+ 
+ 	/*
+ 	 * Free struct that was allocated before the call to
+ 	 * RegisterWaitForSingleObject()
+ 	 */
+ 	free(childinfo);
+ 
+ 	/* Queue SIGCHLD signal */
+ 	pg_queue_signal(SIGCHLD);
+ }
+ 
+ #endif   /* WIN32 */
+ 
+ /*
+  * Initialize one and only handle for monitoring postmaster death.
+  *
+  * Called once in the postmaster, so that child processes can subsequently
+  * monitor if their parent is dead.
+  */
+ static void
+ InitPostmasterDeathWatchHandle(void)
+ {
+ #ifndef WIN32
+ 	/*
+ 	 * Create a pipe. Postmaster holds the write end of the pipe open
+ 	 * (POSTMASTER_FD_OWN), and children hold the read end. Children can
+ 	 * pass the read file descriptor to select() to wake up in case postmaster
+ 	 * dies, or check for postmaster death with a (read() == 0). Children must
+ 	 * close the write end as soon as possible after forking, because EOF
+ 	 * won't be signaled in the read end until all processes have closed the
+ 	 * write fd. That is taken care of in ClosePostmasterPorts().
+ 	 */
+ 	Assert(MyProcPid == PostmasterPid);
+ 	if (pipe(postmaster_alive_fds))
+ 		ereport(FATAL,
+ 				(errcode_for_file_access(),
+ 				 errmsg_internal("could not create pipe to monitor postmaster death: %m")));
+ 
+ 	/*
+ 	 * Set O_NONBLOCK to allow testing for the fd's presence with a read()
+ 	 * call.
+ 	 */
+ 	if (fcntl(postmaster_alive_fds[POSTMASTER_FD_WATCH], F_SETFL, O_NONBLOCK))
+ 		ereport(FATAL,
+ 				(errcode_for_socket_access(),
+ 				 errmsg_internal("could not set postmaster death monitoring pipe to non-blocking mode: %m")));
+ 
+ #else
+ 	/*
+ 	 * On Windows, we use a process handle for the same purpose.
+ 	 */
+ 	if (DuplicateHandle(GetCurrentProcess(),
+ 						GetCurrentProcess(),
+ 						GetCurrentProcess(),
+ 						&PostmasterHandle,
+ 						0,
+ 						TRUE,
+ 						DUPLICATE_SAME_ACCESS) == 0)
+ 		ereport(FATAL,
+ 				(errmsg_internal("could not duplicate postmaster handle: error code %lu",
+ 								 GetLastError())));
+ #endif   /* WIN32 */
+ }
diff -rcN postgresql/src/bin/pg_controldata/pg_controldata.c postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c
*** postgresql/src/bin/pg_controldata/pg_controldata.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 234,239 ****
--- 234,242 ----
  		   ControlFile.backupStartPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("Current wal_level setting:            %s\n"),
  		   wal_level_str(ControlFile.wal_level));
  	printf(_("Current max_connections setting:      %d\n"),
diff -rcN postgresql/src/bin/pg_ctl/pg_ctl.c postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c
*** postgresql/src/bin/pg_ctl/pg_ctl.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 882,894 ****
  	{
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
! 		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
  			print_msg(_("WARNING: online backup mode is active\n"
  						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
--- 882,891 ----
  	{
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
! 		 * that smart shutdown will wait for it to finish.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
  			print_msg(_("WARNING: online backup mode is active\n"
  						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
diff -rcN postgresql/src/bin/pg_resetxlog/pg_resetxlog.c postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c
*** postgresql/src/bin/pg_resetxlog/pg_resetxlog.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 503,509 ****
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 503,509 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 638,643 ****
--- 638,645 ----
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  
  	/*
  	 * Force the defaults for max_* settings. The values don't really matter
diff -rcN postgresql/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql/src/include/access/xlog.h	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-09-12 05:24:42.000000000 +0900
***************
*** 328,331 ****
--- 328,335 ----
  #define BACKUP_LABEL_FILE		"backup_label"
  #define BACKUP_LABEL_OLD		"backup_label.old"
  
+ /* These are written in backup label file */
+ #define BACKUP_FROM_MASTER		"master"
+ #define BACKUP_FROM_SLAVE		"slave"
+ 
  #endif   /* XLOG_H */
diff -rcN postgresql/src/include/catalog/pg_control.h postgresql_with_patch/src/include/catalog/pg_control.h
*** postgresql/src/include/catalog/pg_control.h	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/include/catalog/pg_control.h	2011-09-12 05:24:42.000000000 +0900
***************
*** 143,152 ****
--- 143,158 ----
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
  	 * file was found at startup but it may have been a leftover from a stray
  	 * pg_start_backup() call, not accompanied by pg_stop_backup().
+ 	 *
+ 	 * backupEndPoint is set a same value as minRecoveryPoint at the first
+ 	 * recovery if the backup is taken from hot standby. It is unset if we
+ 	 * reach the end of backup. It is not set if the backup is taken from
+ 	 * normal running.
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
  	bool		backupEndRequired;
+ 	XLogRecPtr	backupEndPoint;
  
  	/*
  	 * Parameter settings that determine if the WAL can be used for archival

#17

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Jun Ishiduka (#16)

1 attachment(s)

Re: Online base backup from the hot-standby

Updated patch.

Changes:
* the user must set full_page_writes to 'on' (in the documentation)
* read "FROM: XX" from backup_label (in xlog.c)
* check the server status when pg_stop_backup is executed (in xlog.c)

Hi, I created a patch in response to the comments.

* Procedure
1. Call pg_start_backup('x') on the hot standby.
2. Take a backup of the data dir.
3. Copy the control file on the hot standby to the backup.
4. Call pg_stop_backup() on the hot standby.

* Behavior
(take backup)
When pg_start_backup() is executed on the hot standby, it performs a
restartpoint, writes the string "FROM: slave" into backup_label and enters
backup mode, but it does not force full_page_writes to "on".

When pg_stop_backup() is executed on the hot standby, it renames
backup_label and leaves backup mode, but it neither writes a backup end
record or history file nor waits for WAL archiving to complete.
pg_stop_backup() returns MinRecoveryPoint as its result.

If pg_stop_backup() is executed after the server has been promoted, an
error message is output, because pg_stop_backup() reads backup_label.
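
As a rough illustration of the check described above (a sketch only, not
code from the patch): the helper names parse_backup_from() and
stop_backup_allowed() are hypothetical, and the exact layout of the
"FROM:" line in backup_label is an assumption. Only the two label values
come from the patch's xlog.h additions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Label values as defined by the patch in xlog.h. */
#define BACKUP_FROM_MASTER "master"
#define BACKUP_FROM_SLAVE  "slave"

/*
 * Hypothetical helper: scan backup_label text line by line for a
 * "FROM: xxx" entry and copy the value into 'from'.
 * Returns true if such a line was found.
 */
static bool
parse_backup_from(const char *label, char *from, size_t fromsize)
{
	const char *line = label;

	assert(label != NULL && from != NULL && fromsize > 0);

	while (*line != '\0')
	{
		char		value[16];

		/* matches only when the line starts with the literal "FROM:" */
		if (sscanf(line, "FROM: %15[^\n]", value) == 1)
		{
			snprintf(from, fromsize, "%s", value);
			return true;
		}

		/* advance to the start of the next line, if any */
		line = strchr(line, '\n');
		if (line == NULL)
			break;
		line++;
	}
	return false;
}

/*
 * Hypothetical check: a backup started on a standby cannot be finished
 * once the server has left recovery (that is, after promotion).
 */
static bool
stop_backup_allowed(const char *label, bool in_recovery)
{
	char		from[16];

	if (!parse_backup_from(label, from, sizeof(from)))
		return true;			/* no FROM line: assumed to be acceptable */
	if (strcmp(from, BACKUP_FROM_SLAVE) == 0)
		return in_recovery;
	return true;
}
```

With a label containing "FROM: slave", stop_backup_allowed() returns true
while the server is still in recovery and false after promotion, which is
the case where the backup has to be retaken.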

(recovery)
When recovering from a backup taken on the hot standby, MinRecoveryPoint in
the control file copied in step 3 of the procedure above is used instead of
a backup end record.

When recovery starts for the first time, BackupEndPoint in the control file
is set to the same value as MinRecoveryPoint. This remembers the original
value of MinRecoveryPoint during recovery.

The HINT message ("If this has ...") is always output when recovering from
a backup taken on the hot standby.

* Problem
The full_page_writes problem.

There are two problems here:
* pg_start_backup() must set full_page_writes to 'on' on the master that
actually writes the WAL, not on the standby.
* The standby is not necessarily connected to the master that actually
writes the WAL.
(Example: Standby2 in cascading replication: Master - Standby1 - Standby2)

I am unsure how to resolve these problems.

Status: Considering
(Latest: http://archives.postgresql.org/pgsql-hackers/2011-08/msg00880.php)

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------


Attachments:

standby_online_backup_07.patch (application/octet-stream)

diff -rcN postgresql/doc/src/sgml/backup.sgml postgresql_with_patch/doc/src/sgml/backup.sgml
*** postgresql/doc/src/sgml/backup.sgml	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/doc/src/sgml/backup.sgml	2011-09-13 15:23:49.000000000 +0900
***************
*** 724,734 ****
     <title>Making a Base Backup</title>
  
     <para>
!     The procedure for making a base backup is relatively simple:
    <orderedlist>
     <listitem>
      <para>
!      Ensure that WAL archiving is enabled and working.
      </para>
     </listitem>
     <listitem>
--- 724,736 ----
     <title>Making a Base Backup</title>
  
     <para>
!     The procedure for making a base backup is relatively simple. This can
!     also run on hot standby, the procedure is a little different:
    <orderedlist>
     <listitem>
      <para>
!      Ensure that WAL archiving is enabled and working. On hot standby then
!      this does not need to ensure since there is no WAL archiving originally.
      </para>
     </listitem>
     <listitem>
***************
*** 780,785 ****
--- 782,795 ----
     </listitem>
     <listitem>
      <para>
+      Copy <filename>pg_control</> file to the backup taken by above-procedure.
+      This needs on hot standby. This is performed to recovery with Minimum
+      recovery ending location in <filename>pg_control</> instead of backup end
+      record.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
       Again connect to the database as a superuser, and issue the command:
  <programlisting>
  SELECT pg_stop_backup();
***************
*** 788,793 ****
--- 798,805 ----
       the next WAL segment.  The reason for the switch is to arrange for
       the last WAL segment file written during the backup interval to be
       ready to archive.
+      On hot standby then this terminates the backup mode but does not perform
+      an automatic switch.
      </para>
     </listitem>
     <listitem>
***************
*** 808,813 ****
--- 820,827 ----
       If you wish to place a time limit on the execution of
       <function>pg_stop_backup</>, set an appropriate
       <varname>statement_timeout</varname> value.
+      On hot standby then this does not perform. If WAL archiving is used,
+      ensure to complete archiving as far as <function>pg_stop_backup</> result.
      </para>
     </listitem>
    </orderedlist>
***************
*** 851,856 ****
--- 865,873 ----
      effectively forced on during backup mode.)  You must ensure that these
      steps are carried out in sequence, without any possible
      overlap, or you will invalidate the backup.
+     On hot standby, full_page_writes is not effectively forced because hot
+     standby does not write WAL. you must set 'on' full_page_writes on master
+     during backup mode.
     </para>
  
     <para>
***************
*** 933,938 ****
--- 950,967 ----
      backup dump is which and how far back the associated WAL files go.
      It is generally better to follow the continuous archiving procedure above.
     </para>
+ 
+    <para>
+     <function>pg_stop_backup</> result on hot standby is may be incorrect. But
+     this value is greater than the correct value. If this value is used in
+     recovery then a phenomenon that WAL is not enough does not happen.
+    </para>
+ 
+    <para>
+     When you run in hotstandby <function>pg_start_backup</>, and, if promoted
+     to master when you run the <function>pg_stop_backup</>,
+     <function>pg_stop_backup</> will be failed. Retake the backup then.
+    </para>
    </sect2>
  
    <sect2 id="backup-pitr-recovery">
diff -rcN postgresql/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql/src/backend/access/transam/xlog.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-09-13 14:55:01.000000000 +0900
***************
*** 664,670 ****
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 664,670 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, char *backupfromstr);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 6023,6028 ****
--- 6023,6029 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	char		backupfromstr[10];
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6061,6067 ****
  				(errmsg("database system was interrupted while in recovery at log time %s",
  						str_time(ControlFile->checkPointCopy.time)),
  				 errhint("If this has occurred more than once some data might be corrupted"
! 			  " and you might need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  			  (errmsg("database system was interrupted; last known up at %s",
--- 6062,6069 ----
  				(errmsg("database system was interrupted while in recovery at log time %s",
  						str_time(ControlFile->checkPointCopy.time)),
  				 errhint("If this has occurred more than once some data might be corrupted"
! 			  " and does not take online backup from hot standby"
! 			  " then you might need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  			  (errmsg("database system was interrupted; last known up at %s",
***************
*** 6156,6162 ****
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6158,6164 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired, backupfromstr))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6307,6312 ****
--- 6309,6328 ----
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
  		/*
+ 		 * Set backupEndPoint at the start if the online backup was taken
+ 		 * from a hot standby. backupEndPoint is used to distinguish whether
+ 		 * the backup was taken from the master or a hot standby. If both
+ 		 * backupStartPoint and backupEndPoint are invalid, this is the first
+ 		 * recovery.
+ 		 */
+ 		if (XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 		    XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
+ 		{
+ 			if (!XLogRecPtrIsInvalid(ControlFile->minRecoveryPoint))
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 		}
+ 
+ 		/*
  		 * Update pg_control to show that we are recovering and to show the
  		 * selected checkpoint as the place we are starting from. We also mark
  		 * pg_control with any minimum recovery stop point obtained from a
***************
*** 6618,6623 ****
--- 6634,6660 ----
  				error_context_stack = errcontext.previous;
  
  				/*
+ 				 * Check whether redo reaches minRecoveryPoint if we take online
+ 				 * backup from hot standby. Because we can not write backup end
+ 				 * record when we execute pg_stop_backup under the situation.
+ 				 */
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
+ 				{
+ 					if (XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 					{
+ 						elog(DEBUG1, "end of backup reached in the control file");
+ 
+ 						LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 						MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 						MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 						UpdateControlFile();
+ 
+ 						LWLockRelease(ControlFileLock);
+ 					}
+ 				}
+ 
+ 				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
  				 */
***************
*** 8414,8423 ****
  		/*
  		 * If we see a shutdown checkpoint while waiting for an end-of-backup
  		 * record, the backup was canceled and the end-of-backup record will
! 		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8451,8463 ----
  		/*
  		 * If we see a shutdown checkpoint while waiting for an end-of-backup
  		 * record, the backup was canceled and the end-of-backup record will
! 		 * never arrive. If the backup is taken from hot standby then this
! 		 * error is not output because there is a case of shutdown on master
! 		 * during taking online backup from hot standby.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8880,8885 ****
--- 8920,8926 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		duringrecovery = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8890,8908 ****
  	struct stat stat_buf;
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
  	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 8931,8956 ----
  	struct stat stat_buf;
  	FILE	   *fp;
  	StringInfoData labelfbuf;
+ 	char	   *backupfromstr = BACKUP_FROM_MASTER;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
+ 	/*
+ 	 * check whether during recovery, and determine a string on backup_label.
+ 	 * If duringrecovery is true here then the subsequent process on WAL (check
+ 	 * wal_level, RequestXLogSwitch, forcePageWrites and gotUniqueStartpoint
+ 	 * by RequestCheckpoint) is skipped because hot standby can not write a wal.
+ 	 */
  	if (RecoveryInProgress())
! 	{
! 		duringrecovery = true;
! 		backupfromstr = BACKUP_FROM_SLAVE;
! 	}
  
! 	if (!XLogIsNeeded() && !duringrecovery)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8925,8931 ****
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 8973,8980 ----
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	if (!duringrecovery)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8959,8965 ****
  	}
  	else
  		XLogCtl->Insert.nonExclusiveBackups++;
! 	XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
--- 9008,9015 ----
  	}
  	else
  		XLogCtl->Insert.nonExclusiveBackups++;
! 	if (!duringrecovery)
! 		XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/* Ensure we release forcePageWrites if fail below */
***************
*** 9010,9016 ****
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
! 		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
--- 9060,9066 ----
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
! 		} while (!gotUniqueStartpoint && !duringrecovery);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
***************
*** 9033,9038 ****
--- 9083,9089 ----
  						 exclusive ? "pg_start_backup" : "streamed");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
+ 		appendStringInfo(&labelfbuf, "FROM: %s\n", backupfromstr);
  
  		/*
  		 * Okay, write the file, or return its contents to caller.
***************
*** 9105,9111 ****
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9156,9163 ----
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 		!RecoveryInProgress())
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9149,9154 ****
--- 9201,9207 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		duringrecovery = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9168,9192 ****
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
  	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
--- 9221,9277 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	XLogRecPtr	checkPointLoc;
+ 	bool		backupEndRequired = false;
+ 	char		backupfromstr[10];
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
+ 	/*
+ 	 * check whether during recovery. If duringrecovery is true here then the
+ 	 * subsequent process on WAL (check wal_level, forcePageWrites, XLogInsert,
+ 	 * RequestXLogSwitch, write the backup history file and XLogArchivingActive)
+ 	 * is skipped because hot standby can not write a wal.
+ 	 */
  	if (RecoveryInProgress())
! 		duringrecovery = true;
  
! 	if (!XLogIsNeeded() && !duringrecovery)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
+ 	 * backupfromstr is taken from backup_label, which is where we
+ 	 * execute pg_start_backup on. check whether this state is equals it.
+ 	 * If read_backup_label function returns error, this function is
+ 	 * failed by error handling after this.
+ 	 */
+ 	if (read_backup_label(&checkPointLoc, &backupEndRequired, backupfromstr))
+ 	{
+ 		if (duringrecovery == false)
+ 		{
+ 			if (strcmp(backupfromstr, BACKUP_FROM_MASTER) != 0 ||
+ 				ControlFile->state != DB_IN_PRODUCTION)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 						 errmsg("different state than when pg_start_backup is executed")));
+ 		}
+ 		else
+ 		{
+ 			if (strcmp(backupfromstr, BACKUP_FROM_SLAVE) != 0 ||
+ 				ControlFile->state != DB_IN_ARCHIVE_RECOVERY)
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 						 errmsg("different state than when pg_start_backup is executed")));
+ 		}
+ 	}
+ 
+ 	/*
  	 * OK to update backup counters and forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
***************
*** 9205,9211 ****
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9290,9297 ----
  	}
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 	    !duringrecovery)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9259,9264 ****
--- 9345,9357 ----
  	}
  
  	/*
+ 	 * When pg_stop_backup is executed on a hot standby server, the result
+ 	 * is the minRecoveryPoint in the control file.
+ 	 */
+ 	if (duringrecovery)
+ 		return ControlFile->minRecoveryPoint;
+ 
+ 	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
***************
*** 9416,9422 ****
  	XLogCtl->Insert.nonExclusiveBackups--;
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0)
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
--- 9509,9516 ----
  	XLogCtl->Insert.nonExclusiveBackups--;
  
  	if (!XLogCtl->Insert.exclusiveBackup &&
! 		XLogCtl->Insert.nonExclusiveBackups == 0 &&
! 		!RecoveryInProgress())
  	{
  		XLogCtl->Insert.forcePageWrites = false;
  	}
***************
*** 9790,9802 ****
   * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
--- 9884,9897 ----
   * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired, char *backupfromstr)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		strbuff[256];
  
  	*backupEndRequired = false;
  
***************
*** 9842,9847 ****
--- 9937,9957 ----
  			*backupEndRequired = true;
  	}
  
+ 	fgets(strbuff, sizeof(strbuff), lfp);  /* skip one line */
+ 	fgets(strbuff, sizeof(strbuff), lfp);  /* skip one line */
+ 	fgets(strbuff, sizeof(strbuff), lfp);  /* skip one line */
+ 
+ 	/*
+ 	 * Read and parse the FROM line. If not read then output WARNING message
+ 	 * and set BACKUP_FROM_MASTER.
+ 	 *
+ 	 */
+ 	strcpy(backupfromstr, BACKUP_FROM_MASTER);
+ 	if (fscanf(lfp, "FROM: %s%c", backupfromstr, &ch) != 2 || ch != '\n')
+ 		ereport(WARNING,
+ 				(errmsg("loaded old file \"%s\", set backup from \"%s\"",
+ 						BACKUP_LABEL_FILE, BACKUP_FROM_MASTER)));
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
diff -rcN postgresql/src/backend/postmaster/postmaster.c postgresql_with_patch/src/backend/postmaster/postmaster.c
*** postgresql/src/backend/postmaster/postmaster.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/postmaster.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 287,292 ****
--- 287,295 ----
  static PMState pmState = PM_INIT;
  
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ static bool ReachedHotStandbyRunning = false;	/* T if we've reached PM_HOT_STANDBY */
+ static bool WaitBackupForHotStandby = false;	/* T if we've moved from PM_WAIT_READONLY
+ 												 * to PM_WAIT_BACKUP */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
***************
*** 2825,2831 ****
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 			pmState = PM_WAIT_BACKENDS;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
--- 2828,2845 ----
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 		{
! 			/*
! 			 * WaitBackupForHotStandby is set if a smart shutdown was
! 			 * requested while taking an online backup from a hot standby.
! 			 * If the flag is true, we must go back to PM_WAIT_READONLY
! 			 * rather than on to PM_WAIT_BACKENDS.
! 			 */
! 			if (!WaitBackupForHotStandby)
! 				pmState = PM_WAIT_BACKENDS;
! 			else
! 				pmState = PM_WAIT_READONLY;
! 		}
  	}
  
  	if (pmState == PM_WAIT_READONLY)
***************
*** 2840,2850 ****
  		 */
  		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
  		{
! 			if (StartupPID != 0)
! 				signal_child(StartupPID, SIGTERM);
! 			if (WalReceiverPID != 0)
! 				signal_child(WalReceiverPID, SIGTERM);
! 			pmState = PM_WAIT_BACKENDS;
  		}
  	}
  
--- 2854,2878 ----
  		 */
  		if (CountChildren(BACKEND_TYPE_NORMAL) == 0)
  		{
! 			if (!BackupInProgress())
! 			{
! 				if (StartupPID != 0)
! 					signal_child(StartupPID, SIGTERM);
! 				if (WalReceiverPID != 0)
! 					signal_child(WalReceiverPID, SIGTERM);
! 				pmState = PM_WAIT_BACKENDS;
! 			}
! 			else
! 			{
! 				/*
! 				 * This is meaning that we execute smart shutdown during
! 				 * online backup from hot standby. we need to allow the
! 				 * connection to the backend by changing PM_WAIT_BACKUP
! 				 * to execute pg_stop_backup.
! 				 */
! 				WaitBackupForHotStandby = true;
! 				pmState = PM_WAIT_BACKUP;
! 			}
  		}
  	}
  
***************
*** 2993,3006 ****
  		{
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
--- 3021,3029 ----
  		{
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown if we reached normal running or hot standby.
  			 */
! 			if (ReachedNormalRunning || ReachedHotStandbyRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
***************
*** 4157,4162 ****
--- 4180,4186 ----
  		ereport(LOG,
  		(errmsg("database system is ready to accept read only connections")));
  
+ 		ReachedHotStandbyRunning = true;
  		pmState = PM_HOT_STANDBY;
  	}
  
diff -rcN postgresql/src/bin/pg_controldata/pg_controldata.c postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c
*** postgresql/src/bin/pg_controldata/pg_controldata.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 234,239 ****
--- 234,242 ----
  		   ControlFile.backupStartPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("Current wal_level setting:            %s\n"),
  		   wal_level_str(ControlFile.wal_level));
  	printf(_("Current max_connections setting:      %d\n"),
diff -rcN postgresql/src/bin/pg_ctl/pg_ctl.c postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c
*** postgresql/src/bin/pg_ctl/pg_ctl.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 882,894 ****
  	{
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
! 		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
  			print_msg(_("WARNING: online backup mode is active\n"
  						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
--- 882,891 ----
  	{
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
! 		 * that smart shutdown will wait for it to finish.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
  			print_msg(_("WARNING: online backup mode is active\n"
  						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
diff -rcN postgresql/src/bin/pg_resetxlog/pg_resetxlog.c postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c
*** postgresql/src/bin/pg_resetxlog/pg_resetxlog.c	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c	2011-09-12 05:24:42.000000000 +0900
***************
*** 503,509 ****
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 503,509 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 638,643 ****
--- 638,645 ----
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  
  	/*
  	 * Force the defaults for max_* settings. The values don't really matter
diff -rcN postgresql/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql/src/include/access/xlog.h	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-09-12 05:24:42.000000000 +0900
***************
*** 328,331 ****
--- 328,335 ----
  #define BACKUP_LABEL_FILE		"backup_label"
  #define BACKUP_LABEL_OLD		"backup_label.old"
  
+ /* These are written in backup label file */
+ #define BACKUP_FROM_MASTER		"master"
+ #define BACKUP_FROM_SLAVE		"slave"
+ 
  #endif   /* XLOG_H */
diff -rcN postgresql/src/include/catalog/pg_control.h postgresql_with_patch/src/include/catalog/pg_control.h
*** postgresql/src/include/catalog/pg_control.h	2011-09-12 05:19:14.000000000 +0900
--- postgresql_with_patch/src/include/catalog/pg_control.h	2011-09-12 05:24:42.000000000 +0900
***************
*** 143,152 ****
--- 143,158 ----
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
  	 * file was found at startup but it may have been a leftover from a stray
  	 * pg_start_backup() call, not accompanied by pg_stop_backup().
+ 	 *
+ 	 * backupEndPoint is set a same value as minRecoveryPoint at the first
+ 	 * recovery if the backup is taken from hot standby. It is unset if we
+ 	 * reach the end of backup. It is not set if the backup is taken from
+ 	 * normal running.
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
  	bool		backupEndRequired;
+ 	XLogRecPtr	backupEndPoint;
  
  	/*
  	 * Parameter settings that determine if the WAL can be used for archival

#18

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Jun Ishiduka (#17)

Re: Online base backup from the hot-standby

2011/9/13 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

Update patch.

Changes:
* the user must set full_page_writes to 'on' (in document)
* read "FROM: XX" in backup_label (in xlog.c)
* check status when pg_stop_backup is executed (in xlog.c)
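
As a rough illustration of the second change listed above (reading "FROM: XX"
from backup_label), here is a minimal standalone sketch. It is not the code
from the patch: parse_backup_from and BACKUP_FROM_LEN are hypothetical names,
and only the "master"/"slave" strings and the 10-byte buffer size are taken
from the patch. The sketch uses a bounded %9s conversion so that an overlong
value cannot overrun the buffer, and it falls back to "master" when the line
is missing, since a label written before this change can only have come from
a master.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BACKUP_FROM_MASTER	"master"
#define BACKUP_FROM_SLAVE	"slave"
#define BACKUP_FROM_LEN		10	/* same size as backupfromstr[] in the patch */

/*
 * Hypothetical helper: extract the value of a "FROM: xxx" line.
 * Returns 1 if the line was parsed and holds a known value, 0 otherwise.
 * On failure "out" is left set to "master".
 */
static int
parse_backup_from(const char *line, char *out)
{
	char		value[BACKUP_FROM_LEN];

	assert(line != NULL && out != NULL);

	/* default for labels that have no (valid) FROM line */
	strcpy(out, BACKUP_FROM_MASTER);

	/* %9s reads at most 9 characters plus the terminating NUL */
	if (sscanf(line, "FROM: %9s", value) != 1)
		return 0;

	if (strcmp(value, BACKUP_FROM_MASTER) != 0 &&
		strcmp(value, BACKUP_FROM_SLAVE) != 0)
		return 0;

	strcpy(out, value);
	return 1;
}
```

The caller passes one line of the label file and a buffer of at least
BACKUP_FROM_LEN bytes, and can emit a warning when 0 is returned.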

Thanks for updating the patch.

Before reviewing the patch, and to encourage people to comment on and
review it, let me explain what it provides:

This patch provides the capability to take a base backup during recovery,
i.e., from the standby server. This is a very useful feature for offloading the
cost of periodic backups from the master. The backup procedure is
similar to the one used during normal running, but slightly different:

1. Execute pg_start_backup on the standby. To execute a query on the
standby, hot standby must be enabled.

2. Perform a file system backup on the standby.

3. Copy the pg_control file from the cluster directory on the standby to
the backup as follows:

cp $PGDATA/global/pg_control /mnt/server/backupdir/global

4. Execute pg_stop_backup on the standby.

The backup taken by the above procedure can be used for archive
recovery or to set up a standby server.

If the standby is promoted during a backup, pg_stop_backup() detects
the change of the server status and fails. The data backed up before the
promotion is invalid and cannot be used for recovery.
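
The promotion check described above can be sketched roughly as follows. This
is a simplified standalone illustration, not the code from the patch:
backup_state_unchanged is a hypothetical name, and the sketch compares only
the FROM value recorded in backup_label with the current recovery state,
whereas the patch also looks at the state stored in pg_control.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical check: "label_from" is the FROM value written by
 * pg_start_backup ("master" or "slave"), "in_recovery" is the server
 * state seen when pg_stop_backup runs.  Returns true when the two
 * agree, false when the server changed role and the backup must be
 * retaken.
 */
static bool
backup_state_unchanged(const char *label_from, bool in_recovery)
{
	assert(label_from != NULL);

	if (in_recovery)
		return strcmp(label_from, "slave") == 0;

	return strcmp(label_from, "master") == 0;
}
```

A backup started on a standby ("slave") and stopped after promotion
(in_recovery is false) gives false, which corresponds to the error case.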

Taking a backup from the standby by using pg_basebackup is still not
possible. But we can relax that restriction after applying this patch.

To take a base backup during recovery safely, several parameters
must be set properly. Hot standby must be enabled on the standby, i.e.,
wal_level must be set to hot_standby on the master and hot_standby must be
enabled on the standby. FPW (full page writes) is required for a base backup,
so full_page_writes must be enabled on the master.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#19

Magnus Hagander

magnus@hagander.net

over 14 years ago

In reply to: Fujii Masao (#18)

Re: Online base backup from the hot-standby

On Wed, Sep 21, 2011 at 04:50, Fujii Masao <masao.fujii@gmail.com> wrote:

2011/9/13 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

Update patch.

Changes:
* the user must set full_page_writes to 'on' (in document)
* read "FROM: XX" in backup_label (in xlog.c)
* check status when pg_stop_backup is executed (in xlog.c)

Thanks for updating the patch.

Before reviewing the patch, and to encourage people to comment on and
review it, let me explain what it provides:

This patch provides the capability to take a base backup during recovery,
i.e., from the standby server. This is a very useful feature for offloading the
cost of periodic backups from the master. The backup procedure is
similar to the one used during normal running, but slightly different:

1. Execute pg_start_backup on the standby. To execute a query on the
standby, hot standby must be enabled.

2. Perform a file system backup on the standby.

3. Copy the pg_control file from the cluster directory on the standby to
the backup as follows:

cp $PGDATA/global/pg_control /mnt/server/backupdir/global

But this is done as part of step 2 already. I assume what this really
means is that the pg_control file must be the last file backed up?

(Since there are certainly a lot of other ways to do the backup than just
cp to a mounted directory..)

4. Execute pg_stop_backup on the standby.

The backup taken by the above procedure can be used for archive
recovery or to set up a standby server.

If the standby is promoted during a backup, pg_stop_backup() detects
the change of the server status and fails. The data backed up before the
promotion is invalid and cannot be used for recovery.

Taking a backup from the standby by using pg_basebackup is still not
possible. But we can relax that restriction after applying this patch.

I think that this is going to be very important, particularly given
the requirements on pt 3 above. (But yes, it certainly doesn't have to
be done as part of this patch, but it really should be the plan to
have this included in the same version)

To take a base backup during recovery safely, several parameters
must be set properly. Hot standby must be enabled on the standby, i.e.,
wal_level must be set to hot_standby on the master and hot_standby must be
enabled on the standby. FPW (full page writes) is required for a base backup,
so full_page_writes must be enabled on the master.

Presumably pg_start_backup() will check this. And we'll somehow track
this before pg_stop_backup() as well? (for evil things such as
the user changing FPW from on to off and then back to on again during
a backup, which will make it look correct both during start and stop,
but incorrect in the middle - pg_stop_backup needs to fail in that
case as well)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#20

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Magnus Hagander (#19)

Re: Online base backup from the hot-standby

On Wed, Sep 21, 2011 at 2:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Sep 21, 2011 at 04:50, Fujii Masao <masao.fujii@gmail.com> wrote:

3. Copy the pg_control file from the cluster directory on the standby to
the backup as follows:

cp $PGDATA/global/pg_control /mnt/server/backupdir/global

But this is done as part of step 2 already. I assume what this really
means is that the pg_control file must be the last file backed up?

Yes.

When we perform an archive recovery from a backup taken during
normal processing, we get the backup end location from the backup-end
WAL record which was written by pg_stop_backup(). But since no WAL
writing is allowed during recovery, pg_stop_backup() on the standby
cannot write a backup-end WAL record. So, in his patch, instead of
a backup-end WAL record, the startup process uses the minimum
recovery point recorded in the pg_control file included in the
backup as the backup end location. BTW, the backup end location is
used to check whether recovery has reached a consistent state
(i.e., end-of-backup).

To use the minimum recovery point in pg_control as a backup end
location safely, pg_control must be backed up last. Otherwise, a data
page with an LSN newer than the minimum recovery point
might be included in the backup.
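
For a backup tool that copies from a list of files, the "pg_control last"
requirement could be handled along these lines. This is only a sketch under
assumptions: move_control_file_last is a hypothetical helper, the paths are
assumed to be relative to the data directory, and nothing like it is part of
the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CONTROL_FILE_PATH	"global/pg_control"

/*
 * Move the pg_control entry to the end of a list of relative paths,
 * keeping the order of all other entries.  Returns the index where
 * pg_control ended up, or -1 if the list does not contain it.
 */
static int
move_control_file_last(const char **paths, size_t npaths)
{
	size_t		i;

	assert(paths != NULL || npaths == 0);

	for (i = 0; i < npaths; i++)
	{
		if (strcmp(paths[i], CONTROL_FILE_PATH) == 0)
		{
			const char *control = paths[i];

			/* shift the remaining entries down by one slot */
			memmove(&paths[i], &paths[i + 1],
					(npaths - i - 1) * sizeof(paths[0]));
			paths[npaths - 1] = control;
			return (int) (npaths - 1);
		}
	}

	return -1;
}
```

The copy loop would then walk the reordered list from first to last, so that
pg_control is read only after every data file has been copied.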

(Since there are certainly a lot of other ways to do the backup than just
cp to a mounted directory..)

Yes. The above command I described is just an example.

4. Execute pg_stop_backup on the standby.

The backup taken by the above procedure can be used for archive
recovery or to set up a standby server.

If the standby is promoted during a backup, pg_stop_backup() detects
the change of the server status and fails. The data backed up before the
promotion is invalid and cannot be used for recovery.

Taking a backup from the standby by using pg_basebackup is still not
possible. But we can relax that restriction after applying this patch.

I think that this is going to be very important, particularly given
the requirements on pt 3 above. (But yes, it certainly doesn't have to
be done as part of this patch, but it really should be the plan to
have this included in the same version)

Agreed.

To take a base backup during recovery safely, several parameters
must be set properly. Hot standby must be enabled on the standby, i.e.,
wal_level must be set to hot_standby on the master and hot_standby must be
enabled on the standby. FPW (full page writes) is required for a base backup,
so full_page_writes must be enabled on the master.

Presumably pg_start_backup() will check this. And we'll somehow track
this before pg_stop_backup() as well? (for evil things such as
the user changing FPW from on to off and then back to on again during
a backup, which will make it look correct both during start and stop,
but incorrect in the middle - pg_stop_backup needs to fail in that
case as well)

Right. As I suggested upthread, to address that problem, we need to log
changes of FPW on the master, and then check whether
such a WAL record was replayed on the standby during the backup. If it was,
pg_stop_backup() should emit an error.
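
The idea can be sketched as a small piece of state on the standby. Everything
here is hypothetical: no WAL record for FPW changes exists in the patch as
posted, and StandbyBackupState and the three functions are made-up names used
only to show the bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-backup bookkeeping kept on the standby. */
typedef struct
{
	bool		in_progress;	/* between pg_start_backup and pg_stop_backup */
	bool		fpw_off_seen;	/* FPW was off at some point during the backup */
} StandbyBackupState;

/* Called by pg_start_backup with the FPW setting known at that time. */
static void
standby_backup_begin(StandbyBackupState *st, bool fpw_on)
{
	assert(st != NULL);
	st->in_progress = true;
	st->fpw_off_seen = !fpw_on;
}

/* Called when replay sees a (hypothetical) record logging an FPW change. */
static void
standby_replay_fpw_change(StandbyBackupState *st, bool new_fpw_on)
{
	assert(st != NULL);
	if (st->in_progress && !new_fpw_on)
		st->fpw_off_seen = true;
}

/* Called by pg_stop_backup; returns false if the backup must be rejected. */
static bool
standby_backup_end(StandbyBackupState *st)
{
	bool		ok;

	assert(st != NULL);
	ok = st->in_progress && !st->fpw_off_seen;
	st->in_progress = false;
	return ok;
}
```

With this bookkeeping, an on, off, on sequence during the backup still makes
standby_backup_end return false, which is the case Magnus describes.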

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#21

Magnus Hagander

magnus@hagander.net

over 14 years ago

In reply to: Fujii Masao (#20)

Re: Online base backup from the hot-standby

On Wed, Sep 21, 2011 at 08:23, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Sep 21, 2011 at 2:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Sep 21, 2011 at 04:50, Fujii Masao <masao.fujii@gmail.com> wrote:

3. Copy the pg_control file from the cluster directory on the standby to
the backup as follows:

cp $PGDATA/global/pg_control /mnt/server/backupdir/global

But this is done as part of step 2 already. I assume what this really
means is that the pg_control file must be the last file backed up?

Yes.

When we perform an archive recovery from the backup taken during
normal processing, we get a backup end location from the backup-end
WAL record which was written by pg_stop_backup(). But since no WAL
writing is allowed during recovery, pg_stop_backup() on the standby
cannot write a backup-end WAL record. So, in his patch, instead of
a backup-end WAL record, the startup process uses the minimum
recovery point recorded in pg_control, which has been included in the
backup, as a backup end location. BTW, a backup end location is
used to check whether recovery has reached a consistent state
(i.e., end-of-backup).

To use the minimum recovery point in pg_control as a backup end
location safely, pg_control must be backed up last. Otherwise, a data
page with an LSN newer than the minimum recovery point
might be included in the backup.
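
The end-of-backup test described above can be sketched roughly as follows. This is only an illustration, not the patch's code: the `Lsn` type, the `BackupControl` struct, and `reached_backup_end()` are hypothetical names, a plain 64-bit integer stands in for the server's WAL location type, and the extra guard for an unknown end location is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 64-bit WAL location; a stand-in for the server's own type. */
typedef uint64_t Lsn;

#define INVALID_LSN ((Lsn) 0)

/* Hypothetical subset of pg_control that matters for the end-of-backup test. */
typedef struct BackupControl
{
	Lsn		backup_start_point;	/* nonzero while recovering from a base backup */
	Lsn		backup_end_point;	/* minimum recovery point saved with the backup */
} BackupControl;

/*
 * Return true once replay has reached the backup end location, and clear
 * both markers so that the check fires only once.
 */
static bool
reached_backup_end(BackupControl *ctl, Lsn replayed_upto)
{
	if (ctl->backup_start_point == INVALID_LSN)
		return false;			/* not recovering from a base backup */
	if (ctl->backup_end_point == INVALID_LSN)
		return false;			/* end location unknown (sketch's assumption) */
	if (ctl->backup_end_point > replayed_upto)
		return false;			/* backup end not replayed yet */

	/* Data on disk is consistent from here on. */
	ctl->backup_start_point = INVALID_LSN;
	ctl->backup_end_point = INVALID_LSN;
	return true;
}
```

If pg_control were copied before some data files, a page in the backup could carry an LSN beyond `backup_end_point`, and a check like this would declare consistency too early. That is why pg_control has to be the last file backed up.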

Ah, check.

(Since there are certainly a lot other ways to do the backup than just
cp to a mounted directory..)

Yes. The above command I described is just an example.

ok.

4. Execute pg_stop_backup on the standby.

The backup taken by the above procedure is available for an archive
recovery or standby server.

If the standby is promoted during a backup, pg_stop_backup() detects
the change of the server status and fails. The data backed up before the
promotion is invalid and not available for recovery.
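
One way to picture the promotion check is a comparison between the state recorded at pg_start_backup() and the state seen at pg_stop_backup(). The sketch below assumes the "SYSTEM STATUS:" label line used by the patch attached later in this thread; the function name `backup_status_consistent()` is hypothetical, and treating a missing line as a failure is a choice of this sketch.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if the backup described by the label text may be finished in
 * the server's current state. A backup that was started during recovery is
 * rejected if the server is no longer in recovery (i.e., it was promoted).
 */
static bool
backup_status_consistent(const char *label, bool in_recovery_now)
{
	char		status[20];
	const char *line = strstr(label, "SYSTEM STATUS:");

	/* Missing or malformed line: treat the label as invalid. */
	if (line == NULL || sscanf(line, "SYSTEM STATUS: %19s", status) != 1)
		return false;

	/* Started on a standby, but the standby has since been promoted. */
	if (strcmp(status, "recovery") == 0 && !in_recovery_now)
		return false;

	return true;
}
```

Note that `%19s` stops at whitespace, so a value such as "in production" is read as "in". That is enough here because only the value "recovery" is compared.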

Taking a backup from the standby by using pg_basebackup is still not
possible. But we can relax that restriction after applying this patch.

I think that this is going to be very important, particularly given
the requirements on pt 3 above. (But yes, it certainly doesn't have to
be done as part of this patch, but it really should be the plan to
have this included in the same version)

Agreed.

To take a base backup during recovery safely, several parameters
must be set properly. Hot standby must be enabled on the standby, i.e.,
wal_level must be set to hot_standby on the master and hot_standby must
be enabled on the standby. FPW (full page writes) is required for a base
backup, so full_page_writes must be enabled on the master.

Presumably pg_start_backup() will check this. And we'll somehow track
this before pg_stop_backup() as well? (for evil things such as
the user changing FPW from on to off and then back to on again during
a backup, which will make it look correct both during start and stop,
but incorrect in the middle - pg_stop_backup needs to fail in that
case as well)

Right. As I suggested upthread, to address that problem, we need to log
the change of FPW on the master, and then we need to check whether
such a WAL is replayed on the standby during the backup. If it's done,
pg_stop_backup() should emit an error.

I somehow missed this thread completely, so I didn't catch your
previous comments - oops, sorry. The important point being that we
need to track if and when this happens, even if it has been reset to a
valid value. So we can't just check the state of the variable at the
beginning and at the end.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#22

Josh Berkus

josh@agliodbs.com

over 14 years ago

In reply to: Fujii Masao (#18)

Remastering using streaming only replication?

Fujii,

I haven't really been following your latest patches about taking backups
from the standby and cascading replication, but I wanted to see if it
fulfills another TODO: the ability to "remaster" (that is, designate the
"lead standby" as the new master) without needing to copy WAL files.

Supporting remastering using streaming replication only was on your TODO
list when we closed 9.1. It seems like this would get solved as a
side-effect, but I wanted to confirm that.

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#23

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Magnus Hagander (#21)

Re: Online base backup from the hot-standby

On Wed, Sep 21, 2011 at 5:34 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Sep 21, 2011 at 08:23, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Sep 21, 2011 at 2:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

Presumably pg_start_backup() will check this. And we'll somehow track
this before pg_stop_backup() as well? (for evil things such as
the user changing FPW from on to off and then back to on again during
a backup, which will make it look correct both during start and stop,
but incorrect in the middle - pg_stop_backup needs to fail in that
case as well)

Right. As I suggested upthread, to address that problem, we need to log
the change of FPW on the master, and then we need to check whether
such a WAL is replayed on the standby during the backup. If it's done,
pg_stop_backup() should emit an error.

I somehow missed this thread completely, so I didn't catch your
previous comments - oops, sorry. The important point being that we
need to track if and when this happens, even if it has been reset to a
valid value. So we can't just check the state of the variable at the
beginning and at the end.

Right. Let me explain again what I'm thinking.

When FPW is changed, the master always writes the WAL record
which contains the current value of FPW. This means that the standby
can track all changes of FPW by reading WAL records.

The standby has two flags: One indicates whether FPW has always
been TRUE since last restartpoint. Another indicates whether FPW
has always been TRUE since last pg_start_backup(). The standby
can maintain those flags by reading WAL records streamed from
the master.

If the former flag indicates FALSE (i.e., the WAL records which
the standby has replayed since last restartpoint might not contain
required FPW), pg_start_backup() fails. If the latter flag indicates
FALSE (i.e., the WAL records which the standby has replayed
during the backup might not contain required FPW),
pg_stop_backup() fails.

If I'm not missing something, this approach can address the problem
which you're concerned about.
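
A minimal sketch of the two-flag idea, to make the on/off/on case concrete. Everything here is illustrative: the `FpwTracker` struct and the `fpw_*` functions are hypothetical names, not part of any patch, and resetting the first flag to the current FPW value at a restartpoint is a simplification.

```c
#include <stdbool.h>

/* Hypothetical standby-side state, maintained while replaying WAL. */
typedef struct FpwTracker
{
	bool		current_fpw;			/* latest FPW value seen in WAL */
	bool		fpw_since_restartpoint;	/* always on since last restartpoint? */
	bool		fpw_since_backup_start;	/* always on since pg_start_backup()? */
} FpwTracker;

/* Replay of a WAL record announcing a change of full_page_writes. */
static void
fpw_replay_change(FpwTracker *t, bool new_fpw)
{
	t->current_fpw = new_fpw;
	if (!new_fpw)
	{
		/* Turning FPW back on later does not restore these flags. */
		t->fpw_since_restartpoint = false;
		t->fpw_since_backup_start = false;
	}
}

/* A restartpoint begins a new interval for the first flag. */
static void
fpw_restartpoint(FpwTracker *t)
{
	t->fpw_since_restartpoint = t->current_fpw;
}

/* pg_start_backup(): fail if WAL since the last restartpoint may lack FPW. */
static bool
fpw_start_backup(FpwTracker *t)
{
	if (!t->fpw_since_restartpoint)
		return false;
	t->fpw_since_backup_start = true;
	return true;
}

/* pg_stop_backup(): fail if FPW was off at any point during the backup. */
static bool
fpw_stop_backup(const FpwTracker *t)
{
	return t->fpw_since_backup_start;
}
```

With this shape, a sequence of start, FPW off, FPW on, stop makes `fpw_stop_backup()` return false even though FPW is on at both ends, which is the case raised above.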

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#24

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Fujii Masao (#18)

1 attachment(s)

Re: Online base backup from the hot-standby

On Wed, Sep 21, 2011 at 11:50 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

2011/9/13 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

Update patch.

Changes:
* set 'on' full_page_writes by user (in document)
* read "FROM: XX" in backup_label (in xlog.c)
* check status when pg_stop_backup is executed (in xlog.c)

Thanks for updating the patch.

Before reviewing the patch, to encourage people to comment and
review the patch, I explain what this patch provides:

Attached is the updated version of the patch. I refactored the code, fixed
some bugs, added lots of source code comments, improved the document,
but didn't change the basic design. Please check this patch, and let's use
this patch as the base if you agree with that.

In the current patch, there is no safeguard for preventing users from
taking backup during recovery when FPW is disabled. This is unsafe.
Are you planning to implement such a safeguard?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

standby_online_backup_08_fujii.patch (text/x-patch; charset=US-ASCII)

*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 935,940 **** SELECT pg_stop_backup();
--- 935,999 ----
     </para>
    </sect2>
  
+   <sect2 id="backup-during-recovery">
+    <title>Making a Base Backup during Recovery</title>
+ 
+    <para>
+     It's possible to make a base backup during recovery. This allows a user
+     to take a base backup from the standby to offload the expense of
+     periodic backups from the master. The procedure is similar to that
+     during normal running.
+   <orderedlist>
+    <listitem>
+     <para>
+      Ensure that hot standby is enabled (see <xref linkend="hot-standby">
+      for more information).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Connect to the database as a superuser and execute <function>pg_start_backup</>.
+      This performs a restartpoint if there is at least one checkpoint record
+      replayed since last restartpoint.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Perform a file system backup.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Copy the pg_control file from the cluster directory to the backup as follows:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Again connect to the database as a superuser, and execute
+      <function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+     </para>
+    </listitem>
+   </orderedlist>
+    </para>
+ 
+    <para>
+     You cannot use the <application>pg_basebackup</> tool to take the backup
+     during recovery.
+    </para>
+    <para>
+     It's not possible to make a base backup from the server in recovery mode
+     when reading WAL written during a period when <varname>full_page_writes</>
+     was disabled. If you take a base backup from the standby,
+     <varname>full_page_writes</> must be set to true on the master.
+    </para>
+   </sect2>
+ 
    <sect2 id="backup-pitr-recovery">
     <title>Recovering Using a Continuous Archive Backup</title>
  
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1680,1685 **** SET ENABLE_SEQSCAN TO OFF;
--- 1680,1693 ----
         </para>
  
         <para>
+         WAL written while <varname>full_page_writes</> is disabled does not
+         contain enough information to make a base backup during recovery
+         (see <xref linkend="backup-during-recovery">),
+         so <varname>full_page_writes</> must be enabled on the master
+         to take a backup from the standby.
+        </para>
+ 
+        <para>
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
          The default is <literal>on</>.
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 14014,14020 **** SELECT set_config('log_statement_stats', 'off', false);
     <para>
      The functions shown in <xref
      linkend="functions-admin-backup-table"> assist in making on-line backups.
!     These functions cannot be executed during recovery.
     </para>
  
     <table id="functions-admin-backup-table">
--- 14014,14021 ----
     <para>
      The functions shown in <xref
      linkend="functions-admin-backup-table"> assist in making on-line backups.
!     These functions except <function>pg_start_backup</> and <function>pg_stop_backup</>
!     cannot be executed during recovery.
     </para>
  
     <table id="functions-admin-backup-table">
***************
*** 14094,14100 **** SELECT set_config('log_statement_stats', 'off', false);
      database cluster's data directory, performs a checkpoint,
      and then returns the backup's starting transaction log location as text.
      The user can ignore this result value, but it is
!     provided in case it is useful.
  <programlisting>
  postgres=# select pg_start_backup('label_goes_here');
   pg_start_backup
--- 14095,14103 ----
      database cluster's data directory, performs a checkpoint,
      and then returns the backup's starting transaction log location as text.
      The user can ignore this result value, but it is
!     provided in case it is useful. If <function>pg_start_backup</> is
!     executed during recovery, it performs a restartpoint rather than
!     writing a new checkpoint.
  <programlisting>
  postgres=# select pg_start_backup('label_goes_here');
   pg_start_backup
***************
*** 14122,14127 **** postgres=# select pg_start_backup('label_goes_here');
--- 14125,14137 ----
     </para>
  
     <para>
+     If <function>pg_stop_backup</> is executed during recovery, it just
+     removes the label file, but doesn't create a backup history file and wait for
+     the ending transaction log file to be archived. The return value is equal to
+     or bigger than the exact backup's ending transaction log location.
+    </para>
+ 
+    <para>
      <function>pg_switch_xlog</> moves to the next transaction log file, allowing the
      current file to be archived (assuming you are using continuous archiving).
      The return value is the ending transaction log location + 1 within the just-completed transaction log file.
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 664,670 **** static void xlog_outrec(StringInfo buf, XLogRecord *record);
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 664,670 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, bool *backupDuringRecovery);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 6023,6028 **** StartupXLOG(void)
--- 6023,6030 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	bool		backupDuringRecovery = false;
+ 	DBState	save_state;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6156,6162 **** StartupXLOG(void)
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6158,6165 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
! 						  &backupDuringRecovery))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6312,6317 **** StartupXLOG(void)
--- 6315,6321 ----
  		 * pg_control with any minimum recovery stop point obtained from a
  		 * backup history file.
  		 */
+ 		save_state = ControlFile->state;
  		if (InArchiveRecovery)
  			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
  		else
***************
*** 6332,6343 **** StartupXLOG(void)
  		}
  
  		/*
! 		 * set backupStartPoint if we're starting recovery from a base backup
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
--- 6336,6369 ----
  		}
  
  		/*
! 		 * Set backupStartPoint if we're starting recovery from a base backup.
! 		 *
! 		 * Set backupEndPoint if we're starting recovery from a base backup
! 		 * which was taken from the server in recovery mode. We confirm
! 		 * that minRecoveryPoint can be used as the backup end location by
! 		 * checking whether the database system status in pg_control indicates
! 		 * DB_IN_ARCHIVE_RECOVERY. If minRecoveryPoint is not available,
! 		 * there is no way to know the backup end location, so we cannot
! 		 * advance recovery any more. In this case, we have to cancel recovery
! 		 * before changing the database system status in pg_control to
! 		 * DB_IN_ARCHIVE_RECOVERY because otherwise subsequent
! 		 * restarted recovery would go through this check wrongly.
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
+ 
+ 			if (backupDuringRecovery)
+ 			{
+ 				if (save_state != DB_IN_ARCHIVE_RECOVERY)
+ 					ereport(FATAL,
+ 							(errmsg("database system status mismatches between "
+ 									"pg_control and backup_label"),
+ 							 errhint("This means that the backup is corrupted and you will "
+ 									 "have to use another backup for recovery.")));
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 			}
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
***************
*** 6617,6622 **** StartupXLOG(void)
--- 6643,6670 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 					XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 				{
+ 					/*
+ 					 * We have reached the end of base backup, the point where
+ 					 * the minimum recovery point in pg_control which was
+ 					 * backed up just before pg_stop_backup() indicates.
+ 					 * The data on disk is now consistent. Reset backupStartPoint
+ 					 * and backupEndPoint.
+ 					 */
+ 					elog(DEBUG1, "end of backup reached");
+ 
+ 					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 					MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 					MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 					ControlFile->backupEndRequired = false;
+ 					UpdateControlFile();
+ 
+ 					LWLockRelease(ControlFileLock);
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 8417,8423 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8465,8472 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8880,8885 **** XLogRecPtr
--- 8929,8935 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8891,8908 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
! 
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 8941,8960 ----
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
+ 	recovery_in_progress = RecoveryInProgress();
+ 
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level. Because the fact that
! 	 * we are now executing pg_start_backup() during recovery means that
! 	 * wal_level is set to hot_standby on the master, i.e., WAL level is sufficient
! 	 * for making an online backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8924,8931 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 8976,8988 ----
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
+ 	 *
+ 	 * During recovery, we skip forcing XLOG file switch, which means that
+ 	 * the backup taken during recovery is not available for the special recovery
+ 	 * case described above.
  	 */
! 	if (!recovery_in_progress)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8941,8946 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 8998,9006 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
+ 	 * Note that forcePageWrites has no effect during an online backup from
+ 	 * the server in recovery mode.
+ 	 *
  	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
***************
*** 8970,8980 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
--- 9030,9048 ----
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
+ 			 * During recovery, establish a restartpoint if possible. We use the last
+ 			 * restartpoint as the backup starting checkpoint. This means that two
+ 			 * successive backup runs can have same checkpoint positions.
+ 			 *
+ 			 * Since the fact that we are executing pg_start_backup() during
+ 			 * recovery means that bgwriter is running, we can use
+ 			 * RequestCheckpoint() to establish a restartpoint.
+ 			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
***************
*** 9002,9007 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9070,9081 ----
  			 * for each backup instead of forcing another checkpoint, but
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
+ 			 *
+ 			 * During recovery, since we don't use the end-of-backup WAL
+ 			 * record and don't write the backup history file, the starting WAL
+ 			 * location doesn't need to be unique. This means that two base
+ 			 * backups started at the same time might use the same checkpoint
+ 			 * as starting locations.
  			 */
  			LWLockAcquire(WALInsertLock, LW_SHARED);
  			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
***************
*** 9010,9015 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9084,9091 ----
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
+ 			if (recovery_in_progress)
+ 				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9031,9036 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9107,9114 ----
  						 checkpointloc.xlogid, checkpointloc.xrecoff);
  		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  						 exclusive ? "pg_start_backup" : "streamed");
+ 		appendStringInfo(&labelfbuf, "SYSTEM STATUS: %s\n",
+ 						 recovery_in_progress ? "recovery" : "in production");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
***************
*** 9123,9128 **** pg_start_backup_callback(int code, Datum arg)
--- 9201,9208 ----
   * history file at the beginning of archive recovery, but we now use the WAL
   * record for that and the file is for informational and debug purposes only.
   *
+  * During recovery, we only remove the backup label file.
+  *
   * Note: different from CancelBackup which just cancels online backup mode.
   */
  Datum
***************
*** 9149,9154 **** XLogRecPtr
--- 9229,9235 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9159,9164 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9240,9246 ----
  	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
+ 	char		systemstatus[20];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
***************
*** 9168,9186 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
! 
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9250,9271 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	char	   *ptr;
+ 
+ 	recovery_in_progress = RecoveryInProgress();
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level. Because the fact that
! 	 * we are now executing pg_stop_backup() means that wal_level is set to
! 	 * hot_standby on the master, i.e., WAL level is sufficient for making an online
! 	 * backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9271,9276 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9356,9413 ----
  	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
  
  	/*
+ 	 * Parse the SYSTEM STATUS line, and check that database system
+ 	 * status matches between pg_start_backup() and pg_stop_backup().
+ 	 */
+ 	ptr = strstr(remaining, "SYSTEM STATUS:");
+ 	if (ptr == NULL || sscanf(ptr, "SYSTEM STATUS: %19s\n", systemstatus) != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	if (strcmp(systemstatus, "recovery") == 0 && !recovery_in_progress)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("database system status mismatches between "
+ 						"pg_start_backup() and pg_stop_backup()")));
+ 
+ 	/*
+ 	 * During recovery, we don't write an end-of-backup record. We can
+ 	 * assume that pg_control was backed up just before pg_stop_backup()
+ 	 * and its minimum recovery point can be available as the backup end
+ 	 * location. Without an end-of-backup record, we can check correctly
+ 	 * whether we've reached the end of backup when starting recovery
+ 	 * from this backup.
+ 	 *
+ 	 * We don't force a switch to new WAL file and wait for all the required
+ 	 * files to be archived. This is okay if we use the backup to start
+ 	 * the standby. But, if it's for an archive recovery, to ensure all the
+ 	 * required files are available, a user should wait for them to be archived,
+ 	 * or include them into the backup after pg_stop_backup().
+ 	 *
+ 	 * We return the current minimum recovery point as the backup end
+ 	 * location. Note that it would be bigger than the exact backup end
+ 	 * location if the minimum recovery point is updated since the backup
+ 	 * of pg_control. The return value of pg_stop_backup() is often used
+ 	 * for a user to calculate the required files. Returning approximate
+ 	 * location is harmless for that use because it's guaranteed not to be
+ 	 * smaller than the exact backup end location.
+ 	 *
+ 	 * XXX currently a backup history file is for informational and debug
+ 	 * purposes only. It's not essential for an online backup. Furthermore,
+ 	 * even if it's created, it will not be archived during recovery because
+ 	 * an archiver is not invoked. So it doesn't seem worthwhile to write
+ 	 * a backup history file during recovery.
+ 	 */
+ 	if (recovery_in_progress)
+ 	{
+ 		LWLockAcquire(ControlFileLock, LW_SHARED);
+ 		stoppoint = ControlFile->minRecoveryPoint;
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		return stoppoint;
+ 	}
+ 
+ 	/*
  	 * Write the backup-end xlog record
  	 */
  	rdata.data = (char *) (&startpoint);
***************
*** 9787,9804 **** pg_xlogfile_name(PG_FUNCTION_ARGS)
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
  	/*
  	 * See if label file is present
--- 9924,9945 ----
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE. If this backup_label
!  * was created during recovery, *backupDuringRecovery is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
! 				  bool *backupDuringRecovery)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		systemstatus[20];
  
  	*backupEndRequired = false;
+ 	*backupDuringRecovery = false;
  
  	/*
  	 * See if label file is present
***************
*** 9832,9847 **** read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD line is new in 9.1. We can't restore from an older backup
! 	 * anyway, but since the information on it is not strictly required, don't
! 	 * error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
--- 9973,9994 ----
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD and SYSTEM STATUS lines are new in 9.2. We can't
! 	 * restore from an older backup anyway, but since the information on it
! 	 * is not strictly required, don't error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
+ 	if (fscanf(lfp, "SYSTEM STATUS: %19s\n", systemstatus) == 1)
+ 	{
+ 		if (strcmp(systemstatus, "recovery") == 0)
+ 			*backupDuringRecovery = true;
+ 	}
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 287,292 **** typedef enum
--- 287,294 ----
  static PMState pmState = PM_INIT;
  
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ static bool OnlineBackupAllowed = false;		/* T if we've reached PM_RUN or
+ 												 * PM_HOT_STANDBY */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
***************
*** 2105,2122 **** pmdie(SIGNAL_ARGS)
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 
! 				/*
! 				 * If we're in recovery, we can't kill the startup process
! 				 * right away, because at present doing so does not release
! 				 * its locks.  We might want to change this in a future
! 				 * release.  For the time being, the PM_WAIT_READONLY state
! 				 * indicates that we're waiting for the regular (read only)
! 				 * backends to die off; once they do, we'll kill the startup
! 				 * and walreceiver processes.
! 				 */
! 				pmState = (pmState == PM_RUN) ?
! 					PM_WAIT_BACKUP : PM_WAIT_READONLY;
  			}
  
  			/*
--- 2107,2113 ----
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 				pmState = PM_WAIT_BACKUP;
  			}
  
  			/*
***************
*** 2299,2304 **** reaper(SIGNAL_ARGS)
--- 2290,2296 ----
  			 */
  			FatalError = false;
  			ReachedNormalRunning = true;
+ 			OnlineBackupAllowed = true;
  			pmState = PM_RUN;
  
  			/*
***************
*** 2823,2831 **** PostmasterStateMachine(void)
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 			pmState = PM_WAIT_BACKENDS;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
--- 2815,2831 ----
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
+ 		 *
+ 		 * If we're in recovery, we can't kill the startup process right away,
+ 		 * because at present doing so does not release its locks.  We might
+ 		 * want to change this in a future release.  For the time being,
+ 		 * the PM_WAIT_READONLY state indicates that we're waiting for
+ 		 * the regular (read only) backends to die off; once they do,
+ 		 * we'll kill the startup and walreceiver processes.
  		 */
  		if (!BackupInProgress())
! 			pmState = ReachedNormalRunning ?
! 				PM_WAIT_BACKENDS : PM_WAIT_READONLY;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
***************
*** 2994,3006 **** PostmasterStateMachine(void)
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
--- 2994,3006 ----
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running and hot standby, it only makes sense to do this
! 			 * if we reached normal running or hot standby. If we have not
! 			 * reached a consistent recovery state yet, the backup file is
! 			 * one we're recovering *from*, and we must keep it around
! 			 * so that recovery restarts from the right place.
  			 */
! 			if (OnlineBackupAllowed)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
***************
*** 4157,4162 **** sigusr1_handler(SIGNAL_ARGS)
--- 4157,4163 ----
  		ereport(LOG,
  		(errmsg("database system is ready to accept read only connections")));
  
+ 		OnlineBackupAllowed = true;
  		pmState = PM_HOT_STANDBY;
  	}
  
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 232,237 **** main(int argc, char *argv[])
--- 232,240 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
  	printf(_("Current wal_level setting:            %s\n"),
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 883,897 **** do_stop(void)
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 883,900 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connection has not been
! 		 * allowed yet, an online backup mode must not be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
***************
*** 971,985 **** do_restart(void)
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 974,991 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connection has not been
! 		 * allowed yet, an online backup mode must not be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 503,509 **** GuessControlValues(void)
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 503,509 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 637,642 **** RewriteControlFile(void)
--- 637,644 ----
  	ControlFile.minRecoveryPoint.xrecoff = 0;
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
  
  	/*
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 21,27 ****
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	921
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
--- 21,27 ----
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	922
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
***************
*** 138,143 **** typedef struct ControlFileData
--- 138,150 ----
  	 * record, to make sure the end-of-backup record corresponds the base
  	 * backup we're recovering from.
  	 *
+ 	 * backupEndPoint is the backup end location, if we are recovering from
+ 	 * an online backup which was taken from the server in recovery mode
+ 	 * and haven't reached the end of backup yet. It is initialized to
+ 	 * the minimum recovery point in pg_control which was backed up just
+ 	 * before pg_stop_backup(). It is reset to zero when the end of backup
+ 	 * is reached, and we mustn't start up before that.
+ 	 *
  	 * If backupEndRequired is true, we know for sure that we're restoring
  	 * from a backup, and must see a backup-end record before we can safely
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
***************
*** 146,151 **** typedef struct ControlFileData
--- 153,159 ----
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
+ 	XLogRecPtr	backupEndPoint;
  	bool		backupEndRequired;
  
  	/*
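
For readers following the read_backup_label() change above, here is a minimal stand-alone sketch of how the two optional label lines could be interpreted. It is not code from the patch: the helper name parse_optional_label_fields and the in-memory string input are assumptions made for illustration. Only the field names "BACKUP METHOD" and "SYSTEM STATUS" and the compared values "streamed" and "recovery" are taken from the diff.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): scan the text of a
 * backup_label line by line and report the two optional fields.
 * A missing line simply leaves its flag false, mirroring the
 * "don't error out if it's missing" comment in the diff.
 */
void
parse_optional_label_fields(const char *label,
                            bool *backupEndRequired,
                            bool *backupDuringRecovery)
{
    const char *line = label;
    char        value[20];

    *backupEndRequired = false;
    *backupDuringRecovery = false;

    while (line != NULL && *line != '\0')
    {
        /* sscanf returns 1 only if the literal prefix matched this line */
        if (sscanf(line, "BACKUP METHOD: %19s", value) == 1)
        {
            if (strcmp(value, "streamed") == 0)
                *backupEndRequired = true;
        }
        else if (sscanf(line, "SYSTEM STATUS: %19s", value) == 1)
        {
            if (strcmp(value, "recovery") == 0)
                *backupDuringRecovery = true;
        }

        /* advance to the start of the next line, if any */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
}
```

Unlike the patch, which relies on the two lines appearing in a fixed order at the current file position, this sketch accepts them anywhere in the text; that difference is a simplification for the example, not a suggested change.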

#25

Magnus Hagander

magnus@hagander.net

over 14 years ago

In reply to: Fujii Masao (#23)

Re: Online base backup from the hot-standby

On Thu, Sep 22, 2011 at 14:13, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Sep 21, 2011 at 5:34 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Wed, Sep 21, 2011 at 08:23, Fujii Masao <masao.fujii@gmail.com> wrote:

On Wed, Sep 21, 2011 at 2:13 PM, Magnus Hagander <magnus@hagander.net> wrote:

Presumably pg_start_backup() will check this. And we'll somehow track
this before pg_stop_backup() as well? (For evil things such as the user
changing FPW from on to off and then back to on again during a backup,
which will make it look correct at both start and stop but incorrect in
the middle, pg_stop_backup() needs to fail as well.)

Right. As I suggested upthread, to address that problem, we need to log
the change of FPW on the master, and then we need to check whether
such a WAL is replayed on the standby during the backup. If it's done,
pg_stop_backup() should emit an error.

I somehow missed this thread completely, so I didn't catch your
previous comments - oops, sorry. The important point is that we
need to track whether this happens even if FPW has since been reset
to a valid value. So we can't just check the state of the variable at
the beginning and at the end.

Right. Let me explain again what I'm thinking.

When FPW is changed, the master always writes the WAL record
which contains the current value of FPW. This means that the standby
can track all changes of FPW by reading WAL records.

The standby has two flags: One indicates whether FPW has always
been TRUE since last restartpoint. Another indicates whether FPW
has always been TRUE since last pg_start_backup(). The standby
can maintain those flags by reading WAL records streamed from
the master.

If the former flag indicates FALSE (i.e., the WAL records which
the standby has replayed since last restartpoint might not contain
required FPW), pg_start_backup() fails. If the latter flag indicates
FALSE (i.e., the WAL records which the standby has replayed
during the backup might not contain required FPW),
pg_stop_backup() fails.

If I'm not missing something, this approach can address the problem
which you're concerned about.

Yeah, it sounds safe to me.
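
The two-flag scheme quoted above could be sketched roughly as follows. This is only an illustration under stated assumptions: the struct FpwTracker, its field names, and all function names are hypothetical and do not come from any posted patch, and the WAL record that carries the FPW value is represented by a plain function call.

```c
#include <stdbool.h>

/* Hypothetical standby-side state; all names here are illustrative. */
typedef struct FpwTracker
{
    bool fpw_current;             /* last FPW value seen in WAL */
    bool fpw_since_restartpoint;  /* FPW always on since last restartpoint? */
    bool fpw_since_backup_start;  /* FPW always on since pg_start_backup()? */
    bool backup_in_progress;
} FpwTracker;

void
fpw_tracker_init(FpwTracker *t, bool fpw_on)
{
    t->fpw_current = fpw_on;
    t->fpw_since_restartpoint = fpw_on;
    t->fpw_since_backup_start = false;
    t->backup_in_progress = false;
}

/* Replay of the (assumed) WAL record that carries the master's new FPW value. */
void
fpw_replay_change(FpwTracker *t, bool fpw_on)
{
    t->fpw_current = fpw_on;
    if (!fpw_on)
    {
        /* Once FPW was off, both intervals are tainted until they restart. */
        t->fpw_since_restartpoint = false;
        t->fpw_since_backup_start = false;
    }
}

/* A restartpoint begins a new interval, seeded with the current FPW value. */
void
fpw_restartpoint(FpwTracker *t)
{
    t->fpw_since_restartpoint = t->fpw_current;
}

/* Returns false where pg_start_backup() would fail. */
bool
fpw_start_backup(FpwTracker *t)
{
    if (!t->fpw_since_restartpoint)
        return false;
    t->fpw_since_backup_start = true;
    t->backup_in_progress = true;
    return true;
}

/* Returns false where pg_stop_backup() would fail. */
bool
fpw_stop_backup(FpwTracker *t)
{
    bool ok = t->backup_in_progress && t->fpw_since_backup_start;

    t->backup_in_progress = false;
    return ok;
}
```

With this shape, an off-then-on toggle during the backup leaves fpw_current true but fpw_since_backup_start false, so the stop call fails even though the value looks correct at both ends.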

Would it make sense for pg_start_backup() to have the ability to wait
for the next restartpoint in a case like this, if we know that FPW has
been set? Instead of failing? Or maybe that's just overcomplicating
things when trying to be user-friendly.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#26

Steve Singer

ssinger_pg@sympatico.ca

over 14 years ago

In reply to: Fujii Masao (#24)

Re: Online base backup from the hot-standby

On 11-09-22 09:24 AM, Fujii Masao wrote:

On Wed, Sep 21, 2011 at 11:50 AM, Fujii Masao<masao.fujii@gmail.com> wrote:

2011/9/13 Jun Ishiduka<ishizuka.jun@po.ntts.co.jp>:

Update patch.

Changes:
* set 'on' full_page_writes by user (in document)
* read "FROM: XX" in backup_label (in xlog.c)
* check status when pg_stop_backup is executed (in xlog.c)

Thanks for updating the patch.

Before reviewing the patch, to encourage people to comment and
review the patch, I explain what this patch provides:

Attached is the updated version of the patch. I refactored the code, fixed
some bugs, added lots of source code comments, improved the document,
but didn't change the basic design. Please check this patch, and let's use
this patch as the base if you agree with that.

I have looked at both Jun's patch from Sept 13 and Fujii's updates to
the patch. I agree that Fujii's updated version should be used as the
basis for changes going forward. My comments below refer to that
version (unless otherwise noted).

In backup.sgml, for the new section titled "Making a Base Backup during
Recovery", I would prefer to see some mention in the title that this
procedure is for standby servers, i.e. "Making a Base Backup from a
Standby Database". Users who have set up a hot-standby database should be
familiar with the 'standby' terminology. I agree that the "during
recovery" description is technically correct, but I'm not sure someone
who is looking through the manual for instructions on making a base
backup from their standby will realize this is the section they should read.

Around line 969 where you give an example of copying the control file I
would be a bit clearer that this is an example command. Ie (Copy the
pg_control file from the cluster directory to the global sub-directory
of the backup. For example "cp $PGDATA/global/pg_control
/mnt/server/backupdir/global")

Testing Notes
-----------------------------

I created a standby server from a base backup of another standby server.
On this new standby server I then

1. Ran pg_start_backup('3'); and left the psql connection open
2. touch /tmp/3 -- my trigger_file

ssinger@ssinger-laptop:/usr/local/pgsql92git/bin$ LOG: trigger file
found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000006" from archive
LOG: record with zero length at 0/60002F0
LOG: restored log file "000000010000000000000006" from archive
LOG: redo done at 0/6000298
LOG: restored log file "000000010000000000000006" from archive
PANIC: record with zero length at 0/6000298
LOG: startup process (PID 19011) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back
the current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.

The new postmaster (the one trying to be promoted) dies. This is
somewhat repeatable.

----

If a base backup is in progress on a recovery database and that recovery
database is promoted to master, then following the promotion (if you
don't restart the postmaster) I see
select pg_stop_backup();
ERROR: database system status mismatches between pg_start_backup() and
pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster
leaves recovery mode I think it should abort an existing base backup so
pg_stop_backup() will say no backup in progress, or give an error
message on pg_stop_backup() saying that the base backup won't be
usable. The above error doesn't really tell the user why there is a
mismatch.

---------

In my testing, a few times I got into a situation where a standby server
coming from a recovery target took a while to finish recovery (this is
on a database with no activity). Then when I tried promoting that
server to master I got

LOG: trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000009" from archive
LOG: restored log file "000000010000000000000009" from archive
LOG: redo done at 0/90000E8
LOG: restored log file "000000010000000000000009" from archive
PANIC: unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG: startup process (PID 1804) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my
testing isn't script based. A standby server that 'looks' okay but can't
actually be promoted is dangerous.

This version of the patch (I was testing the Sept 22nd version) seems
less stable than how I remember the version from the July CF. Maybe I'm
just testing it harder or maybe something has been broken.

In the current patch, there is no safeguard for preventing users from
taking backup during recovery when FPW is disabled. This is unsafe.
Are you planning to implement such a safeguard?

I agree with Fujii that we need a way (on the recovery machine) to
detect if the master doesn't have FPW on. The ideas up-thread on how to
do this sound good.

Regards,

#27

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Josh Berkus (#22)

Re: Remastering using streaming only replication?

On Thu, Sep 22, 2011 at 1:52 AM, Josh Berkus <josh@agliodbs.com> wrote:

Fujii,

I haven't really been following your latest patches about taking backups
from the standby and cascading replication, but I wanted to see if it
fulfills another TODO: the ability to "remaster" (that is, designate the
"lead standby" as the new master) without needing to copy WAL files.

Sorry, I could not follow you. I believe that we can "remaster" even in 9.1.
When the master crashes, we can choose the "lead standby" by comparing
each standby's replay location, and can promote it with pg_ctl promote.

What "remaster" feature are you expecting we should develop in 9.2?
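
The comparison of replay locations mentioned above can be illustrated with a small sketch. It assumes the locations have already been collected from each standby as text in the "%X/%X" form (for example from pg_last_xlog_replay_location()); the function names parse_lsn and pick_lead_standby are hypothetical and exist only for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Parse a WAL location in "%X/%X" text form into one 64-bit value.
 * Returns 0 on success, -1 if the text does not have that form.
 */
int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/*
 * Return the index of the standby with the greatest replay location,
 * or -1 if no entry could be parsed. Ties keep the earliest entry.
 */
int
pick_lead_standby(const char *const replay_locations[], size_t n)
{
    int      best = -1;
    uint64_t best_lsn = 0;
    size_t   i;

    for (i = 0; i < n; i++)
    {
        uint64_t lsn;

        if (parse_lsn(replay_locations[i], &lsn) != 0)
            continue;           /* skip unparsable input */
        if (best < 0 || lsn > best_lsn)
        {
            best = (int) i;
            best_lsn = lsn;
        }
    }
    return best;
}
```

The standby at the returned index would then be the candidate for pg_ctl promote. How the locations are gathered when the master is down is outside the scope of this sketch.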

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#28

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Fujii Masao (#24)

Re: Online base backup from the hot-standby

Attached is the updated version of the patch. I refactored the code, fixed
some bugs, added lots of source code comments, improved the document,
but didn't change the basic design. Please check this patch, and let's use
this patch as the base if you agree with that.

Thanks for updating the patch.
Yes, I agree.

In the current patch, there is no safeguard for preventing users from
taking backup during recovery when FPW is disabled. This is unsafe.
Are you planning to implement such a safeguard?

Yes.
I would like to follow the approach in Fujii's comments below.

-------------------------------------------------------------------------

Right. Let me explain again what I'm thinking.

When FPW is changed, the master always writes the WAL record
which contains the current value of FPW. This means that the standby
can track all changes of FPW by reading WAL records.

The standby has two flags: One indicates whether FPW has always
been TRUE since last restartpoint. Another indicates whether FPW
has always been TRUE since last pg_start_backup(). The standby
can maintain those flags by reading WAL records streamed from
the master.

If the former flag indicates FALSE (i.e., the WAL records which
the standby has replayed since last restartpoint might not contain
required FPW), pg_start_backup() fails. If the latter flag indicates
FALSE (i.e., the WAL records which the standby has replayed
during the backup might not contain required FPW),
pg_stop_backup() fails.

If I'm not missing something, this approach can address the problem
which you're concerned about.

-------------------------------------------------------------------------

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#29

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Magnus Hagander (#25)

Re: Online base backup from the hot-standby

On Fri, Sep 23, 2011 at 12:44 AM, Magnus Hagander <magnus@hagander.net> wrote:

Would it make sense for pg_start_backup() to have the ability to wait
for the next restartpoint in a case like this, if we know that FPW has
been set? Instead of failing? Or maybe that's just overcomplicating
things when trying to be user-friendly.

I don't think that it's worth adding code for such a feature, because I
believe there are not many users who would enable FPW on the fly for a
standby-only backup and use such a feature.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#30

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Steve Singer (#26)

Re: Online base backup from the hot-standby

On Mon, Sep 26, 2011 at 11:39 AM, Steve Singer <ssinger_pg@sympatico.ca> wrote:

I have looked at both Jun's patch from Sept 13 and Fujii's updates to the
patch. I agree that Fujii's updated version should be used as the basis for
changes going forward. My comments below refer to that version (unless
otherwise noted).

Thanks for the tests and comments!

In backup.sgml, for the new section titled "Making a Base Backup during
Recovery", I would prefer to see some mention in the title that this
procedure is for standby servers, i.e. "Making a Base Backup from a
Standby Database". Users who have set up a hot-standby database should be
familiar with the 'standby' terminology. I agree that the "during
recovery" description is technically correct, but I'm not sure someone who
is looking through the manual for instructions on making a base backup
from their standby will realize this is the section they should read.

I used the term "recovery" rather than "standby" because we can take
a backup even from a server that is in normal archive recovery mode
rather than standby mode. But there are not many users who take a backup
during normal archive recovery, so I agree that the term "standby" is the
better one to use in the document. Will change.

Around line 969 where you give an example of copying the control file I
would be a bit clearer that this is an example command. Ie (Copy the
pg_control file from the cluster directory to the global sub-directory of
the backup. For example "cp $PGDATA/global/pg_control
/mnt/server/backupdir/global")

Looks better. Will change.

Testing Notes
-----------------------------

I created a standby server from a base backup of another standby server. On
this new standby server I then

1. Ran pg_start_backup('3'); and left the psql connection open
2. touch /tmp/3 -- my trigger_file

ssinger@ssinger-laptop:/usr/local/pgsql92git/bin$ LOG: trigger file found:
/tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000006" from archive
LOG: record with zero length at 0/60002F0
LOG: restored log file "000000010000000000000006" from archive
LOG: redo done at 0/6000298
LOG: restored log file "000000010000000000000006" from archive
PANIC: record with zero length at 0/6000298
LOG: startup process (PID 19011) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and
repeat your command.

The new postmaster (the one trying to be promoted) dies. This is somewhat
repeatable.

Looks weird. The WAL record starting at 0/6000298 was read successfully,
but the re-fetch of the same record then fails at the end of recovery.
One possible cause is corruption of the archived WAL file. What
restore_command on the standby and archive_command on the master
are you using? Could you confirm that there is no chance of archived
WAL files being overwritten in your environment?

I tried to reproduce this problem several times, but I could not. Could
you provide the test case which reproduces the problem?

If a base backup is in progress on a recovery database and that recovery
database is promoted to master, then following the promotion (if you
don't restart the postmaster) I see
select pg_stop_backup();
ERROR: database system status mismatches between pg_start_backup() and
pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster leaves
recovery mode I think it should abort an existing base backup so
pg_stop_backup() will say no backup in progress,

I don't think that it's a good idea to cancel the backup when promoting
the standby. If we did so, we would need to handle correctly the case
where the cancellation of the backup and pg_start_backup/pg_stop_backup
are performed at the same time. We could simply do that by protecting
those whole operations, including pg_start_backup's checkpoint, with an
lwlock. But I don't think that it's worth introducing a new lwlock only
for that, and it's not good to hold an lwlock through a time-consuming
checkpoint operation. Of course we could avoid such an lwlock, but that
would require more complicated code.

or give an error message on
pg_stop_backup() saying that the base backup won't be usable. The above
error doesn't really tell the user why there is a mismatch.

What about the following error message?

ERROR: pg_stop_backup() was executed during normal processing though
pg_start_backup() was executed during recovery
HINT: The database backup will not be usable.

Or do you have a better idea?
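
As a rough sketch of where such a message would come from, the check could look something like the following. The type StopBackupCheck and the function check_stop_backup are hypothetical names for this illustration; the real code would raise the error with ereport() instead of returning strings, and only the message and hint text are taken from the proposal above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result type used only in this sketch. */
typedef struct StopBackupCheck
{
    bool        ok;
    const char *errmsg;     /* NULL when ok */
    const char *errhint;    /* NULL when ok */
} StopBackupCheck;

/*
 * started_during_recovery: whether pg_start_backup() ran during recovery.
 * in_recovery_now: whether the server is still in recovery at stop time.
 */
StopBackupCheck
check_stop_backup(bool started_during_recovery, bool in_recovery_now)
{
    StopBackupCheck result = {true, NULL, NULL};

    if (started_during_recovery && !in_recovery_now)
    {
        /* The standby was promoted while the backup was in progress. */
        result.ok = false;
        result.errmsg =
            "pg_stop_backup() was executed during normal processing "
            "though pg_start_backup() was executed during recovery";
        result.errhint = "The database backup will not be usable.";
    }
    return result;
}
```

The sketch only covers the promotion case discussed here; it does not model any other condition under which pg_stop_backup() might fail.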

In my testing, a few times I got into a situation where a standby server
coming from a recovery target took a while to finish recovery (this is on a
database with no activity). Then when I tried promoting that server to
master I got

LOG: trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000009" from archive
LOG: restored log file "000000010000000000000009" from archive
LOG: redo done at 0/90000E8
LOG: restored log file "000000010000000000000009" from archive
PANIC: unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG: startup process (PID 1804) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my testing
isn't script based. A standby server that 'looks' okay but can't actually be
promoted is dangerous.

This looks like the same problem as the one above. Another weird point is
that the same archived WAL file is restored twice before redo is done.
I'm not sure why this happens... Could you provide a test case which
reproduces this problem? I will diagnose it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#31

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Fujii Masao (#30)

1 attachment(s)

Re: Online base backup from the hot-standby

On Tue, Sep 27, 2011 at 11:56 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

In backup.sgml, for the new section titled "Making a Base Backup during
Recovery", I would prefer to see some mention in the title that this
procedure is for standby servers, i.e. "Making a Base Backup from a
Standby Database". Users who have set up a hot-standby database should be
familiar with the 'standby' terminology. I agree that the "during
recovery" description is technically correct, but I'm not sure someone who
is looking through the manual for instructions on making a base backup
from their standby will realize this is the section they should read.

I used the term "recovery" rather than "standby" because we can take
a backup even from a server that is in normal archive recovery mode
rather than standby mode. But there are not many users who take a backup
during normal archive recovery, so I agree that the term "standby" is the
better one to use in the document. Will change.

Done.

Around line 969 where you give an example of copying the control file I
would be a bit clearer that this is an example command. Ie (Copy the
pg_control file from the cluster directory to the global sub-directory of
the backup. For example "cp $PGDATA/global/pg_control
/mnt/server/backupdir/global")

Looks better. Will change.

Done.

or give an error message on
pg_stop_backup() saying that the base backup won't be usable. The above
error doesn't really tell the user why there is a mismatch.

What about the following error message?

ERROR: pg_stop_backup() was executed during normal processing though
pg_start_backup() was executed during recovery
HINT: The database backup will not be usable.

Done. I attached the new version of the patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

standby_online_backup_09_fujii.patchtext/x-patch; charset=US-ASCII; name=standby_online_backup_09_fujii.patchDownload

*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 935,940 **** SELECT pg_stop_backup();
--- 935,1000 ----
     </para>
    </sect2>
  
+   <sect2 id="backup-from-standby">
+    <title>Making a Base Backup from Standby Database</title>
+ 
+    <para>
+     It's possible to make a base backup during recovery, which allows a user
+     to take a base backup from the standby to offload the expense of
+     periodic backups from the master. The procedure is similar to that
+     during normal running. All these steps must be performed on the standby.
+   <orderedlist>
+    <listitem>
+     <para>
+      Ensure that hot standby is enabled (see <xref linkend="hot-standby">
+      for more information).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Connect to the database as a superuser and execute <function>pg_start_backup</>.
+      This performs a restartpoint if at least one checkpoint record has
+      been replayed since the last restartpoint.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Perform a file system backup.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Copy the pg_control file from the cluster directory to the global
+      sub-directory of the backup. For example:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Again connect to the database as a superuser, and execute
+      <function>pg_stop_backup</>. This terminates the backup mode but, unlike
+      during normal processing, does not switch to the next WAL segment, create
+      a backup history file, or wait for all required WAL segments to be
+      archived.
+     </para>
+    </listitem>
+   </orderedlist>
+    </para>
+ 
+    <para>
+     You cannot use the <application>pg_basebackup</> tool to take the backup
+     from the standby.
+    </para>
+    <para>
+     It's not possible to make a base backup from the server in recovery mode
+     when reading WAL written during a period when <varname>full_page_writes</>
+     was disabled. If you want to take a base backup from the standby,
+     <varname>full_page_writes</> must be set to true on the master.
+    </para>
+   </sect2>
+ 
    <sect2 id="backup-pitr-recovery">
     <title>Recovering Using a Continuous Archive Backup</title>
  
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1680,1685 **** SET ENABLE_SEQSCAN TO OFF;
--- 1680,1693 ----
         </para>
  
         <para>
+         WAL written while <varname>full_page_writes</> is disabled does not
+         contain enough information to make a base backup during recovery
+         (see <xref linkend="backup-from-standby">),
+         so <varname>full_page_writes</> must be enabled on the master
+         to take a backup from the standby.
+        </para>
+ 
+        <para>
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
          The default is <literal>on</>.
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 14014,14020 **** SELECT set_config('log_statement_stats', 'off', false);
     <para>
      The functions shown in <xref
      linkend="functions-admin-backup-table"> assist in making on-line backups.
!     These functions cannot be executed during recovery.
     </para>
  
     <table id="functions-admin-backup-table">
--- 14014,14021 ----
     <para>
      The functions shown in <xref
      linkend="functions-admin-backup-table"> assist in making on-line backups.
!     Except for <function>pg_start_backup</> and <function>pg_stop_backup</>,
!     these functions cannot be executed during recovery.
     </para>
  
     <table id="functions-admin-backup-table">
***************
*** 14094,14100 **** SELECT set_config('log_statement_stats', 'off', false);
      database cluster's data directory, performs a checkpoint,
      and then returns the backup's starting transaction log location as text.
      The user can ignore this result value, but it is
!     provided in case it is useful.
  <programlisting>
  postgres=# select pg_start_backup('label_goes_here');
   pg_start_backup
--- 14095,14103 ----
      database cluster's data directory, performs a checkpoint,
      and then returns the backup's starting transaction log location as text.
      The user can ignore this result value, but it is
!     provided in case it is useful. If <function>pg_start_backup</> is
!     executed during recovery, it performs a restartpoint rather than
!     writing a new checkpoint.
  <programlisting>
  postgres=# select pg_start_backup('label_goes_here');
   pg_start_backup
***************
*** 14122,14127 **** postgres=# select pg_start_backup('label_goes_here');
--- 14125,14137 ----
     </para>
  
     <para>
+     If <function>pg_stop_backup</> is executed during recovery, it just
+     removes the label file; it does not create a backup history file or wait for
+     the ending transaction log file to be archived. The return value is equal to
+     or greater than the exact ending transaction log location of the backup.
+    </para>
+ 
+    <para>
      <function>pg_switch_xlog</> moves to the next transaction log file, allowing the
      current file to be archived (assuming you are using continuous archiving).
      The return value is the ending transaction log location + 1 within the just-completed transaction log file.
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 664,670 **** static void xlog_outrec(StringInfo buf, XLogRecord *record);
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 664,670 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, bool *backupDuringRecovery);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 6023,6028 **** StartupXLOG(void)
--- 6023,6030 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	bool		backupDuringRecovery = false;
+ 	DBState	save_state;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6156,6162 **** StartupXLOG(void)
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6158,6165 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
! 						  &backupDuringRecovery))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6312,6317 **** StartupXLOG(void)
--- 6315,6321 ----
  		 * pg_control with any minimum recovery stop point obtained from a
  		 * backup history file.
  		 */
+ 		save_state = ControlFile->state;
  		if (InArchiveRecovery)
  			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
  		else
***************
*** 6332,6343 **** StartupXLOG(void)
  		}
  
  		/*
! 		 * set backupStartPoint if we're starting recovery from a base backup
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
--- 6336,6369 ----
  		}
  
  		/*
! 		 * Set backupStartPoint if we're starting recovery from a base backup.
! 		 *
! 		 * Set backupEndPoint if we're starting recovery from a base backup
! 		 * which was taken from the server in recovery mode. We confirm
! 		 * that minRecoveryPoint can be used as the backup end location by
! 		 * checking whether the database system status in pg_control indicates
! 		 * DB_IN_ARCHIVE_RECOVERY. If minRecoveryPoint is not available,
! 		 * there is no way to know the backup end location, so we cannot
! 		 * advance recovery any more. In this case, we have to cancel recovery
! 		 * before changing the database system status in pg_control to
! 		 * DB_IN_ARCHIVE_RECOVERY because otherwise subsequent
! 		 * restarted recovery would go through this check wrongly.
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
+ 
+ 			if (backupDuringRecovery)
+ 			{
+ 				if (save_state != DB_IN_ARCHIVE_RECOVERY)
+ 					ereport(FATAL,
+ 							(errmsg("database system status mismatches between "
+ 									"pg_control and backup_label"),
+ 							 errhint("This means that the backup is corrupted and you will "
+ 									 "have to use another backup for recovery.")));
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 			}
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
***************
*** 6617,6622 **** StartupXLOG(void)
--- 6643,6670 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 					XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 				{
+ 					/*
+ 					 * We have reached the end of base backup, the point indicated
+ 					 * by the minimum recovery point in the pg_control which was
+ 					 * backed up just before pg_stop_backup().
+ 					 * The data on disk is now consistent. Reset backupStartPoint
+ 					 * and backupEndPoint.
+ 					 */
+ 					elog(DEBUG1, "end of backup reached");
+ 
+ 					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 					MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 					MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 					ControlFile->backupEndRequired = false;
+ 					UpdateControlFile();
+ 
+ 					LWLockRelease(ControlFileLock);
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 8417,8423 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8465,8472 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8880,8885 **** XLogRecPtr
--- 8929,8935 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8891,8908 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
! 
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 8941,8960 ----
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
+ 	recovery_in_progress = RecoveryInProgress();
+ 
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level, because the fact that
! 	 * we are now executing pg_start_backup() during recovery means that
! 	 * wal_level is set to hot_standby on the master, i.e., WAL level is sufficient
! 	 * for making an online backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8924,8931 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 8976,8988 ----
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
+ 	 *
+ 	 * During recovery, we skip forcing XLOG file switch, which means that
+ 	 * the backup taken during recovery is not available for the special recovery
+ 	 * case described above.
  	 */
! 	if (!recovery_in_progress)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8941,8946 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 8998,9006 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
+ 	 * Note that forcePageWrites has no effect during an online backup from
+ 	 * the server in recovery mode.
+ 	 *
  	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
***************
*** 8970,8980 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
--- 9030,9048 ----
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
+ 			 * During recovery, establish a restartpoint if possible. We use the last
+ 			 * restartpoint as the backup starting checkpoint. This means that two
+ 			 * successive backup runs can have the same checkpoint position.
+ 			 *
+ 			 * Since the fact that we are executing pg_start_backup() during
+ 			 * recovery means that bgwriter is running, we can use
+ 			 * RequestCheckpoint() to establish a restartpoint.
+ 			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
***************
*** 9002,9007 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9070,9081 ----
  			 * for each backup instead of forcing another checkpoint, but
  			 * taking a checkpoint right after another is not that expensive
  			 * either because only few buffers have been dirtied yet.
+ 			 *
+ 			 * During recovery, since we don't use the end-of-backup WAL
+ 			 * record and don't write the backup history file, the starting WAL
+ 			 * location doesn't need to be unique. This means that two base
+ 			 * backups started at the same time might use the same checkpoint
+ 			 * as starting locations.
  			 */
  			LWLockAcquire(WALInsertLock, LW_SHARED);
  			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
***************
*** 9010,9015 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9084,9091 ----
  				gotUniqueStartpoint = true;
  			}
  			LWLockRelease(WALInsertLock);
+ 			if (recovery_in_progress)
+ 				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
  		XLByteToSeg(startpoint, _logId, _logSeg);
***************
*** 9031,9036 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9107,9114 ----
  						 checkpointloc.xlogid, checkpointloc.xrecoff);
  		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  						 exclusive ? "pg_start_backup" : "streamed");
+ 		appendStringInfo(&labelfbuf, "SYSTEM STATUS: %s\n",
+ 						 recovery_in_progress ? "recovery" : "in production");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
***************
*** 9123,9128 **** pg_start_backup_callback(int code, Datum arg)
--- 9201,9208 ----
   * history file at the beginning of archive recovery, but we now use the WAL
   * record for that and the file is for informational and debug purposes only.
   *
+  * During recovery, we only remove the backup label file.
+  *
   * Note: different from CancelBackup which just cancels online backup mode.
   */
  Datum
***************
*** 9149,9154 **** XLogRecPtr
--- 9229,9235 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9159,9164 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9240,9246 ----
  	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
+ 	char		systemstatus[20];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
***************
*** 9168,9186 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
! 
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9250,9271 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	char	   *ptr;
+ 
+ 	recovery_in_progress = RecoveryInProgress();
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level, because the fact that
! 	 * we are now executing pg_stop_backup() means that wal_level is set to
! 	 * hot_standby on the master, i.e., WAL level is sufficient for making an online
! 	 * backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9271,9276 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9356,9414 ----
  	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
  
  	/*
+ 	 * Parse the SYSTEM STATUS line, and check that database system
+ 	 * status matches between pg_start_backup() and pg_stop_backup().
+ 	 */
+ 	ptr = strstr(remaining, "SYSTEM STATUS:");
+ 	if (ptr == NULL || sscanf(ptr, "SYSTEM STATUS: %19s\n", systemstatus) != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	if (strcmp(systemstatus, "recovery") == 0 && !recovery_in_progress)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("pg_stop_backup() was executed during normal processing "
+ 						"though pg_start_backup() was executed during recovery"),
+ 				 errhint("The database backup will not be usable.")));
+ 
+ 	/*
+ 	 * During recovery, we don't write an end-of-backup record. We can
+ 	 * assume that pg_control was backed up just before pg_stop_backup()
+ 	 * and that its minimum recovery point can be used as the backup end
+ 	 * location. Even without an end-of-backup record, we can thus check
+ 	 * correctly whether we've reached the end of backup when starting
+ 	 * recovery from this backup.
+ 	 *
+ 	 * We don't force a switch to a new WAL file or wait for all the required
+ 	 * files to be archived. This is okay if we use the backup to start
+ 	 * the standby. But, if it's for an archive recovery, to ensure all the
+ 	 * required files are available, a user should wait for them to be archived,
+ 	 * or include them in the backup after pg_stop_backup().
+ 	 *
+ 	 * We return the current minimum recovery point as the backup end
+ 	 * location. Note that it would be bigger than the exact backup end
+ 	 * location if the minimum recovery point has been updated since the
+ 	 * backup of pg_control. The return value of pg_stop_backup() is often
+ 	 * used by a user to calculate the required files. Returning an approximate
+ 	 * location is harmless for that use because it's guaranteed not to be
+ 	 * smaller than the exact backup end location.
+ 	 *
+ 	 * XXX currently a backup history file is for informational and debug
+ 	 * purposes only. It's not essential for an online backup. Furthermore,
+ 	 * even if it's created, it will not be archived during recovery because
+ 	 * an archiver is not invoked. So it doesn't seem worthwhile to write
+ 	 * a backup history file during recovery.
+ 	 */
+ 	if (recovery_in_progress)
+ 	{
+ 		LWLockAcquire(ControlFileLock, LW_SHARED);
+ 		stoppoint = ControlFile->minRecoveryPoint;
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		return stoppoint;
+ 	}
+ 
+ 	/*
  	 * Write the backup-end xlog record
  	 */
  	rdata.data = (char *) (&startpoint);
***************
*** 9787,9804 **** pg_xlogfile_name(PG_FUNCTION_ARGS)
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
  	/*
  	 * See if label file is present
--- 9925,9946 ----
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE. If this backup_label
!  * was created during recovery, *backupDuringRecovery is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
! 				  bool *backupDuringRecovery)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		systemstatus[20];
  
  	*backupEndRequired = false;
+ 	*backupDuringRecovery = false;
  
  	/*
  	 * See if label file is present
***************
*** 9832,9847 **** read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD line is new in 9.1. We can't restore from an older backup
! 	 * anyway, but since the information on it is not strictly required, don't
! 	 * error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
--- 9974,9995 ----
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD and SYSTEM STATUS lines are new in 9.2. We can't
! 	 * restore from an older backup anyway, but since the information on it
! 	 * is not strictly required, don't error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
+ 	if (fscanf(lfp, "SYSTEM STATUS: %19s\n", systemstatus) == 1)
+ 	{
+ 		if (strcmp(systemstatus, "recovery") == 0)
+ 			*backupDuringRecovery = true;
+ 	}
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 287,292 **** typedef enum
--- 287,294 ----
  static PMState pmState = PM_INIT;
  
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ static bool OnlineBackupAllowed = false;		/* T if we've reached PM_RUN or
+ 												 * PM_HOT_STANDBY */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
***************
*** 2105,2122 **** pmdie(SIGNAL_ARGS)
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 
! 				/*
! 				 * If we're in recovery, we can't kill the startup process
! 				 * right away, because at present doing so does not release
! 				 * its locks.  We might want to change this in a future
! 				 * release.  For the time being, the PM_WAIT_READONLY state
! 				 * indicates that we're waiting for the regular (read only)
! 				 * backends to die off; once they do, we'll kill the startup
! 				 * and walreceiver processes.
! 				 */
! 				pmState = (pmState == PM_RUN) ?
! 					PM_WAIT_BACKUP : PM_WAIT_READONLY;
  			}
  
  			/*
--- 2107,2113 ----
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 				pmState = PM_WAIT_BACKUP;
  			}
  
  			/*
***************
*** 2299,2304 **** reaper(SIGNAL_ARGS)
--- 2290,2296 ----
  			 */
  			FatalError = false;
  			ReachedNormalRunning = true;
+ 			OnlineBackupAllowed = true;
  			pmState = PM_RUN;
  
  			/*
***************
*** 2823,2831 **** PostmasterStateMachine(void)
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 			pmState = PM_WAIT_BACKENDS;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
--- 2815,2831 ----
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
+ 		 *
+ 		 * If we're in recovery, we can't kill the startup process right away,
+ 		 * because at present doing so does not release its locks.  We might
+ 		 * want to change this in a future release.  For the time being,
+ 		 * the PM_WAIT_READONLY state indicates that we're waiting for
+ 		 * the regular (read only) backends to die off; once they do,
+ 		 * we'll kill the startup and walreceiver processes.
  		 */
  		if (!BackupInProgress())
! 			pmState = ReachedNormalRunning ?
! 				PM_WAIT_BACKENDS : PM_WAIT_READONLY;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
***************
*** 2994,3006 **** PostmasterStateMachine(void)
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
--- 2994,3006 ----
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running and hot standby, it only makes sense to do this
! 			 * if we reached normal running or hot standby. If we have not
! 			 * reached a consistent recovery state yet, the backup file is
! 			 * one we're recovering *from*, and we must keep it around
! 			 * so that recovery restarts from the right place.
  			 */
! 			if (OnlineBackupAllowed)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
***************
*** 4157,4162 **** sigusr1_handler(SIGNAL_ARGS)
--- 4157,4163 ----
  		ereport(LOG,
  		(errmsg("database system is ready to accept read only connections")));
  
+ 		OnlineBackupAllowed = true;
  		pmState = PM_HOT_STANDBY;
  	}
  
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 232,237 **** main(int argc, char *argv[])
--- 232,240 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
  	printf(_("Current wal_level setting:            %s\n"),
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 883,897 **** do_stop(void)
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 883,900 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connections have not been
! 		 * allowed yet, online backup mode cannot be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
***************
*** 971,985 **** do_restart(void)
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 974,991 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connections have not been
! 		 * allowed yet, online backup mode cannot be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 503,509 **** GuessControlValues(void)
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 503,509 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 637,642 **** RewriteControlFile(void)
--- 637,644 ----
  	ControlFile.minRecoveryPoint.xrecoff = 0;
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
  
  	/*
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 21,27 ****
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	921
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
--- 21,27 ----
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	922
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
***************
*** 138,143 **** typedef struct ControlFileData
--- 138,150 ----
  	 * record, to make sure the end-of-backup record corresponds the base
  	 * backup we're recovering from.
  	 *
+ 	 * backupEndPoint is the backup end location, if we are recovering from
+ 	 * an online backup which was taken from the server in recovery mode
+ 	 * and haven't reached the end of backup yet. It is initialized to
+ 	 * the minimum recovery point in pg_control which was backed up just
+ 	 * before pg_stop_backup(). It is reset to zero when the end of backup
+ 	 * is reached, and we mustn't start up before that.
+ 	 *
  	 * If backupEndRequired is true, we know for sure that we're restoring
  	 * from a backup, and must see a backup-end record before we can safely
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
***************
*** 146,151 **** typedef struct ControlFileData
--- 153,159 ----
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
+ 	XLogRecPtr	backupEndPoint;
  	bool		backupEndRequired;
  
  	/*

#32

Steve Singer

ssinger_pg@sympatico.ca

over 14 years ago

In reply to: Fujii Masao (#30)

Re: Online base backup from the hot-standby

On 11-09-26 10:56 PM, Fujii Masao wrote:

Looks weird. Although the WAL record starting at 0/6000298 was read
successfully, the re-fetch of the same record fails at the end of recovery.
One possible cause is corruption of the archived WAL file. What
restore_command on the standby and archive_command on the master
are you using? Could you confirm that there is no chance of overwriting
archived WAL files in your environment?

I tried to reproduce this problem several times, but I could not. Could
you provide the test case which reproduces the problem?

This is the test procedure I'm trying today, I wasn't able to reproduce
the crash. What I was doing the other day was similar but I can't speak
to unintentional differences.

I have my master server
data
port=5439
wal_level=hot_standby
archive_mode=on
archive_command='cp -i %p /usr/local/pgsql92git/archive/%f'
hot_standby=on

I then run
select pg_start_backup('foo');
$ rm -r ../data2
$ cp -r ../data ../data2
$ rm ../data2/postmaster.pid
select pg_stop_backup();
I edit data2/postgresql.conf so
port=5438
I commented out archive_mode and archive_command (or at least today I did)
recovery.conf is

standby_mode='on'
primary_conninfo='host=127.0.0.1 port=5439 user=ssinger dbname=test'
restore_command='cp /usr/local/pgsql92git/archive/%f %p'

I then start up the second cluster. On it I run

select pg_start_backup('1');
$ rm -r ../data3
$ rm -r ../archive2
$ cp -r ../data2 ../data3
$ cp ../data2/global/pg_control ../data3/global

select pg_stop_backup();
I edit ../data2/postgresql.conf
port=5437
archive_mode=on
# (change requires restart)
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'

recovery.conf is

standby_mode='on'
primary_conninfo='host=127.0.0.1 port=5439 user=ssinger dbname=test'
restore_command='cp /usr/local/pgsql92git/archive/%f %p'
trigger_file='/tmp/3'

$ postgres -D ../data3

The first time I did this postgres came up quickly.

$ touch /tmp/3

worked fine.

I then stopped data3
$ rm -r ../data3
on data 2 I run
pg_start_backup('1')
$ cp -r ../data2 ../data3
$ cp ../data2/global/pg_control ../data3/global
select pg_stop_backup() # on data2
$ rm ../data3/postmaster.pid
vi ../data3/postgresql.conf # same changes as above for data3
vi ../data3/recovery.conf # same as above for data 3
postgres -D ../data3

This time I got
./postgres -D ../data3
LOG: database system was interrupted while in recovery at log time
2011-09-27 22:04:17 GMT
HINT: If this has occurred more than once some data might be corrupted
and you might need to choose an earlier recovery target.
LOG: entering standby mode
cp: cannot stat
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file
or directory
LOG: redo starts at 0/C000020
LOG: record with incorrect prev-link 0/9000058 at 0/C0000B0
cp: cannot stat
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file
or directory
LOG: streaming replication successfully connected to primary
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: consistent recovery state reached at 0/C0000E8
LOG: database system is ready to accept read only connections

To get the database to come up in read-only mode I manually issued
a checkpoint on the master (data). Shortly after the checkpoint command,
the data3 instance went into read-only mode.

then

touch /tmp/3

trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
cp: cannot stat
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file
or directory
LOG: record with incorrect prev-link 0/9000298 at 0/C0002F0
cp: cannot stat
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file
or directory
LOG: redo done at 0/C000298
cp: cannot stat
`/usr/local/pgsql92git/archive/00000001000000000000000C': No such file
or directory
cp: cannot stat `/usr/local/pgsql92git/archive/00000002.history': No
such file or directory
LOG: selected new timeline ID: 2
cp: cannot stat `/usr/local/pgsql92git/archive/00000001.history': No
such file or directory
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

It looks like data3 is still pulling files with the recovery command
after it sees the touch file (is this expected behaviour?)
$ grep archive ../data3/postgresql.conf
#wal_level = minimal # minimal, archive, or hot_standby
#archive_mode = off # allows archiving to be done
archive_mode=on
archive_command='cp -i %p /usr/local/pgsql92git/archive2/%f'

I have NOT been able to make postgres crash during a recovery (today).
It is *possible* that on some of my runs the other day I had skipped
changing the archive command on data3 to write to archive2 instead of
archive.

I have also today not been able to get it to attempt to restore the same
WAL file twice.

If a base backup is in progress on a recovery database and that recovery
database is promoted to master, then following the promotion (if you don't
restart the postmaster) I see
select pg_stop_backup();
ERROR: database system status mismatches between pg_start_backup() and
pg_stop_backup()

If you restart the postmaster this goes away. When the postmaster leaves
recovery mode I think it should abort an existing base backup so
pg_stop_backup() will say no backup in progress,

I don't think that it's a good idea to cancel the backup when promoting
the standby. If we do so, we need to correctly handle the case where the
cancel of the backup and pg_start_backup/pg_stop_backup are performed at
the same time. We could simply do that by protecting those whole operations,
including pg_start_backup's checkpoint, with an lwlock. But I don't think
that it's worth introducing a new lwlock only for that, and it's not good to
hold an lwlock through a time-consuming checkpoint operation. Of course we
could avoid such an lwlock, but that would require more complicated code.

or give an error message on
pg_stop_backup() saying that the base backup won't be usable. The above
error doesn't really tell the user why there is a mismatch.

What about the following error message?

ERROR: pg_stop_backup() was executed during normal processing though
pg_start_backup() was executed during recovery
HINT: The database backup will not be usable.

Or do you have a better idea?

I like that error message better. It tells me what is going on versus
complaining about a state mismatch.

In my testing a few times I got into a situation where a standby server
coming from a recovery target took a while to finish recovery (this is on a
database with no activity). Then when I tried promoting that server to
master I got

LOG: trigger file found: /tmp/3
FATAL: terminating walreceiver process due to administrator command
LOG: restored log file "000000010000000000000009" from archive
LOG: restored log file "000000010000000000000009" from archive
LOG: redo done at 0/90000E8
LOG: restored log file "000000010000000000000009" from archive
PANIC: unexpected pageaddr 0/6000000 in log file 0, segment 9, offset 0
LOG: startup process (PID 1804) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

It is *possible* I mixed up the order of a step somewhere since my testing
isn't script based. A standby server that 'looks' okay but can't actually be
promoted is dangerous.

This looks like the same problem as the above. Another weird point is that
the same archived WAL file is restored twice before redo is done.
I'm not sure why this happens... Could you provide a test case which
reproduces this problem? I will diagnose it.

Regards,

#33

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Steve Singer (#32)

1 attachment(s)

Re: Online base backup from the hot-standby

On Wed, Sep 28, 2011 at 8:10 AM, Steve Singer <ssinger_pg@sympatico.ca> wrote:

This is the test procedure I'm trying today, I wasn't able to reproduce the
crash. What I was doing the other day was similar but I can't speak to
unintentional differences.

Thanks for the info! I tried your test case three times, but was not able to
reproduce the issue either.

BTW, I created a shell script (attached) which runs your test scenario and
used it for the test.

If the issue happens again, please feel free to share the information about
it. I will diagnose it.

It looks like data3 is still pulling files with the recovery command after
it sees the touch file (is this expected behaviour?)

Yes, that's expected behavior. After the trigger file is found, PostgreSQL
tries to replay all available WAL files, both in the pg_xlog directory and in
the archive. So, if there is an unreplayed archived WAL file at that time,
PostgreSQL tries to pull it by calling restore_command.

And, after WAL replay is done, PostgreSQL re-fetches the last replayed WAL
record in order to identify the end-of-replay location. So, if the last
replayed record is in an archived WAL file, that file is pulled by
restore_command again.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#34

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Fujii Masao (#31)

1 attachment(s)

Re: Online base backup from the hot-standby

I created a patch that handles full_page_writes (FPW).
It is based on Fujii's patch (version 9).

Manage the server's own FPW setting in shared memory (on master)
* The startup and walwriter processes update it. The startup process
initializes it after REDO. The walwriter updates it when it starts or
receives SIGHUP.

Insert a WAL record containing the current FPW value (on master)
* At the same time as each update, they insert a WAL record (named
XLOG_FPW_CHANGE). XLOG_FPW_CHANGE carries the changed FPW value.
* When a CHECKPOINT is created, the current FPW value is added to the
CHECKPOINT WAL record.

Manage the master's FPW in the startup process's local memory (on standby)
* It obtains the master's FPW value by reading XLOG_FPW_CHANGE during
REDO.

Checks in pg_start_backup/pg_stop_backup (on standby)
* They use these two values:
* the master's FPW at the latest CHECKPOINT
* the master's current FPW, as reported by XLOG_FPW_CHANGE

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base_01fpw.patchapplication/octet-stream; name=standby_online_backup_09base_01fpw.patchDownload

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-09 02:11:12.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* the startup or the walwriter is logged to its own FPW */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* latest LSN that has recovered a WAL which fpw is changed 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 564,569 ****
--- 570,578 ----
  /* Have we launched bgwriter during recovery? */
  static bool bgwriterLaunched = false;
  
+ /* */
+ static bool master_fpw;
+ 
  /*
   * Information logged when we detect a change in one of the parameters
   * important for Hot Standby.
***************
*** 763,769 ****
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 772,778 ----
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 909,915 ****
  	 * just turned off, we could recompute the record without full pages, but
  	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 918,924 ----
  	 * just turned off, we could recompute the record without full pages, but
  	 * we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 6370,6377 ****
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
  		UpdateControlFile();
  
! 		/* initialize our local copy of minRecoveryPoint */
  		minRecoveryPoint = ControlFile->minRecoveryPoint;
  
  		/*
  		 * Reset pgstat data, because it may be invalid after recovery.
--- 6379,6387 ----
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
  		UpdateControlFile();
  
! 		/* initialize our local copy of minRecoveryPoint and fullPageWrites */
  		minRecoveryPoint = ControlFile->minRecoveryPoint;
+ 		master_fpw = ControlFile->checkPointCopy.fullPageWrites;
  
  		/*
  		 * Reset pgstat data, because it may be invalid after recovery.
***************
*** 6865,6870 ****
--- 6875,6889 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * the startup updates FPW after REDO. However, it must perform before writing
+ 	 * the WAL of the CHECKPOINT. This is because of the need to update own fpw to
+ 	 * shared memory before writing the WAL of its CHECKPOTNT.
+ 	 */
+ 	LocalSetXLogInsertAllowed();
+ 	ReportFpwParameters(true);
+ 	LocalXLogInsertAllowed = -1;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 7723,7728 ****
--- 7742,7750 ----
  
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
  
+ 	/* record current FPW to the WAL of the CHECKPOINT. */
+ 	checkPoint.fullPageWrites = Insert->fullPageWrites;
+ 
  	/*
  	 * Compute new REDO record ptr = location of next XLOG record.
  	 *
***************
*** 8636,8641 ****
--- 8658,8676 ----
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		memcpy(&master_fpw, XLogRecGetData(record), sizeof(master_fpw));
+ 
+ 		/* record the LSN when FPW is changed false on master */
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		if (!master_fpw)
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 	}
  }
  
  void
***************
*** 8650,8656 ****
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
  						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
! 						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
--- 8685,8691 ----
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
  						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
! 						 "oldest xid %u in DB %u; oldest running xid %u; full_page_writes %s; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
***************
*** 8660,8665 ****
--- 8695,8701 ----
  						 checkpoint->oldestXid,
  						 checkpoint->oldestXidDB,
  						 checkpoint->oldestActiveXid,
+ 						 checkpoint->fullPageWrites ? "true" : "false",
  				 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
  	}
  	else if (info == XLOG_NOOP)
***************
*** 8717,8722 ****
--- 8753,8766 ----
  						 xlrec.max_locks_per_xact,
  						 wal_level_str);
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		bool fpw;
+ 
+ 		memcpy(&fpw, rec, sizeof(fpw));
+ 		appendStringInfo(buf, "fpw change: %s",
+ 						 fpw ? "true" : "false");
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
***************
*** 9089,9094 ****
--- 9133,9149 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW is 'off' when latest CHECKPOINT or
+ 		 * since then.
+ 		 */
+ 		if (recovery_in_progress &&
+ 			(!ControlFile->checkPointCopy.fullPageWrites ||
+ 			 XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN)))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("full_page_writes on master is set invalid more than once since latest checkpoint")));
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9372,9377 ****
--- 9427,9438 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW is 'off' since pg_start_backup. */
+ 	if (recovery_in_progress && XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 			  errmsg("full_page_writes on master is set invalid more than once during online backup")));
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
***************
*** 10743,10745 ****
--- 10804,10856 ----
  {
  	SetLatch(&XLogCtl->recoveryWakeupLatch);
  }
+ 
+ /*
+  * insert a WAL of XLOG_FPW_CHANGE or update to the shared memory if there
+  * is a change of FPW. However, always update when the startup have finished.
+  */
+ void
+ ReportFpwParameters(bool startup_finish)
+ {
+ 	bool fpwReport = false;
+ 	bool fpwXLogInsert = true;
+ 
+ 	if (startup_finish)
+ 	{
+ 		fpwReport = true;
+ 		if (master_fpw != fullPageWrites)
+ 			fpwXLogInsert = false;
+ 	}
+ 	else
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		if (XLogCtl->Insert.fullPageWrites != fullPageWrites)
+ 			fpwReport = true;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
+ 	if (fpwReport)
+ 	{
+ 		/*
+ 		 * insert own fpw to a WAL. However, it does not perform
+ 		 * when wal_level is not 'hotstandby' or fpw is same as shared-memory.
+ 		 */
+ 		if (XLogStandbyInfoActive() && fpwXLogInsert)
+ 		{
+ 			XLogRecData rdata;
+ 			bool record = fullPageWrites;
+ 
+ 			rdata.buffer = InvalidBuffer;
+ 			rdata.data = (char *) &record;
+ 			rdata.len = sizeof(record);
+ 			rdata.next = NULL;
+ 
+ 			XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
+ 		}
+ 
+ 		/* update own fpw in shared-memory */
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ }
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-09 01:40:52.000000000 +0900
***************
*** 216,221 ****
--- 216,227 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may have not received a SIGHUP then, it updates the FPW.
+ 	 */
+ 	ReportFpwParameters(false);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 242,252 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 			/*
+ 			 * the walwriter manages the FPW. When the walwriter has received
+ 			 * a SIGHUP, it updates the FPW.
+ 			 */
+ 			ReportFpwParameters(false);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/bin/pg_controldata/pg_controldata.c postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c
*** postgresql_with_9fujii_patch/src/bin/pg_controldata/pg_controldata.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c	2011-10-09 01:40:52.000000000 +0900
***************
*** 224,229 ****
--- 224,231 ----
  		   ControlFile.checkPointCopy.oldestXidDB);
  	printf(_("Latest checkpoint's oldestActiveXID:  %u\n"),
  		   ControlFile.checkPointCopy.oldestActiveXid);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? "true" : "false");
  	printf(_("Time of latest checkpoint:            %s\n"),
  		   ckpttime_str);
  	printf(_("Minimum recovery ending location:     %X/%X\n"),
diff -rcN postgresql_with_9fujii_patch/src/bin/pg_resetxlog/pg_resetxlog.c postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c
*** postgresql_with_9fujii_patch/src/bin/pg_resetxlog/pg_resetxlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c	2011-10-09 01:40:52.000000000 +0900
***************
*** 498,503 ****
--- 498,504 ----
  	ControlFile.checkPointCopy.oldestXidDB = InvalidOid;
  	ControlFile.checkPointCopy.time = (pg_time_t) time(NULL);
  	ControlFile.checkPointCopy.oldestActiveXid = InvalidTransactionId;
+ 	ControlFile.checkPointCopy.fullPageWrites = true;
  
  	ControlFile.state = DB_SHUTDOWNED;
  	ControlFile.time = (pg_time_t) time(NULL);
***************
*** 584,589 ****
--- 585,592 ----
  		   ControlFile.checkPointCopy.oldestXidDB);
  	printf(_("Latest checkpoint's oldestActiveXID:  %u\n"),
  		   ControlFile.checkPointCopy.oldestActiveXid);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? "true" : "false");
  	printf(_("Maximum data alignment:               %u\n"),
  		   ControlFile.maxAlign);
  	/* we don't print floatFormat since can't say much useful about it */
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-09 01:40:52.000000000 +0900
***************
*** 316,321 ****
--- 316,322 ----
  extern void StartupProcessMain(void);
  extern bool CheckPromoteSignal(void);
  extern void WakeupRecovery(void);
+ extern void ReportFpwParameters(bool startup_finish);
  
  /*
   * Starting/stopping a base backup
diff -rcN postgresql_with_9fujii_patch/src/include/catalog/pg_control.h postgresql_with_patch/src/include/catalog/pg_control.h
*** postgresql_with_9fujii_patch/src/include/catalog/pg_control.h	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/include/catalog/pg_control.h	2011-10-09 01:40:51.000000000 +0900
***************
*** 49,54 ****
--- 49,59 ----
  	 * it's set to InvalidTransactionId.
  	 */
  	TransactionId oldestActiveXid;
+ 
+ 	/*
+ 	 * current FPW. It is used when executing pg_start_backup on hot standby.
+ 	 */
+ 	bool		fullPageWrites;
  } CheckPoint;
  
  /* XLOG info values for XLOG rmgr */
***************
*** 60,65 ****
--- 65,71 ----
  #define XLOG_BACKUP_END					0x50
  #define XLOG_PARAMETER_CHANGE			0x60
  #define XLOG_RESTORE_POINT				0x70
+ #define XLOG_FPW_CHANGE					0x80
  
  
  /*

#35

Simon Riggs

simon@2ndQuadrant.com

over 14 years ago

In reply to: Jun Ishiduka (#34)

Re: Online base backup from the hot-standby

2011/10/9 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

Insert a WAL record containing the current FPW value (on master)
* At the same time as each update, they insert a WAL record (named
XLOG_FPW_CHANGE). XLOG_FPW_CHANGE carries the changed FPW value.
* When a CHECKPOINT is created, the current FPW value is added to the
CHECKPOINT WAL record.

I can't see a reason why we would use a new WAL record for this,
rather than modify the XLOG_PARAMETER_CHANGE record type which was
created for a very similar reason.
The code would be much simpler if we just extend
XLOG_PARAMETER_CHANGE, so please can we do that?

The log message "full_page_writes on master is set invalid more than
once during online backup" should read "at least once" rather than
"more than once".

lastFpwDisabledLSN needs to be initialized.

Is there a reason to add lastFpwDisabledLSN onto the Control file? If
we log parameters after every checkpoint then we'll know the values
when we startup. If we keep logging parameters this way we'll end up
with a very awkward and large control file. I would personally prefer
to avoid that, but that thought could go either way. Let's see if
anyone else thinks that also.

Looks good.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#36

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Simon Riggs (#35)

Re: Online base backup from the hot-standby

I can't see a reason why we would use a new WAL record for this,
rather than modify the XLOG_PARAMETER_CHANGE record type which was
created for a very similar reason.
The code would be much simpler if we just extend
XLOG_PARAMETER_CHANGE, so please can we do that?

Sure.

The log message "full_page_writes on master is set invalid more than
once during online backup" should read "at least once" rather than
"more than once".

Yes.

lastFpwDisabledLSN needs to be initialized.

I think that is not needed, because all values in XLogCtl are initialized to 0.

Is there a reason to add lastFpwDisabledLSN onto the Control file? If
we log parameters after every checkpoint then we'll know the values
when we startup. If we keep logging parameters this way we'll end up
with a very awkward and large control file. I would personally prefer
to avoid that, but that thought could go either way. Let's see if
anyone else thinks that also.

Yes. I will add it to CreateCheckPoint().

Image:

CreateCheckPoint()
{
    if (!shutdown && XLogStandbyInfoActive())
    {
        LogStandbySnapshot()
        XLogReportParameters()
    }
}

XLogReportParameters()
{
    if (fpw == 'off' || ... )
        XLOGINSERT()
}

However, this will write an XLOG_PARAMETER_CHANGE record at every checkpoint
while FPW is 'off', which increases the amount of WAL.
Is that OK?

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#37

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Jun Ishiduka (#36)

1 attachment(s)

Re: Online base backup from the hot-standby

I can't see a reason why we would use a new WAL record for this,
rather than modify the XLOG_PARAMETER_CHANGE record type which was
created for a very similar reason.
The code would be much simpler if we just extend
XLOG_PARAMETER_CHANGE, so please can we do that?

Sure.

The log message "full_page_writes on master is set invalid more than
once during online backup" should read "at least once" rather than
"more than once".

Yes.

lastFpwDisabledLSN needs to be initialized.

I think that is not needed, because all values in XLogCtl are initialized to 0.

Is there a reason to add lastFpwDisabledLSN onto the Control file? If
we log parameters after every checkpoint then we'll know the values
when we startup. If we keep logging parameters this way we'll end up
with a very awkward and large control file. I would personally prefer
to avoid that, but that thought could go either way. Let's see if
anyone else thinks that also.

Yes. I will add it to CreateCheckPoint().

Image:

CreateCheckPoint()
{
    if (!shutdown && XLogStandbyInfoActive())
    {
        LogStandbySnapshot()
        XLogReportParameters()
    }
}

XLogReportParameters()
{
    if (fpw == 'off' || ... )
        XLOGINSERT()
}

However, this will write an XLOG_PARAMETER_CHANGE record at every checkpoint
while FPW is 'off', which increases the amount of WAL.
Is that OK?

Done.

Updated patch attached.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base-02fpw.patchapplication/octet-stream; name=standby_online_backup_09base-02fpw.patchDownload

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-11 14:58:50.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* the startup or the walwriter is logged to its own FPW */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* latest LSN that has recovered a WAL which fpw is 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 574,579 ****
--- 580,586 ----
  	int			max_prepared_xacts;
  	int			max_locks_per_xact;
  	int			wal_level;
+ 	bool		fullPageWrites;
  } xl_parameter_change;
  
  /* logs restore point */
***************
*** 612,618 ****
  static void SetLatestXTime(TimestampTz xtime);
  static TimestampTz GetLatestXTime(void);
  static void CheckRequiredParameterValues(void);
- static void XLogReportParameters(void);
  static void LocalSetXLogInsertAllowed(void);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
  static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
--- 619,624 ----
***************
*** 763,769 ****
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 769,775 ----
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 909,915 ****
  	 * just turned off, we could recompute the record without full pages, but
  	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 915,921 ----
  	 * just turned off, we could recompute the record without full pages, but
  	 * we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 6865,6870 ****
--- 6871,6886 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * The startup updates FPW in shaerd-memory after REDO. However, it must
+ 	 * perform before writing the WAL of the CHECKPOINT. The reason is that
+ 	 * it uses a value of fpw in shared-memory when it writes a WAL of its
+ 	 * CHECKPOTNT.
+ 	 */
+ 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 	LWLockRelease(WALInsertLock);
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 6998,7004 ****
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters();
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
--- 7014,7020 ----
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters(true);
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
***************
*** 7856,7862 ****
--- 7872,7881 ----
  	 * Update checkPoint.nextXid since we have a later value
  	 */
  	if (!shutdown && XLogStandbyInfoActive())
+ 	{
  		LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+ 		XLogReportParameters(false);
+ 	}
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 8381,8393 ****
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! static void
! XLogReportParameters(void)
  {
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
--- 8400,8422 ----
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! void
! XLogReportParameters(bool fpw_manager)
  {
+ 	bool fpw = fullPageWrites;
+ 
+ 	if (!fpw_manager)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		fpw = XLogCtl->Insert.fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
! 		!fpw)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
***************
*** 8396,8402 ****
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
--- 8425,8431 ----
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded() || !fpw)
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
***************
*** 8405,8410 ****
--- 8434,8440 ----
  			xlrec.max_prepared_xacts = max_prepared_xacts;
  			xlrec.max_locks_per_xact = max_locks_per_xact;
  			xlrec.wal_level = wal_level;
+ 			xlrec.fullPageWrites = fpw;
  
  			rdata.buffer = InvalidBuffer;
  			rdata.data = (char *) &xlrec;
***************
*** 8420,8425 ****
--- 8450,8463 ----
  		ControlFile->wal_level = wal_level;
  		UpdateControlFile();
  	}
+ 
+ 	/* update own fpw in shared-memory when it has managed fpw */
+ 	if (fpw_manager)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
  }
  
  /*
***************
*** 8604,8609 ****
--- 8642,8650 ----
  	}
  	else if (info == XLOG_PARAMETER_CHANGE)
  	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
  		xl_parameter_change xlrec;
  
  		/* Update our copy of the parameters in pg_control */
***************
*** 8633,8638 ****
--- 8674,8687 ----
  		UpdateControlFile();
  		LWLockRelease(ControlFileLock);
  
+ 		/* record the LSN when FPW is false on master */
+ 		if (!xlrec.fullPageWrites)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
***************
*** 8711,8721 ****
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
--- 8760,8771 ----
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s full_page_writes=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str,
! 						 xlrec.fullPageWrites ? "true" : "false");
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
***************
*** 9089,9094 ****
--- 9139,9152 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW is 'off' since latest CHECKPOINT.
+ 		 */
+ 		if (recovery_in_progress && XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("full_page_writes on master is set invalid at least once since latest checkpoint")));
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9372,9377 ****
--- 9430,9441 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW is 'off' since pg_start_backup. */
+ 	if (recovery_in_progress && XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 			  errmsg("full_page_writes on master is set invalid at least once during online backup")));
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-11 14:53:58.000000000 +0900
***************
*** 216,221 ****
--- 216,227 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may have not received a SIGHUP then, it updates the FPW.
+ 	 */
+ 	XLogReportParameters(true);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 242,253 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * The walwriter manages the FPW. When the walwriter has received
+ 			 * a SIGHUP, it updates the FPW.
+ 			 */
+ 			XLogReportParameters(true);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-11 14:53:58.000000000 +0900
***************
*** 306,311 ****
--- 306,312 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void XLogReportParameters(bool fpw_manager);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);

#38

Steve Singer

ssinger_pg@sympatico.ca

over 14 years ago

In reply to: Jun Ishiduka (#37)

Re: Online base backup from the hot-standby

On 11-10-11 11:17 AM, Jun Ishiduka wrote:

Done.

Updated patch attached.

I have taken Jun's latest patch and applied it on top of Fujii's most
recent patch. I did some testing with the result, but nothing thorough
enough to stumble on any race conditions.

Some testing notes
------------------------------
select pg_start_backup('x');
ERROR: full_page_writes on master is set invalid at least once since
latest checkpoint

I think this error should be rewritten as
ERROR: full_page_writes on master has been off at some point since
latest checkpoint

We should be using 'off' instead of 'invalid' since that is what
the user sets it to.

I switched full_page_writes=on on the master, then
did a pg_start_backup() on slave1.

Then I switched full_page_writes=off on the master, did a reload +
checkpoint.

I was able to then do my backup of slave1, copy the control file, and
pg_stop_backup().
When I did the test slave2 started okay, but is this safe? Do we need a
warning from pg_stop_backup() that is printed if it is detected that
full_page_writes was turned off on the master during the backup period?
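
For reference, the comparison that the patch adds to do_pg_stop_backup() can be modeled outside the server. This is a minimal sketch, not PostgreSQL code: `RecPtr`, `recptr_le` and `backup_is_unsafe` are hypothetical names that only mirror the two-field `XLogRecPtr` and the `XLByteLE()` test used in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the two-field XLogRecPtr of this era (hypothetical name). */
typedef struct RecPtr
{
	uint32_t	xlogid;		/* log file number */
	uint32_t	xrecoff;	/* byte offset within the log file */
} RecPtr;

/* Models XLByteLE(a, b): true when a is at or before b. */
bool
recptr_le(RecPtr a, RecPtr b)
{
	return a.xlogid < b.xlogid ||
		(a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * The backup is rejected when a record carrying full_page_writes=off was
 * replayed at or after the backup start location. A zeroed
 * last_fpw_disabled means no such record has been replayed so far.
 */
bool
backup_is_unsafe(RecPtr startpoint, RecPtr last_fpw_disabled)
{
	return recptr_le(startpoint, last_fpw_disabled);
}
```

In the scenario above, the 'off' record is generated after pg_start_backup(), so once the standby has replayed it the recorded location lies past the start point and the check would be expected to fire. If the record has not been replayed yet when pg_stop_backup() runs, the check cannot see it.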

Code Notes
---------------------
*** 6865,6870 ****
--- 6871,6886 ----
/* Pre-scan prepared transactions to find out the range of XIDs present */
oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);

+ /*
+ * The startup updates FPW in shaerd-memory after REDO. However, it must
+ * perform before writing the WAL of the CHECKPOINT. The reason is that
+ * it uses a value of fpw in shared-memory when it writes a WAL of its
+ * CHECKPOTNT.
+ */

Minor typo above at 'CHECKPOTNT'

If my concern about full page writes being switched to off in the middle
of a backup is unfounded then I think this patch is ready for a
committer. They can clean the two editorial changes when they apply the
patches.

If do_pg_stop_backup is going to need some logic to recheck the full
page write status then an updated patch is required.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#39

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Steve Singer (#38)

Re: Online base backup from the hot-standby

Some testing notes
------------------------------
select pg_start_backup('x');
ERROR: full_page_writes on master is set invalid at least once since
latest checkpoint

I think this error should be rewritten as
ERROR: full_page_writes on master has been off at some point since
latest checkpoint

We should be using 'off' instead of 'invalid' since that is what
the user sets it to.

Sure.

I switched full_page_writes=on , on the master

did a pg_start_backup() on the slave1.

Then I switched full_page_writes=off on the master, did a reload +
checkpoint.

I was able to then do my backup of slave1, copy the control file, and
pg_stop_backup().

When I did the test slave2 started okay, but is this safe? Do we need a
warning from pg_stop_backup() that is printed if it is detected that
full_page_writes was turned off on the master during the backup period?

I was able to reproduce this as well.

pg_stop_backup() fails in most cases.
However, it succeeds if both of the following conditions are true:
* the checkpoint completes before the walwriter receives SIGHUP.
* slave1 has not yet received the WAL record for 'off' that the SIGHUP generates.

Minor typo above at 'CHECKPOTNT'

Yes.

If my concern about full page writes being switched to off in the middle
of a backup is unfounded then I think this patch is ready for a
committer. They can clean the two editorial changes when they apply the
patches.

Yes. I'll clean these up when I fix the patch for these comments.

If do_pg_stop_backup is going to need some logic to recheck the full
page write status then an updated patch is required.

The patch already contains that logic.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#40

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

over 14 years ago

In reply to: Jun Ishiduka (#39)

1 attachment(s)

Re: Online base backup from the hot-standby

Some testing notes
------------------------------
select pg_start_backup('x');
ERROR: full_page_writes on master is set invalid at least once since
latest checkpoint

I think this error should be rewritten as
ERROR: full_page_writes on master has been off at some point since
latest checkpoint

We should be using 'off' instead of 'invalid' since that is what
the user sets it to.

Sure.

Minor typo above at 'CHECKPOTNT'

Yes.

I updated the patch to address the comments above.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base-03fpw.patch (application/octet-stream)

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-12 06:23:15.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* the startup or the walwriter is logged to its own FPW */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* latest LSN that has recovered a WAL which fpw is 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 574,579 ****
--- 580,586 ----
  	int			max_prepared_xacts;
  	int			max_locks_per_xact;
  	int			wal_level;
+ 	bool		fullPageWrites;
  } xl_parameter_change;
  
  /* logs restore point */
***************
*** 612,618 ****
  static void SetLatestXTime(TimestampTz xtime);
  static TimestampTz GetLatestXTime(void);
  static void CheckRequiredParameterValues(void);
- static void XLogReportParameters(void);
  static void LocalSetXLogInsertAllowed(void);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
  static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
--- 619,624 ----
***************
*** 763,769 ****
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 769,775 ----
  	 * don't yet have the insert lock, forcePageWrites could change under us,
  	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 909,915 ****
  	 * just turned off, we could recompute the record without full pages, but
  	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 915,921 ----
  	 * just turned off, we could recompute the record without full pages, but
  	 * we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 6865,6870 ****
--- 6871,6886 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * The startup updates FPW in shaerd-memory after REDO. However, it must
+ 	 * perform before writing the WAL of the CHECKPOINT. The reason is that
+ 	 * it uses a value of fpw in shared-memory when it writes a WAL of its
+ 	 * CHECKPOINT.
+ 	 */
+ 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 	LWLockRelease(WALInsertLock);
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 6998,7004 ****
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters();
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
--- 7014,7020 ----
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters(true);
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
***************
*** 7860,7865 ****
--- 7876,7885 ----
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
+ 	/* Write FPW parameter to WAL at CHECKPOINT */
+ 	if (XLogStandbyInfoActive())
+ 		XLogReportParameters(false);
+ 
  	START_CRIT_SECTION();
  
  	/*
***************
*** 8381,8393 ****
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! static void
! XLogReportParameters(void)
  {
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
--- 8401,8423 ----
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! void
! XLogReportParameters(bool fpw_manager)
  {
+ 	bool fpw = fullPageWrites;
+ 
+ 	if (!fpw_manager)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		fpw = XLogCtl->Insert.fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
! 		!fpw)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
***************
*** 8396,8402 ****
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
--- 8426,8432 ----
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded() || !fpw)
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
***************
*** 8405,8410 ****
--- 8435,8441 ----
  			xlrec.max_prepared_xacts = max_prepared_xacts;
  			xlrec.max_locks_per_xact = max_locks_per_xact;
  			xlrec.wal_level = wal_level;
+ 			xlrec.fullPageWrites = fpw;
  
  			rdata.buffer = InvalidBuffer;
  			rdata.data = (char *) &xlrec;
***************
*** 8420,8425 ****
--- 8451,8464 ----
  		ControlFile->wal_level = wal_level;
  		UpdateControlFile();
  	}
+ 
+ 	/* update own fpw in shared-memory when it has managed fpw */
+ 	if (fpw_manager)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
  }
  
  /*
***************
*** 8604,8609 ****
--- 8643,8651 ----
  	}
  	else if (info == XLOG_PARAMETER_CHANGE)
  	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
  		xl_parameter_change xlrec;
  
  		/* Update our copy of the parameters in pg_control */
***************
*** 8633,8638 ****
--- 8675,8688 ----
  		UpdateControlFile();
  		LWLockRelease(ControlFileLock);
  
+ 		/* record the LSN when FPW is false on master */
+ 		if (!xlrec.fullPageWrites)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
***************
*** 8711,8721 ****
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
--- 8761,8772 ----
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s full_page_writes=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str,
! 						 xlrec.fullPageWrites ? "true" : "false");
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
***************
*** 9089,9094 ****
--- 9140,9153 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW is 'off' since latest CHECKPOINT.
+ 		 */
+ 		if (recovery_in_progress && XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("full_page_writes on master has been off at some point since latest checkpoint")));
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9372,9377 ****
--- 9431,9442 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW is 'off' since pg_start_backup. */
+ 	if (recovery_in_progress && XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN))
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 			  errmsg("full_page_writes on master has been off at some point during online backup")));
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-12 06:23:15.000000000 +0900
***************
*** 216,221 ****
--- 216,227 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may have not received a SIGHUP then, it updates the FPW.
+ 	 */
+ 	XLogReportParameters(true);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 242,253 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * The walwriter manages the FPW. When the walwriter has received
+ 			 * a SIGHUP, it updates the FPW.
+ 			 */
+ 			XLogReportParameters(true);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-12 06:23:15.000000000 +0900
***************
*** 306,311 ****
--- 306,312 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void XLogReportParameters(bool fpw_manager);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);

#41

Fujii Masao

masao.fujii@gmail.com

over 14 years ago

In reply to: Jun Ishiduka (#40)

Re: Online base backup from the hot-standby

2011/10/12 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

ERROR: full_page_writes on master is set invalid at least once since
latest checkpoint

I think this error should be rewritten as
ERROR: full_page_writes on master has been off at some point since
latest checkpoint

We should be using 'off' instead of 'invalid' since that is what
the user sets it to.

Sure.

What about the following message? It sounds more precise to me.

ERROR: WAL generated with full_page_writes=off was replayed since last
restartpoint

I updated the patch to address the comments above.

Thanks for updating the patch! Here are the comments:

 	 * don't yet have the insert lock, forcePageWrites could change under us,
 	 * but we'll recheck it once we have the lock.
 	 */
-	doPageWrites = fullPageWrites || Insert->forcePageWrites;
+	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;

The source comment needs to be modified.

 	 * just turned off, we could recompute the record without full pages, but
 	 * we choose not to bother.)
 	 */
-	if (Insert->forcePageWrites && !doPageWrites)
+	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)

Same as above.

+	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+	LWLockRelease(WALInsertLock);

I don't think WALInsertLock needs to be held here because there is no
concurrently running process which can access Insert.fullPageWrites.
For example, Insert->currpos and Insert->LogwrtResult are also changed
without the lock there.

The source comment of XLogReportParameters() needs to be modified.

XLogReportParameters() should skip writing WAL if full_page_writes has not been
changed by SIGHUP.

XLogReportParameters() should skip updating pg_control if no parameter related
to hot standby has been changed.
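
The two skip conditions above could be separated roughly as follows. This is a sketch under assumptions, not the patch's code: `HsParams`, `ReportAction`, `hs_params_equal` and `decide_report` are hypothetical names, the fields only mirror the parameters named in this thread, and the existing wal_level / XLogIsNeeded() gating is left out for brevity.

```c
#include <stdbool.h>

/* Hot-standby-related settings tracked in the control file (illustrative). */
typedef struct HsParams
{
	int		max_connections;
	int		max_prepared_xacts;
	int		max_locks_per_xact;
	int		wal_level;
} HsParams;

/* What the reporting routine should do on this call (hypothetical type). */
typedef struct ReportAction
{
	bool	write_wal;			/* emit a parameter-change record */
	bool	update_control;		/* rewrite the control file */
} ReportAction;

bool
hs_params_equal(const HsParams *a, const HsParams *b)
{
	return a->max_connections == b->max_connections &&
		a->max_prepared_xacts == b->max_prepared_xacts &&
		a->max_locks_per_xact == b->max_locks_per_xact &&
		a->wal_level == b->wal_level;
}

/*
 * Write WAL only when something actually changed; touch the control file
 * only when one of the hot-standby parameters changed.
 */
ReportAction
decide_report(const HsParams *current, const HsParams *recorded,
			  bool fpw_current, bool fpw_last_logged)
{
	bool		hs_changed = !hs_params_equal(current, recorded);
	bool		fpw_changed = (fpw_current != fpw_last_logged);
	ReportAction action;

	action.update_control = hs_changed;
	action.write_wal = hs_changed || fpw_changed;
	return action;
}
```

Compared with testing `!fpw` directly, comparing against the last logged value means a record is written once per change of full_page_writes rather than on every call while it is off.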

+	if (!fpw_manager)
+	{
+		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+		fpw = XLogCtl->Insert.fullPageWrites;
+		LWLockRelease(WALInsertLock);

It's safe to take WALInsertLock with shared mode here.

In checkpoint, XLogReportParameters() is called only when wal_level is
hot_standby.
OTOH, in walwriter, it's always called even when wal_level is not hot_standby.
Can't we skip calling XLogReportParameters() whenever wal_level is not
hot_standby?

In do_pg_start_backup() and do_pg_stop_backup(), the spinlock must be held to
see XLogCtl->lastFpwDisabledLSN.
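
The usual pattern for such a read is to copy the shared value into a local variable while holding the lock and to compare only the copy afterwards. The sketch below illustrates that pattern with C11 atomics rather than PostgreSQL's own spinlock primitives; `SharedCtl`, `spin_acquire`, `spin_release`, `set_last_fpw_disabled` and `read_last_fpw_disabled` are hypothetical names, and the LSN is flattened to a single 64-bit integer for simplicity.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Illustrative shared state; not the real XLogCtlData layout. */
typedef struct SharedCtl
{
	atomic_flag	info_lck;			/* stand-in for the info_lck spinlock */
	uint64_t	last_fpw_disabled;	/* protected by info_lck */
} SharedCtl;

void
spin_acquire(atomic_flag *lock)
{
	/* Busy-wait until the flag was previously clear. */
	while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
		;
}

void
spin_release(atomic_flag *lock)
{
	atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Writer side: what the redo routine would do when it sees fpw=off. */
void
set_last_fpw_disabled(SharedCtl *ctl, uint64_t lsn)
{
	spin_acquire(&ctl->info_lck);
	ctl->last_fpw_disabled = lsn;
	spin_release(&ctl->info_lck);
}

/* Reader side: take a private copy under the lock, then use the copy. */
uint64_t
read_last_fpw_disabled(SharedCtl *ctl)
{
	uint64_t	copy;

	spin_acquire(&ctl->info_lck);
	copy = ctl->last_fpw_disabled;
	spin_release(&ctl->info_lck);
	return copy;
}
```

Holding the lock only for the copy keeps the critical section short, and the later comparison and any error reporting run without the lock held.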

+	/* check whether the master's FPW is 'off' since pg_start_backup. */
+	if (recovery_in_progress && XLByteLE(startpoint, XLogCtl->lastFpwDisabledLSN))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+			  errmsg("full_page_writes on master has been off at some point during online backup")));

What about changing the error message to:
ERROR: WAL generated with full_page_writes=off was replayed during online backup

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#42

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Fujii Masao (#41)

Re: Online base backup from the hot-standby

ERROR: full_page_writes on master is set invalid at least once since
latest checkpoint

I think this error should be rewritten as
ERROR: full_page_writes on master has been off at some point since
latest checkpoint

We should be using 'off' instead of 'invalid' since that is what
the user sets it to.

Sure.

What about the following message? It sounds more precise to me.

ERROR: WAL generated with full_page_writes=off was replayed since last
restartpoint

Okay, I will change the patch to use this message.
If someone suggests a better idea, I will reconsider.

I updated the patch to address the comments above.

Thanks for updating the patch! Here are the comments:

* don't yet have the insert lock, forcePageWrites could change under us,
* but we'll recheck it once we have the lock.
*/
-	doPageWrites = fullPageWrites || Insert->forcePageWrites;
+	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;

The source comment needs to be modified.

* just turned off, we could recompute the record without full pages, but
* we choose not to bother.)
*/
-	if (Insert->forcePageWrites && !doPageWrites)
+	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)

Same as above.

Sure.

XLogReportParameters() should skip writing WAL if full_page_writes has not been
changed by SIGHUP.

XLogReportParameters() should skip updating pg_control if any parameter related
to hot standby has not been changed.

YES.

In checkpoint, XLogReportParameters() is called only when wal_level is
hot_standby.
OTOH, in walwriter, it's always called even when wal_level is not hot_standby.
Can't we skip calling XLogReportParameters() whenever wal_level is not
hot_standby?

Yes, it is possible.

In do_pg_start_backup() and do_pg_stop_backup(), the spinlock must be held to
see XLogCtl->lastFpwDisabledLSN.

Yes.

What about changing the error message to:
ERROR: WAL generated with full_page_writes=off was replayed during online backup

Okay, too.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#43

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Fujii Masao (#41)

Re: Online base backup from the hot-standby

Sorry.
I did not answer all of Fujii's comments in my previous mail.
Here are the remaining answers.

+	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+	LWLockRelease(WALInsertLock);
I don't think WALInsertLock needs to be held here because there is no
concurrently running process which can access Insert.fullPageWrites.
For example, Insert->currpos and Insert->LogwrtResult are also changed
without the lock there.

Yes.

The source comment of XLogReportParameters() needs to be modified.

Yes, too.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#44

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Jun Ishiduka (#43)

1 attachment(s)

Re: Online base backup from the hot-standby

ERROR: full_page_writes on master is set invalid at least once since
latest checkpoint

I think this error should be rewritten as
ERROR: full_page_writes on master has been off at some point since
latest checkpoint

We should be using 'off' instead of 'invalid' since that is what
the user sets it to.

Sure.

What about the following message? It sounds more precise to me.

ERROR: WAL generated with full_page_writes=off was replayed since last
restartpoint

Okay, I changed the patch to use this message.
If someone suggests a better idea, I will reconsider.
I updated the patch to address the above comments.

Thanks for updating the patch! Here are the comments:
* don't yet have the insert lock, forcePageWrites could change under us,
* but we'll recheck it once we have the lock.
*/
-	doPageWrites = fullPageWrites || Insert->forcePageWrites;
+	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
The source comment needs to be modified.
* just turned off, we could recompute the record without full pages, but
* we choose not to bother.)
*/
-	if (Insert->forcePageWrites && !doPageWrites)
+	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
Same as above.
Sure.

XLogReportParameters() should skip writing WAL if full_page_writes has not been
changed by SIGHUP.

XLogReportParameters() should skip updating pg_control if any parameter related
to hot standby has not been changed.

YES.

In checkpoint, XLogReportParameters() is called only when wal_level is
hot_standby.
OTOH, in walwriter, it's always called even when wal_level is not hot_standby.
Can't we skip calling XLogReportParameters() whenever wal_level is not
hot_standby?

Yes, it is possible.

In do_pg_start_backup() and do_pg_stop_backup(), the spinlock must be held to
see XLogCtl->lastFpwDisabledLSN.

Yes.

What about changing the error message to:
ERROR: WAL generated with full_page_writes=off was replayed during online backup

Okay, too.

Sorry.
I was not able to answer all of Fujii's comments earlier.
Here are the remaining answers.
+	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+	LWLockRelease(WALInsertLock);
I don't think WALInsertLock needs to be held here because there is no
concurrently running process which can access Insert.fullPageWrites.
For example, Insert->currpos and Insert->LogwrtResult are also changed
without the lock there.
Yes.

The source comment of XLogReportParameters() needs to be modified.

Yes, too.

Done.
I updated the patch to address the above comments.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base-04fpw.patch (application/octet-stream)

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-13 09:21:26.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* FPW value last published by the startup process or the walwriter */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* LSN of the latest replayed WAL record reporting FPW as 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 574,579 ****
--- 580,586 ----
  	int			max_prepared_xacts;
  	int			max_locks_per_xact;
  	int			wal_level;
+ 	bool		fullPageWrites;
  } xl_parameter_change;
  
  /* logs restore point */
***************
*** 612,618 ****
  static void SetLatestXTime(TimestampTz xtime);
  static TimestampTz GetLatestXTime(void);
  static void CheckRequiredParameterValues(void);
- static void XLogReportParameters(void);
  static void LocalSetXLogInsertAllowed(void);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
  static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
--- 619,624 ----
***************
*** 759,769 ****
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 765,776 ----
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes in shared-memory is on or we have a PITR request for
! 	 * it.  Since we don't yet have the insert lock, fullPageWrites or
! 	 * forcePageWrites could change under us, but we'll recheck it once we
! 	 * have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 904,915 ****
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 911,922 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just
! 	 * turned on; if we weren't already doing full-page writes then go back
! 	 * and recompute. (If it was just turned off, we could recompute the
! 	 * record without full pages, but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 6865,6870 ****
--- 6872,6885 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * The startup process updates FPW in shared memory after REDO. However,
+ 	 * it must do so before writing the WAL record of the CHECKPOINT, because
+ 	 * it uses the value of FPW in shared memory when it writes the WAL of
+ 	 * its CHECKPOINT.
+ 	 */
+ 	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 6998,7004 ****
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters();
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
--- 7013,7019 ----
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters(REPORT_ON_STARTUP);
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
***************
*** 7856,7862 ****
--- 7871,7880 ----
  	 * Update checkPoint.nextXid since we have a later value
  	 */
  	if (!shutdown && XLogStandbyInfoActive())
+ 	{
  		LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+ 		XLogReportParameters(REPORT_ON_BACKEND);
+ 	}
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 8380,8393 ****
  /*
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! static void
! XLogReportParameters(void)
  {
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
--- 8398,8433 ----
  /*
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
+  * This function is called at three points (a backend executes a checkpoint,
+  * startup finishes, and the walwriter receives SIGHUP). The backend and the
+  * startup process write a WAL record when FPW is 'off', in addition to when
+  * any of the GUC parameters has changed.
   */
! void
! XLogReportParameters(int updatetiming)
  {
+ 	bool do_fpw_xloginsert = false;
+ 	bool fpw = fullPageWrites;
+ 
+ 	if (updatetiming == REPORT_ON_BACKEND)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		fpw = XLogCtl->Insert.fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
+ 	if (!fpw)
+ 	{
+ 		if (updatetiming <= REPORT_ON_STARTUP ||
+ 			(updatetiming == REPORT_ON_WALWRITER && XLogCtl->Insert.fullPageWrites))
+ 			do_fpw_xloginsert = true;
+ 	}
+ 
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
! 		do_fpw_xloginsert)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
***************
*** 8396,8402 ****
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
--- 8436,8442 ----
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded() || do_fpw_xloginsert)
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
***************
*** 8405,8410 ****
--- 8445,8451 ----
  			xlrec.max_prepared_xacts = max_prepared_xacts;
  			xlrec.max_locks_per_xact = max_locks_per_xact;
  			xlrec.wal_level = wal_level;
+ 			xlrec.fullPageWrites = fpw;
  
  			rdata.buffer = InvalidBuffer;
  			rdata.data = (char *) &xlrec;
***************
*** 8414,8424 ****
  			XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE, &rdata);
  		}
  
! 		ControlFile->MaxConnections = MaxConnections;
! 		ControlFile->max_prepared_xacts = max_prepared_xacts;
! 		ControlFile->max_locks_per_xact = max_locks_per_xact;
! 		ControlFile->wal_level = wal_level;
! 		UpdateControlFile();
  	}
  }
  
--- 8455,8476 ----
  			XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE, &rdata);
  		}
  
! 		if (!do_fpw_xloginsert)
! 		{
! 			ControlFile->MaxConnections = MaxConnections;
! 			ControlFile->max_prepared_xacts = max_prepared_xacts;
! 			ControlFile->max_locks_per_xact = max_locks_per_xact;
! 			ControlFile->wal_level = wal_level;
! 			UpdateControlFile();
! 		}
! 	}
! 
! 	/* publish our own FPW to shared memory if this process manages FPW */
! 	if (updatetiming >= REPORT_ON_STARTUP)
! 	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		XLogCtl->Insert.fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
***************
*** 8604,8609 ****
--- 8656,8664 ----
  	}
  	else if (info == XLOG_PARAMETER_CHANGE)
  	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
  		xl_parameter_change xlrec;
  
  		/* Update our copy of the parameters in pg_control */
***************
*** 8633,8638 ****
--- 8688,8701 ----
  		UpdateControlFile();
  		LWLockRelease(ControlFileLock);
  
+ 		/* record the LSN when FPW is false on master */
+ 		if (!xlrec.fullPageWrites)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
***************
*** 8711,8721 ****
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
--- 8774,8785 ----
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s full_page_writes=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str,
! 						 xlrec.fullPageWrites ? "true" : "false");
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
***************
*** 8933,8938 ****
--- 8997,9003 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
  	char		xlogfilename[MAXFNAMELEN];
***************
*** 9089,9094 ****
--- 9154,9177 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW has been 'off' since the latest CHECKPOINT.
+ 		 */
+ 		if (recovery_in_progress)
+ 		{
+ 			/* use volatile pointer to prevent code rearrangement */
+ 			volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 
+ 			if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 						 errmsg("WAL generated with full_page_writes=off was replayed since latest checkpoint")));
+ 		}
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9233,9238 ****
--- 9316,9322 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	XLogRecData rdata;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
***************
*** 9372,9377 ****
--- 9456,9477 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW has been 'off' since pg_start_backup. */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					  errmsg("WAL generated with full_page_writes=off was replayed during online backup")));
+ 	}
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-13 09:21:26.000000000 +0900
***************
*** 216,221 ****
--- 216,229 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may not have received a SIGHUP yet, it updates the FPW
+ 	 * here when wal_level is hot_standby.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 		XLogReportParameters(REPORT_ON_WALWRITER);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 244,256 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * The walwriter manages the FPW. When the walwriter receives a
+ 			 * SIGHUP and wal_level is hot_standby, it updates the FPW.
+ 			 */
+ 			if (XLogStandbyInfoActive())
+ 				XLogReportParameters(REPORT_ON_WALWRITER);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-13 09:21:26.000000000 +0900
***************
*** 208,213 ****
--- 208,224 ----
  } WalLevel;
  extern int	wal_level;
  
+ /*
+  * The caller context from which xlog parameters are reported.
+  * REPORT_ON_BACKEND means a backend that is executing a CHECKPOINT.
+  */
+ typedef enum
+ {
+ 	REPORT_ON_BACKEND = 0,
+ 	REPORT_ON_STARTUP,
+ 	REPORT_ON_WALWRITER
+ } XLogParemeterUpdate;
+ 
  #define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
  #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
  
***************
*** 306,311 ****
--- 317,323 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void XLogReportParameters(int updatetiming);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);

#45

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Simon Riggs (#35)

Re: Online base backup from the hot-standby

On Mon, Oct 10, 2011 at 3:56 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

2011/10/9 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

Insert WAL including the value of the current FPW (on master)
* At the same time as the update, they insert a WAL record (named
XLOG_FPW_CHANGE). XLOG_FPW_CHANGE carries the changed FPW value.
* When it creates a CHECKPOINT, it adds the current FPW value to the
CHECKPOINT WAL record.

I can't see a reason why we would use a new WAL record for this,
rather than modify the XLOG_PARAMETER_CHANGE record type which was
created for a very similar reason.
The code would be much simpler if we just extend
XLOG_PARAMETER_CHANGE, so please can we do that?

After reading Ishiduka-san's patch, I'm thinking the opposite because
(1) Whenever full_page_writes must be WAL-logged, there is no need
to WAL-log the HS parameters. The opposite is also true. (2) How a
full_page_writes record should be replayed is quite different from
how an HS parameters record is replayed.

So ISTM that the code would be simpler if we introduce a new WAL
record for full_page_writes. Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
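
Point (2) in the message above, that the two kinds of record are replayed differently, can be illustrated with a small dispatch sketch. It is only an illustration under stated assumptions: the record layouts, the StandbyState struct and replay() are hypothetical, and REC_FPW_CHANGE merely echoes the XLOG_FPW_CHANGE name floated earlier in the thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record kinds; one per concern instead of one packed record. */
typedef enum { REC_PARAMETER_CHANGE, REC_FPW_CHANGE } RecordKind;

typedef struct { int max_connections; int wal_level; } ParamChange;	/* HS parameters only */
typedef struct { bool fpw; } FpwChange;									/* full_page_writes only */

typedef struct
{
	RecordKind	kind;
	union
	{
		ParamChange	params;
		FpwChange	fpw;
	}			u;
} Record;

/* Hypothetical standby-side state touched by replay. */
typedef struct
{
	int			max_connections;		/* stands in for the pg_control copy */
	int			wal_level;
	uint64_t	last_fpw_disabled_lsn;	/* stands in for lastFpwDisabledLSN */
} StandbyState;

static void
replay(StandbyState *st, const Record *rec, uint64_t lsn)
{
	switch (rec->kind)
	{
		case REC_PARAMETER_CHANGE:
			/* HS parameters: refresh the stored copy, nothing else. */
			st->max_connections = rec->u.params.max_connections;
			st->wal_level = rec->u.params.wal_level;
			break;
		case REC_FPW_CHANGE:
			/* FPW: only remember where 'off' was seen. */
			if (!rec->u.fpw.fpw)
				st->last_fpw_disabled_lsn = lsn;
			break;
	}
}
```

With separate kinds, each replay branch touches only its own state, and a writer never has to fill in fields that did not change.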

#46

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Jun Ishiduka (#44)

Re: Online base backup from the hot-standby

2011/10/13 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

I updated to patch corresponded above-comments.

Thanks for updating the patch!

As I suggested in the reply to Simon, I think that the change of FPW
should be WAL-logged separately from that of HS parameters. ISTM
packing them in one WAL record makes XLogReportParameters()
quite confusing. Thought?

 	if (!shutdown && XLogStandbyInfoActive())
+	{
 		LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+		XLogReportParameters(REPORT_ON_BACKEND);
+	}

Why doesn't the change of FPW need to be WAL-logged when
shutdown checkpoint is performed? It's helpful to add the comment
explaining why.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
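
For readers following the review, the condition that the 09base-04fpw patch computes into do_fpw_xloginsert inside XLogReportParameters() can be isolated as a pure function. This is a sketch that mirrors the patch logic as posted; should_log_fpw() and its parameter names are hypothetical, and the lock around the shared-memory read is omitted.

```c
#include <stdbool.h>

/* Same ordering as the enum added to xlog.h by the patch. */
typedef enum
{
	REPORT_ON_BACKEND = 0,		/* backend executing a checkpoint */
	REPORT_ON_STARTUP,			/* startup process finishing recovery */
	REPORT_ON_WALWRITER			/* walwriter at start or after SIGHUP */
} ReportTiming;

/*
 * guc_fpw    : this process's own full_page_writes setting
 * shared_fpw : the value currently stored in shared memory
 */
static bool
should_log_fpw(ReportTiming timing, bool guc_fpw, bool shared_fpw)
{
	/* A checkpointing backend reports the shared value, others their own. */
	bool		fpw = (timing == REPORT_ON_BACKEND) ? shared_fpw : guc_fpw;

	if (fpw)
		return false;			/* only FPW 'off' is ever logged */

	/* Backend and startup always log 'off'. */
	if (timing <= REPORT_ON_STARTUP)
		return true;

	/* The walwriter logs only a transition from on to off. */
	return timing == REPORT_ON_WALWRITER && shared_fpw;
}
```

Written this way, the three call sites each have a different rule, which is the asymmetry the review describes as confusing.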

#47

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Fujii Masao (#46)

Re: Online base backup from the hot-standby

As I suggested in the reply to Simon, I think that the change of FPW
should be WAL-logged separately from that of HS parameters. ISTM
packing them in one WAL record makes XLogReportParameters()
quite confusing. Thought?

I would like to wait for Simon's reply. I think we cannot decide how this
code should look without it.

if (!shutdown && XLogStandbyInfoActive())
+	{
LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+		XLogReportParameters(REPORT_ON_BACKEND);
+	}
Why doesn't the change of FPW need to be WAL-logged when
shutdown checkpoint is performed? It's helpful to add the comment
explaining why.

Sure. I will update the patch soon.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#48

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Jun Ishiduka (#47)

1 attachment(s)

Re: Online base backup from the hot-standby

if (!shutdown && XLogStandbyInfoActive())
+	{
LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+		XLogReportParameters(REPORT_ON_BACKEND);
+	}
Why doesn't the change of FPW need to be WAL-logged when
shutdown checkpoint is performed? It's helpful to add the comment
explaining why.
Sure. I will update the patch soon.

Done.
Please check this.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base-05fpw.patch (application/octet-stream)

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-15 02:00:45.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* FPW value last published by the startup process or the walwriter */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* LSN of the latest replayed WAL record reporting FPW as 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 574,579 ****
--- 580,586 ----
  	int			max_prepared_xacts;
  	int			max_locks_per_xact;
  	int			wal_level;
+ 	bool		fullPageWrites;
  } xl_parameter_change;
  
  /* logs restore point */
***************
*** 612,618 ****
  static void SetLatestXTime(TimestampTz xtime);
  static TimestampTz GetLatestXTime(void);
  static void CheckRequiredParameterValues(void);
- static void XLogReportParameters(void);
  static void LocalSetXLogInsertAllowed(void);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
  static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
--- 619,624 ----
***************
*** 759,769 ****
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 765,776 ----
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes in shared-memory is on or we have a PITR request for
! 	 * it.  Since we don't yet have the insert lock, fullPageWrites or
! 	 * forcePageWrites could change under us, but we'll recheck it once we
! 	 * have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 904,915 ****
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 911,922 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just
! 	 * turned on; if we weren't already doing full-page writes then go back
! 	 * and recompute. (If it was just turned off, we could recompute the
! 	 * record without full pages, but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 6865,6870 ****
--- 6872,6885 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * The startup process updates FPW in shared memory after REDO. However,
+ 	 * it must do so before writing the WAL record of the CHECKPOINT, because
+ 	 * it uses the value of FPW in shared memory when it writes the WAL of
+ 	 * its CHECKPOINT.
+ 	 */
+ 	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 6998,7004 ****
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters();
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
--- 7013,7019 ----
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters(REPORT_ON_STARTUP);
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
***************
*** 7856,7862 ****
--- 7871,7886 ----
  	 * Update checkPoint.nextXid since we have a later value
  	 */
  	if (!shutdown && XLogStandbyInfoActive())
+ 	{
  		LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+ 
+ 		/*
+ 		 * The backend writes a WAL record of FPW at checkpoint. However, it
+ 		 * does not need to do so at a shutdown checkpoint because the startup
+ 		 * process writes it when startup finishes.
+ 		 */
+ 		XLogReportParameters(REPORT_ON_BACKEND);
+ 	}
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 8380,8393 ****
  /*
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! static void
! XLogReportParameters(void)
  {
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
--- 8404,8439 ----
  /*
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
+  * This function is called at three points (a backend executes a checkpoint,
+  * startup finishes, and the walwriter receives SIGHUP). The backend and the
+  * startup process write a WAL record when FPW is 'off', in addition to when
+  * any of the GUC parameters has changed.
   */
! void
! XLogReportParameters(int updatetiming)
  {
+ 	bool do_fpw_xloginsert = false;
+ 	bool fpw = fullPageWrites;
+ 
+ 	if (updatetiming == REPORT_ON_BACKEND)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		fpw = XLogCtl->Insert.fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
+ 	if (!fpw)
+ 	{
+ 		if (updatetiming <= REPORT_ON_STARTUP ||
+ 			(updatetiming == REPORT_ON_WALWRITER && XLogCtl->Insert.fullPageWrites))
+ 			do_fpw_xloginsert = true;
+ 	}
+ 
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
! 		do_fpw_xloginsert)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
***************
*** 8396,8402 ****
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
--- 8442,8448 ----
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded() || do_fpw_xloginsert)
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
***************
*** 8405,8410 ****
--- 8451,8457 ----
  			xlrec.max_prepared_xacts = max_prepared_xacts;
  			xlrec.max_locks_per_xact = max_locks_per_xact;
  			xlrec.wal_level = wal_level;
+ 			xlrec.fullPageWrites = fpw;
  
  			rdata.buffer = InvalidBuffer;
  			rdata.data = (char *) &xlrec;
***************
*** 8414,8424 ****
  			XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE, &rdata);
  		}
  
! 		ControlFile->MaxConnections = MaxConnections;
! 		ControlFile->max_prepared_xacts = max_prepared_xacts;
! 		ControlFile->max_locks_per_xact = max_locks_per_xact;
! 		ControlFile->wal_level = wal_level;
! 		UpdateControlFile();
  	}
  }
  
--- 8461,8482 ----
  			XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE, &rdata);
  		}
  
! 		if (!do_fpw_xloginsert)
! 		{
! 			ControlFile->MaxConnections = MaxConnections;
! 			ControlFile->max_prepared_xacts = max_prepared_xacts;
! 			ControlFile->max_locks_per_xact = max_locks_per_xact;
! 			ControlFile->wal_level = wal_level;
! 			UpdateControlFile();
! 		}
! 	}
! 
! 	/* publish our own FPW to shared memory if this process manages FPW */
! 	if (updatetiming >= REPORT_ON_STARTUP)
! 	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		XLogCtl->Insert.fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
***************
*** 8604,8609 ****
--- 8662,8670 ----
  	}
  	else if (info == XLOG_PARAMETER_CHANGE)
  	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
  		xl_parameter_change xlrec;
  
  		/* Update our copy of the parameters in pg_control */
***************
*** 8633,8638 ****
--- 8694,8707 ----
  		UpdateControlFile();
  		LWLockRelease(ControlFileLock);
  
+ 		/* record the LSN when FPW is false on master */
+ 		if (!xlrec.fullPageWrites)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
***************
*** 8711,8721 ****
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
--- 8780,8791 ----
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s full_page_writes=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str,
! 						 xlrec.fullPageWrites ? "true" : "false");
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
***************
*** 8933,8938 ****
--- 9003,9009 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
  	char		xlogfilename[MAXFNAMELEN];
***************
*** 9089,9094 ****
--- 9160,9183 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW is 'off' since latest CHECKPOINT.
+ 		 */
+ 		if (recovery_in_progress)
+ 		{
+ 			/* use volatile pointer to prevent code rearrangement */
+ 			volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 
+ 			if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 						 errmsg("WAL generated with full_page_writes=off was replayed since latest checkpoint")));
+ 		}
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9233,9238 ****
--- 9322,9328 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	XLogRecData rdata;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
***************
*** 9372,9377 ****
--- 9462,9483 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW is 'off' since pg_start_backup. */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					  errmsg("WAL generated with full_page_writes=off was replayed during online backup")));
+ 	}
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-15 02:00:45.000000000 +0900
***************
*** 216,221 ****
--- 216,229 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may have not received a SIGHUP then, it updates the FPW
+ 	 * when wal_level is hotstandby.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 		XLogReportParameters(REPORT_ON_WALWRITER);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 244,256 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * The walwriter manages the FPW. When the walwriter has received
+ 			 * a SIGHUP, when wal_level is hotstandby, it updates the FPW.
+ 			 */
+ 			if (XLogStandbyInfoActive())
+ 				XLogReportParameters(REPORT_ON_WALWRITER);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-15 02:00:45.000000000 +0900
***************
*** 208,213 ****
--- 208,224 ----
  } WalLevel;
  extern int	wal_level;
  
+ /*
+  * The place of updating xlog parameter.
+  * If it is backend then this means a timing for CHECKPOINT.
+  */
+ typedef enum
+ {
+ 	REPORT_ON_BACKEND = 0,
+ 	REPORT_ON_STARTUP,
+ 	REPORT_ON_WALWRITER
+ } XLogParemeterUpdate;
+ 
  #define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
  #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
  
***************
*** 306,311 ****
--- 317,323 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void XLogReportParameters(int updatetiming);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);

#49

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Jun Ishiduka (#48)

Re: Online base backup from the hot-standby

2011/10/15 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

    if (!shutdown && XLogStandbyInfoActive())
+   {
            LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+           XLogReportParameters(REPORT_ON_BACKEND);
+   }
Why doesn't the change of FPW need to be WAL-logged when a
shutdown checkpoint is performed? It would be helpful to add a comment
explaining why.

Sure. I will update the patch soon.

Done.

+ 		/*
+ 		 * The backend writes WAL of FPW at checkpoint. However, The backend do
+ 		 * not need to write WAL of FPW at checkpoint shutdown because it
+ 		 * performs when startup finishes.
+ 		 */
+ 		XLogReportParameters(REPORT_ON_BACKEND);

It is still unclear to me why that WAL record doesn't need to be written at a
shutdown checkpoint.
Anyway, the first sentence in the above comment is not right. It is not a backend
but the bgwriter that writes that WAL record at checkpoint.

The second sentence also seems not to be right. It implies that a shutdown
checkpoint is performed only at the end of startup, but one may also be performed
when a smart or fast shutdown is requested.
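
To make the second point concrete, here is a minimal sketch of the guard that the quoted hunk places around the checkpoint-time report. It assumes the condition is exactly !shutdown && XLogStandbyInfoActive(), and that the checkpoint written at the end of startup counts as a shutdown checkpoint, as the comment under review implies. The enum and function names are hypothetical stand-ins for illustration, not PostgreSQL identifiers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for why a checkpoint is running. */
typedef enum
{
	CKPT_ONLINE,				/* ordinary checkpoint on a running server */
	CKPT_END_OF_STARTUP,		/* shutdown-type checkpoint at end of startup */
	CKPT_SMART_OR_FAST_SHUTDOWN	/* shutdown checkpoint on administrator request */
} CheckpointCause;

/* Every cause except the ordinary online checkpoint is a shutdown checkpoint. */
bool
is_shutdown_checkpoint(CheckpointCause cause)
{
	assert(cause <= CKPT_SMART_OR_FAST_SHUTDOWN);
	return cause != CKPT_ONLINE;
}

/*
 * Models the guard from the hunk: the FPW report runs only when the
 * checkpoint is not a shutdown checkpoint and hot-standby WAL info is on.
 */
bool
fpw_reported_at_checkpoint(CheckpointCause cause, bool standby_info_active)
{
	return !is_shutdown_checkpoint(cause) && standby_info_active;
}
```

Both shutdown-type causes skip the report in this model, so a comment that justifies the skip only by the end-of-startup call leaves the smart or fast shutdown case unexplained.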

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#50

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Fujii Masao (#49)

Re: Online base backup from the hot-standby

+ 		/*
+ 		 * The backend writes WAL of FPW at checkpoint. However, The backend do
+ 		 * not need to write WAL of FPW at checkpoint shutdown because it
+ 		 * performs when startup finishes.
+ 		 */
+ 		XLogReportParameters(REPORT_ON_BACKEND);
I'm still unclear why that WAL doesn't need to be written at shutdown
checkpoint.
Anyway, the first sentence in the above comments is not right. Not a backend but
a bgwriter writes that WAL at checkpoint.

The second also seems not to be right. It implies that a shutdown checkpoint is
performed only at end of startup. But it may be done when smart or fast shutdown
is requested.

Okay.
I will change the comment to the following.

/*
* The bgwriter writes WAL of FPW at checkpoint. But does not at shutdown.
* Because XLogReportParameters() is always called at the end of startup
* process, it does not need to be called at shutdown.
*/

In addition, I will rename the enum value.

REPORT_ON_BACKEND -> REPORT_ON_BGWRITER
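
The rename only works if the value keeps its position, because the patch compares the timing argument by order (updatetiming <= REPORT_ON_STARTUP and updatetiming >= REPORT_ON_STARTUP). A minimal sketch of that dependency follows. The enumerator names come from the patch; the type name ReportTiming and the two helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum
{
	REPORT_ON_BGWRITER = 0,		/* was REPORT_ON_BACKEND; value unchanged */
	REPORT_ON_STARTUP,
	REPORT_ON_WALWRITER
} ReportTiming;					/* hypothetical type name for this sketch */

/*
 * Mirrors "updatetiming <= REPORT_ON_STARTUP": the bgwriter and the startup
 * process log a record whenever FPW is off, without comparing to shared memory.
 */
bool
logs_whenever_fpw_is_off(ReportTiming t)
{
	assert(t <= REPORT_ON_WALWRITER);
	return t <= REPORT_ON_STARTUP;
}

/*
 * Mirrors "updatetiming >= REPORT_ON_STARTUP": the startup process and the
 * walwriter write their own FPW value into shared memory.
 */
bool
updates_shared_fpw(ReportTiming t)
{
	assert(t <= REPORT_ON_WALWRITER);
	return t >= REPORT_ON_STARTUP;
}
```

REPORT_ON_STARTUP sits in the middle on purpose: it is the only timing for which both predicates hold, which matches the patch's "insert and update" behavior at the end of startup.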

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#51

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Jun Ishiduka (#50)

1 attachment(s)

Re: Online base backup from the hot-standby

+ 		/*
+ 		 * The backend writes WAL of FPW at checkpoint. However, The backend do
+ 		 * not need to write WAL of FPW at checkpoint shutdown because it
+ 		 * performs when startup finishes.
+ 		 */
+ 		XLogReportParameters(REPORT_ON_BACKEND);
I'm still unclear why that WAL doesn't need to be written at shutdown
checkpoint.
Anyway, the first sentence in the above comments is not right. Not a backend but
a bgwriter writes that WAL at checkpoint.

The second also seems not to be right. It implies that a shutdown checkpoint is
performed only at end of startup. But it may be done when smart or fast shutdown
is requested.

Okay.
I will change the comment to the following.

/*
* The bgwriter writes WAL of FPW at checkpoint. But does not at shutdown.
* Because XLogReportParameters() is always called at the end of startup
* process, it does not need to be called at shutdown.
*/

In addition, I will rename the enum value.

REPORT_ON_BACKEND -> REPORT_ON_BGWRITER

I have updated the patch as described above.
Please check it.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base-06fpw.patch (application/octet-stream)

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-19 02:22:09.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* the startup or the walwriter is logged to its own FPW */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* latest LSN that has recovered a WAL which fpw is 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 574,579 ****
--- 580,586 ----
  	int			max_prepared_xacts;
  	int			max_locks_per_xact;
  	int			wal_level;
+ 	bool		fullPageWrites;
  } xl_parameter_change;
  
  /* logs restore point */
***************
*** 612,618 ****
  static void SetLatestXTime(TimestampTz xtime);
  static TimestampTz GetLatestXTime(void);
  static void CheckRequiredParameterValues(void);
- static void XLogReportParameters(void);
  static void LocalSetXLogInsertAllowed(void);
  static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
  static void KeepLogSeg(XLogRecPtr recptr, uint32 *logId, uint32 *logSeg);
--- 619,624 ----
***************
*** 759,769 ****
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 765,776 ----
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes in shared-memory is on or we have a PITR request for
! 	 * it.  Since we don't yet have the insert lock, fullPageWrites or
! 	 * forcePageWrites could change under us, but we'll recheck it once we
! 	 * have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 904,915 ****
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 911,922 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just
! 	 * turned on; if we weren't already doing full-page writes then go back
! 	 * and recompute. (If it was just turned off, we could recompute the
! 	 * record without full pages, but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 6865,6870 ****
--- 6872,6885 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * The startup updates FPW in shared-memory after REDO. However, it must
+ 	 * perform before writing the WAL of the CHECKPOINT. The reason is that
+ 	 * it uses a value of fpw in shared-memory when it writes a WAL of its
+ 	 * CHECKPOINT.
+ 	 */
+ 	XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 6998,7004 ****
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters();
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
--- 7013,7019 ----
  	 * backends to write WAL.
  	 */
  	LocalSetXLogInsertAllowed();
! 	XLogReportParameters(REPORT_ON_STARTUP);
  
  	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
***************
*** 7856,7862 ****
--- 7871,7886 ----
  	 * Update checkPoint.nextXid since we have a later value
  	 */
  	if (!shutdown && XLogStandbyInfoActive())
+ 	{
  		LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+ 
+ 		/*
+ 		 * The bgwriter writes WAL of FPW at checkpoint. But does not at shutdown.
+ 		 * Because XLogReportParameters() is always called at the end of startup
+ 		 * process, it does not need to be called at shutdown.
+ 		 */
+ 		XLogReportParameters(REPORT_ON_BGWRITER);
+ 	}
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 8380,8393 ****
  /*
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
   */
! static void
! XLogReportParameters(void)
  {
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
--- 8404,8439 ----
  /*
   * Check if any of the GUC parameters that are critical for hot standby
   * have changed, and update the value in pg_control file if necessary.
+  * This function is called at three timing (backend executes checkpoint,
+  * startup finishes and walwriter receives SIGHUP). The backend and the
+  * startup writes a WAL when FPW is 'off' in addition to when any of the
+  * GUC parameters is changed.
   */
! void
! XLogReportParameters(int updatetiming)
  {
+ 	bool do_fpw_xloginsert = false;
+ 	bool fpw = fullPageWrites;
+ 
+ 	if (updatetiming == REPORT_ON_BGWRITER)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		fpw = XLogCtl->Insert.fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
+ 	if (!fpw)
+ 	{
+ 		if (updatetiming <= REPORT_ON_STARTUP ||
+ 			(updatetiming == REPORT_ON_WALWRITER && XLogCtl->Insert.fullPageWrites))
+ 			do_fpw_xloginsert = true;
+ 	}
+ 
  	if (wal_level != ControlFile->wal_level ||
  		MaxConnections != ControlFile->MaxConnections ||
  		max_prepared_xacts != ControlFile->max_prepared_xacts ||
! 		max_locks_per_xact != ControlFile->max_locks_per_xact ||
! 		do_fpw_xloginsert)
  	{
  		/*
  		 * The change in number of backend slots doesn't need to be WAL-logged
***************
*** 8396,8402 ****
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded())
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
--- 8442,8448 ----
  		 * values in pg_control either if wal_level=minimal, but seems better
  		 * to keep them up-to-date to avoid confusion.
  		 */
! 		if (wal_level != ControlFile->wal_level || XLogIsNeeded() || do_fpw_xloginsert)
  		{
  			XLogRecData rdata;
  			xl_parameter_change xlrec;
***************
*** 8405,8410 ****
--- 8451,8457 ----
  			xlrec.max_prepared_xacts = max_prepared_xacts;
  			xlrec.max_locks_per_xact = max_locks_per_xact;
  			xlrec.wal_level = wal_level;
+ 			xlrec.fullPageWrites = fpw;
  
  			rdata.buffer = InvalidBuffer;
  			rdata.data = (char *) &xlrec;
***************
*** 8414,8424 ****
  			XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE, &rdata);
  		}
  
! 		ControlFile->MaxConnections = MaxConnections;
! 		ControlFile->max_prepared_xacts = max_prepared_xacts;
! 		ControlFile->max_locks_per_xact = max_locks_per_xact;
! 		ControlFile->wal_level = wal_level;
! 		UpdateControlFile();
  	}
  }
  
--- 8461,8482 ----
  			XLogInsert(RM_XLOG_ID, XLOG_PARAMETER_CHANGE, &rdata);
  		}
  
! 		if (!do_fpw_xloginsert)
! 		{
! 			ControlFile->MaxConnections = MaxConnections;
! 			ControlFile->max_prepared_xacts = max_prepared_xacts;
! 			ControlFile->max_locks_per_xact = max_locks_per_xact;
! 			ControlFile->wal_level = wal_level;
! 			UpdateControlFile();
! 		}
! 	}
! 
! 	/* update own fpw in shared-memory when it has managed fpw */
! 	if (updatetiming >= REPORT_ON_STARTUP)
! 	{
! 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 		XLogCtl->Insert.fullPageWrites = fullPageWrites;
! 		LWLockRelease(WALInsertLock);
  	}
  }
  
***************
*** 8604,8609 ****
--- 8662,8670 ----
  	}
  	else if (info == XLOG_PARAMETER_CHANGE)
  	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
  		xl_parameter_change xlrec;
  
  		/* Update our copy of the parameters in pg_control */
***************
*** 8633,8638 ****
--- 8694,8707 ----
  		UpdateControlFile();
  		LWLockRelease(ControlFileLock);
  
+ 		/* record the LSN when FPW is false on master */
+ 		if (!xlrec.fullPageWrites)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
***************
*** 8711,8721 ****
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str);
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
--- 8780,8791 ----
  			}
  		}
  
! 		appendStringInfo(buf, "parameter change: max_connections=%d max_prepared_xacts=%d max_locks_per_xact=%d wal_level=%s full_page_writes=%s",
  						 xlrec.MaxConnections,
  						 xlrec.max_prepared_xacts,
  						 xlrec.max_locks_per_xact,
! 						 wal_level_str,
! 						 xlrec.fullPageWrites ? "true" : "false");
  	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
***************
*** 8933,8938 ****
--- 9003,9009 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
  	char		xlogfilename[MAXFNAMELEN];
***************
*** 9089,9094 ****
--- 9160,9183 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW is 'off' since latest CHECKPOINT.
+ 		 */
+ 		if (recovery_in_progress)
+ 		{
+ 			/* use volatile pointer to prevent code rearrangement */
+ 			volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 
+ 			if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 						 errmsg("WAL generated with full_page_writes=off was replayed since latest checkpoint")));
+ 		}
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9233,9238 ****
--- 9322,9328 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	XLogRecData rdata;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
***************
*** 9372,9377 ****
--- 9462,9483 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW is 'off' since pg_start_backup. */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					  errmsg("WAL generated with full_page_writes=off was replayed during online backup")));
+ 	}
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-19 02:22:09.000000000 +0900
***************
*** 216,221 ****
--- 216,229 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may have not received a SIGHUP then, it updates the FPW
+ 	 * when wal_level is hotstandby.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 		XLogReportParameters(REPORT_ON_WALWRITER);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 244,256 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * The walwriter manages the FPW. When the walwriter has received
+ 			 * a SIGHUP, when wal_level is hotstandby, it updates the FPW.
+ 			 */
+ 			if (XLogStandbyInfoActive())
+ 				XLogReportParameters(REPORT_ON_WALWRITER);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-19 02:22:09.000000000 +0900
***************
*** 208,213 ****
--- 208,224 ----
  } WalLevel;
  extern int	wal_level;
  
+ /*
+  * The place of updating xlog parameter.
+  * If it is backend then this means a timing for CHECKPOINT.
+  */
+ typedef enum
+ {
+ 	REPORT_ON_BGWRITER = 0,
+ 	REPORT_ON_STARTUP,
+ 	REPORT_ON_WALWRITER
+ } XLogParemeterUpdate;
+ 
  #define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
  #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
  
***************
*** 306,311 ****
--- 317,323 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void XLogReportParameters(int updatetiming);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);

#52

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Jun Ishiduka (#51)

1 attachment(s)

Re: Online base backup from the hot-standby

As I suggested in the reply to Simon, I think that the change of FPW
should be WAL-logged separately from that of HS parameters. ISTM
packing them in one WAL record makes XLogReportParameters()
quite confusing. Thoughts?

I have updated the patch as you suggested, so that a change of FPW is
WAL-logged separately from changes to the HS parameters.

I would like to use this patch as the base for further work if there are no
other opinions.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_09base-07fpw.patch (application/octet-stream)

diff -rcN postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql_with_9fujii_patch/src/backend/access/transam/xlog.c	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-19 07:02:07.000000000 +0900
***************
*** 364,369 ****
--- 364,372 ----
  	bool		exclusiveBackup;
  	int			nonExclusiveBackups;
  	XLogRecPtr	lastBackupStart;
+ 
+ 	/* the startup or the walwriter is logged to its own FPW */
+ 	bool		fullPageWrites;
  } XLogCtlInsert;
  
  /*
***************
*** 453,458 ****
--- 456,464 ----
  	bool		recoveryPause;
  
  	slock_t		info_lck;		/* locks shared variables shown above */
+ 
+ 	/* latest LSN that has recovered a WAL which fpw is 'off' */
+ 	XLogRecPtr	lastFpwDisabledLSN;
  } XLogCtlData;
  
  static XLogCtlData *XLogCtl = NULL;
***************
*** 759,769 ****
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 765,776 ----
  
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
! 	 * full_page_writes in shared-memory is on or we have a PITR request for
! 	 * it.  Since we don't yet have the insert lock, fullPageWrites or
! 	 * forcePageWrites could change under us, but we'll recheck it once we
! 	 * have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 904,915 ****
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 911,922 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just
! 	 * turned on; if we weren't already doing full-page writes then go back
! 	 * and recompute. (If it was just turned off, we could recompute the
! 	 * record without full pages, but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 7001,7006 ****
--- 7008,7020 ----
  	XLogReportParameters();
  
  	/*
+ 	 * The startup updates FPW after REDO. ReportFpwParameters() is called
+ 	 * here because it is not called on checkpoint after REDO.
+ 	 * It is safe because we cannot update data during the startup running.
+ 	 */
+ 	ReportFpwParameters(REPORT_ON_STARTUP);
+ 
+ 	/*
  	 * All done.  Allow backends to write WAL.	(Although the bool flag is
  	 * probably atomic in itself, we use the info_lck here to ensure that
  	 * there are no race conditions concerning visibility of other recent
***************
*** 7856,7862 ****
--- 7870,7885 ----
  	 * Update checkPoint.nextXid since we have a later value
  	 */
  	if (!shutdown && XLogStandbyInfoActive())
+ 	{
  		LogStandbySnapshot(&checkPoint.oldestActiveXid, &checkPoint.nextXid);
+ 
+ 		/*
+ 		 * The bgwriter writes WAL of FPW at checkpoint. But does not at shutdown.
+ 		 * Because ReportFpwParameters() is always called at the end of startup
+ 		 * process, it does not need to be called at shutdown.
+ 		 */
+ 		ReportFpwParameters(REPORT_ON_BGWRITER);
+ 	}
  	else
  		checkPoint.oldestActiveXid = InvalidTransactionId;
  
***************
*** 8636,8641 ****
--- 8659,8681 ----
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		bool fpw;
+ 
+ 		memcpy(&fpw, XLogRecGetData(record), sizeof(fpw));
+ 
+ 		/* record the LSN when FPW is false on master */
+ 		if (!fpw)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			xlogctl->lastFpwDisabledLSN = lsn;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 	}
  }
  
  void
***************
*** 8717,8722 ****
--- 8757,8770 ----
  						 xlrec.max_locks_per_xact,
  						 wal_level_str);
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		bool fpw;
+ 
+ 		memcpy(&fpw, rec, sizeof(fpw));
+ 		appendStringInfo(buf, "fpw change: full_page_writes=%s",
+ 						 fpw ? "true" : "false");
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
***************
*** 8933,8938 ****
--- 8981,8987 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
  	char		xlogfilename[MAXFNAMELEN];
***************
*** 9089,9094 ****
--- 9138,9161 ----
  				gotUniqueStartpoint = true;
  		} while (!gotUniqueStartpoint);
  
+ 		/*
+ 		 * check whether the master's FPW is 'off' since latest CHECKPOINT.
+ 		 */
+ 		if (recovery_in_progress)
+ 		{
+ 			/* use volatile pointer to prevent code rearrangement */
+ 			volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 
+ 			if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 				ereport(ERROR,
+ 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 						 errmsg("WAL generated with full_page_writes=off was replayed since latest checkpoint")));
+ 		}
+ 
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
***************
*** 9233,9238 ****
--- 9300,9306 ----
  	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
+ 	XLogRecPtr	lastFpwDisabledLSN;
  	XLogRecData rdata;
  	pg_time_t	stamp_time;
  	char		strfbuf[128];
***************
*** 9372,9377 ****
--- 9440,9461 ----
  						"though pg_start_backup() was executed during recovery"),
  				 errhint("The database backup will not be usable.")));
  
+ 	/* check whether the master's FPW is 'off' since pg_start_backup. */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		lastFpwDisabledLSN = xlogctl->lastFpwDisabledLSN;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, lastFpwDisabledLSN))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					  errmsg("WAL generated with full_page_writes=off was replayed during online backup")));
+ 	}
+ 
  	/*
  	 * During recovery, we don't write an end-of-backup record. We can
  	 * assume that pg_control was backed up just before pg_stop_backup()
***************
*** 10743,10745 ****
--- 10827,10879 ----
  {
  	SetLatch(&XLogCtl->recoveryWakeupLatch);
  }
+ 
+ /*
+  * Insert a WAL of XLOG_FPW_CHANGE or update to the shared memory.
+  * The called timing is at checkpoint, at the end of startup or at receiving
+  * SIGHUP on walwriter.
+  * checkpoint: insert, but not update.
+  * startup: insert and update.
+  * walwriter: insert and update if FPW was changed.
+  */
+ void
+ ReportFpwParameters(int updatetiming)
+ {
+ 	bool fpw = fullPageWrites;
+ 
+ 	if (updatetiming == REPORT_ON_BGWRITER)
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		fpw = XLogCtl->Insert.fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ 
+ 	if (updatetiming <= REPORT_ON_STARTUP ||
+ 		(updatetiming == REPORT_ON_WALWRITER && fpw != XLogCtl->Insert.fullPageWrites))
+ 	{
+ 		/*
+ 		 * insert own fpw to a WAL. However, it does not perform
+ 		 * when wal_level is not 'hotstandby' or fpw is same as shared-memory.
+ 		 */
+ 		if (XLogStandbyInfoActive())
+ 		{
+ 			XLogRecData rdata;
+ 			bool record = fullPageWrites;
+ 
+ 			rdata.buffer = InvalidBuffer;
+ 			rdata.data = (char *) &record;
+ 			rdata.len = sizeof(record);
+ 			rdata.next = NULL;
+ 
+ 			XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
+ 		}
+ 
+ 		/* update own fpw in shared-memory when it has managed fpw */
+ 		if (updatetiming >= REPORT_ON_STARTUP)
+ 		{
+ 			LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 			XLogCtl->Insert.fullPageWrites = fullPageWrites;
+ 			LWLockRelease(WALInsertLock);
+ 		}
+ 	}
+ }
diff -rcN postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql_with_9fujii_patch/src/backend/postmaster/walwriter.c	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-19 07:02:07.000000000 +0900
***************
*** 216,221 ****
--- 216,228 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * After the startup process, the walwriter manages the FPW. Because
+ 	 * the walwriter may have not received a SIGHUP then, it updates the FPW
+ 	 * when wal_level is hotstandby.
+ 	 */
+ 	ReportFpwParameters(REPORT_ON_WALWRITER);
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 243,254 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * The walwriter manages the FPW. When the walwriter has received
+ 			 * a SIGHUP, when wal_level is hotstandby, it updates the FPW.
+ 			 */
+ 			ReportFpwParameters(REPORT_ON_WALWRITER);
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql_with_9fujii_patch/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql_with_9fujii_patch/src/include/access/xlog.h	2011-10-06 06:05:45.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-19 07:02:07.000000000 +0900
***************
*** 208,213 ****
--- 208,224 ----
  } WalLevel;
  extern int	wal_level;
  
+ /*
+  * The place of updating xlog parameter.
+  * If it is backend then this means a timing for CHECKPOINT.
+  */
+ typedef enum
+ {
+ 	REPORT_ON_BGWRITER = 0,
+ 	REPORT_ON_STARTUP,
+ 	REPORT_ON_WALWRITER
+ } XLogParemeterUpdate;
+ 
  #define XLogArchivingActive()	(XLogArchiveMode && wal_level >= WAL_LEVEL_ARCHIVE)
  #define XLogArchiveCommandSet() (XLogArchiveCommand[0] != '\0')
  
***************
*** 316,321 ****
--- 327,333 ----
  extern void StartupProcessMain(void);
  extern bool CheckPromoteSignal(void);
  extern void WakeupRecovery(void);
+ extern void ReportFpwParameters(int updatetiming);
  
  /*
   * Starting/stopping a base backup
diff -rcN postgresql_with_9fujii_patch/src/include/catalog/pg_control.h postgresql_with_patch/src/include/catalog/pg_control.h
*** postgresql_with_9fujii_patch/src/include/catalog/pg_control.h	2011-10-06 06:06:19.000000000 +0900
--- postgresql_with_patch/src/include/catalog/pg_control.h	2011-10-19 07:02:07.000000000 +0900
***************
*** 60,65 ****
--- 60,66 ----
  #define XLOG_BACKUP_END					0x50
  #define XLOG_PARAMETER_CHANGE			0x60
  #define XLOG_RESTORE_POINT				0x70
+ #define XLOG_FPW_CHANGE					0x80
  
  
  /*

#53

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Jun Ishiduka (#52)

1 attachment(s)

Re: Online base backup from the hot-standby

2011/10/19 Jun Ishiduka <ishizuka.jun@po.ntts.co.jp>:

As I suggested in the reply to Simon, I think that the change of FPW
should be WAL-logged separately from that of HS parameters. ISTM
packing them in one WAL record makes XLogReportParameters()
quite confusing. Thoughts?

I updated the patch as you suggested (so that the change of FPW
is WAL-logged separately from that of HS parameters).

I would like to use this patch as the base if there are no other opinions.

Thanks for updating the patch!

Attached is the updated version of the patch. I merged your patch into
standby_online_backup_09_fujii.patch, refactored the code, fixed some
bugs, added lots of source code comments, but didn't change the basic
design that you proposed.

In your patch, FPW is always WAL-logged at startup, even when FPW has
not been changed since the last shutdown. I don't think that's required.
I changed the recovery code so that it keeps track of the last FPW value
indicated by the replayed WAL records. Then, at the end of startup, if that
value is equal to the FPW specified in postgresql.conf (which means that FPW
has not been changed since the last shutdown or crash), WAL-logging of FPW
is skipped. This change prevents unnecessary WAL-logging. Thoughts?
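
To make the skip logic above concrete, here is a minimal standalone sketch.
It is not the code in the attached patch: FpwState and the fpw_* functions
are hypothetical stand-ins for lastFullPageWrites, the shared-memory
full_page_writes flag and UpdateFullPageWrites(), and "writing a record" is
modelled only as a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant recovery state (not PostgreSQL's). */
typedef struct
{
	bool		last_replayed_fpw;	/* FPW indicated by replayed WAL */
	bool		shared_fpw;			/* shared-memory copy used by backends */
	int			fpw_records_written;	/* models XLOG_FPW_CHANGE inserts */
} FpwState;

/* Start of recovery: seed the tracked value from the checkpoint record. */
static void
fpw_init_from_checkpoint(FpwState *st, bool checkpoint_fpw)
{
	assert(st != NULL);
	st->last_replayed_fpw = checkpoint_fpw;
	st->shared_fpw = checkpoint_fpw;
	st->fpw_records_written = 0;
}

/* Replay of an FPW-change record: remember the value it carries. */
static void
fpw_replay_change(FpwState *st, bool fpw)
{
	assert(st != NULL);
	st->last_replayed_fpw = fpw;
}

/*
 * End of startup: compare the configured value with the last replayed one.
 * Returns true if a change record had to be written, false if it was skipped.
 */
static bool
fpw_end_of_startup(FpwState *st, bool configured_fpw)
{
	assert(st != NULL);
	st->shared_fpw = st->last_replayed_fpw;

	if (configured_fpw == st->shared_fpw)
		return false;			/* unchanged since shutdown/crash: skip */

	st->fpw_records_written++;	/* models writing the change record */
	st->shared_fpw = configured_fpw;
	return true;
}
```

In this sketch, a server that shut down with FPW on and restarts with FPW on
writes nothing, while one whose replayed WAL last indicated FPW off writes
exactly one record when it comes up with FPW on.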

Is the patch well-formed enough to mark as ready-for-committer? It would
be very helpful if you could review the patch.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

standby_online_backup_10_fujii.patch (text/x-diff; charset=US-ASCII)

*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 939,944 **** SELECT pg_stop_backup();
--- 939,1004 ----
     </para>
    </sect2>
  
+   <sect2 id="backup-from-standby">
+    <title>Making a Base Backup from Standby Database</title>
+ 
+    <para>
+     It's possible to make a base backup during recovery. Which allows a user
+     to take a base backup from the standby to offload the expense of
+     periodic backups from the master. Its procedure is similar to that
+     during normal running. All these steps must be performed on the standby.
+   <orderedlist>
+    <listitem>
+     <para>
+      Ensure that hot standby is enabled (see <xref linkend="hot-standby">
+      for more information).
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Connect to the database as a superuser and execute <function>pg_start_backup</>.
+      This performs a restartpoint if there is at least one checkpoint record
+      replayed since last restartpoint.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Perform a file system backup.
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Copy the pg_control file from the cluster directory to the global
+      sub-directory of the backup. For example:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+     </para>
+    </listitem>
+    <listitem>
+     <para>
+      Again connect to the database as a superuser, and execute
+      <function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+     </para>
+    </listitem>
+   </orderedlist>
+    </para>
+ 
+    <para>
+     You cannot use the <application>pg_basebackup</> tool to take the backup
+     from the standby.
+    </para>
+    <para>
+     It's not possible to make a base backup from the server in recovery mode
+     when reading WAL written during a period when <varname>full_page_writes</>
+     was disabled. If you want to take a base backup from the standby,
+     <varname>full_page_writes</> must be set to true on the master.
+    </para>
+   </sect2>
+ 
    <sect2 id="backup-pitr-recovery">
     <title>Recovering Using a Continuous Archive Backup</title>
  
*** a/doc/src/sgml/config.sgml
--- b/doc/src/sgml/config.sgml
***************
*** 1682,1687 **** SET ENABLE_SEQSCAN TO OFF;
--- 1682,1695 ----
         </para>
  
         <para>
+         WAL written while <varname>full_page_writes</> is disabled does not
+         contain enough information to make a base backup during recovery
+         (see <xref linkend="backup-from-standby">),
+         so <varname>full_page_writes</> must be enabled on the master
+         to take a backup from the standby.
+        </para>
+ 
+        <para>
          This parameter can only be set in the <filename>postgresql.conf</>
          file or on the server command line.
          The default is <literal>on</>.
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 14034,14040 **** SELECT set_config('log_statement_stats', 'off', false);
     <para>
      The functions shown in <xref
      linkend="functions-admin-backup-table"> assist in making on-line backups.
!     These functions cannot be executed during recovery.
     </para>
  
     <table id="functions-admin-backup-table">
--- 14034,14041 ----
     <para>
      The functions shown in <xref
      linkend="functions-admin-backup-table"> assist in making on-line backups.
!     These functions except <function>pg_start_backup</> and <function>pg_stop_backup</>
!     cannot be executed during recovery.
     </para>
  
     <table id="functions-admin-backup-table">
***************
*** 14114,14120 **** SELECT set_config('log_statement_stats', 'off', false);
      database cluster's data directory, performs a checkpoint,
      and then returns the backup's starting transaction log location as text.
      The user can ignore this result value, but it is
!     provided in case it is useful.
  <programlisting>
  postgres=# select pg_start_backup('label_goes_here');
   pg_start_backup
--- 14115,14123 ----
      database cluster's data directory, performs a checkpoint,
      and then returns the backup's starting transaction log location as text.
      The user can ignore this result value, but it is
!     provided in case it is useful. If <function>pg_start_backup</> is
!     executed during recovery, it performs a restartpoint rather than
!     writing a new checkpoint.
  <programlisting>
  postgres=# select pg_start_backup('label_goes_here');
   pg_start_backup
***************
*** 14142,14147 **** postgres=# select pg_start_backup('label_goes_here');
--- 14145,14157 ----
     </para>
  
     <para>
+     If <function>pg_stop_backup</> is executed during recovery, it just
+     removes the label file, but doesn't create a backup history file and wait for
+     the ending transaction log file to be archived. The return value is equal to
+     or bigger than the exact backup's ending transaction log location.
+    </para>
+ 
+    <para>
      <function>pg_switch_xlog</> moves to the next transaction log file, allowing the
      current file to be archived (assuming you are using continuous archiving).
      The return value is the ending transaction log location + 1 within the just-completed transaction log file.
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 158,163 **** HotStandbyState standbyState = STANDBY_DISABLED;
--- 158,174 ----
  static XLogRecPtr LastRec;
  
  /*
+  * During recovery, lastFullPageWrites keeps track of full_page_writes that
+  * the replayed WAL records indicate. It's initialized with full_page_writes
+  * that the recovery starting checkpoint record indicates, and then updated
+  * each time XLOG_FPW_CHANGE record is replayed. At the end of startup,
+  * if it's equal to full_page_writes in postgresql.conf, which means that
+  * full_page_writes has not been changed since last shutdown or crash, so
+  * in this case we skip writing an XLOG_FPW_CHANGE record.
+  */
+ static bool lastFullPageWrites;
+ 
+ /*
   * Local copy of SharedRecoveryInProgress variable. True actually means "not
   * known, need to check the shared state".
   */
***************
*** 356,361 **** typedef struct XLogCtlInsert
--- 367,382 ----
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
  	/*
+ 	 * fullPageWrites is shared-memory copy of walwriter's or startup
+ 	 * process' full_page_writes. All backends use this flag to determine
+ 	 * whether to write full-page to WAL, instead of using process-local
+ 	 * one. This is required because, when full_page_writes is changed
+ 	 * by SIGHUP, we must WAL-log it before it actually affects
+ 	 * WAL-logging by backends.
+ 	 */
+ 	bool		fullPageWrites;
+ 
+ 	/*
  	 * exclusiveBackup is true if a backup started with pg_start_backup() is
  	 * in progress, and nonExclusiveBackups is a counter indicating the number
  	 * of streaming base backups currently in progress. forcePageWrites is set
***************
*** 453,458 **** typedef struct XLogCtlData
--- 474,485 ----
  	/* Are we requested to pause recovery? */
  	bool		recoveryPause;
  
+ 	/*
+ 	 * lastFpwDisableRecPtr points to the start of the last replayed
+ 	 * XLOG_FPW_CHANGE record that instructs full_page_writes is disabled.
+ 	 */
+ 	XLogRecPtr	lastFpwDisableRecPtr;
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 665,671 **** static void xlog_outrec(StringInfo buf, XLogRecord *record);
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 692,698 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, bool *backupDuringRecovery);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 710,715 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 737,743 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		fpwChange = (rmid == RM_XLOG_ID && info == XLOG_FPW_CHANGE);
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 761,770 **** begin:;
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 789,798 ----
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, fullPageWrites and forcePageWrites
! 	 * could change under us, but we'll recheck them once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 905,916 **** begin:;
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 933,944 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 1224,1229 **** begin:;
--- 1252,1266 ----
  		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
+ 	/*
+ 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
+ 	 * in shared memory before releasing WALInsertLock. This ensures that
+ 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
+ 	 * by this parameter change.
+ 	 */
+ 	if (fpwChange)
+ 		Insert->fullPageWrites = fullPageWrites;
+ 
  	LWLockRelease(WALInsertLock);
  
  	if (updrqst)
***************
*** 5155,5160 **** BootStrapXLOG(void)
--- 5192,5198 ----
  	checkPoint.redo.xlogid = 0;
  	checkPoint.redo.xrecoff = XLogSegSize + SizeOfXLogLongPHD;
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = fullPageWrites;
  	checkPoint.nextXidEpoch = 0;
  	checkPoint.nextXid = FirstNormalTransactionId;
  	checkPoint.nextOid = FirstBootstrapObjectId;
***************
*** 6025,6030 **** StartupXLOG(void)
--- 6063,6070 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	bool		backupDuringRecovery = false;
+ 	DBState	save_state;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6158,6164 **** StartupXLOG(void)
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6198,6205 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
! 						  &backupDuringRecovery))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6274,6279 **** StartupXLOG(void)
--- 6315,6322 ----
  	 */
  	ThisTimeLineID = checkPoint.ThisTimeLineID;
  
+ 	lastFullPageWrites = checkPoint.fullPageWrites;
+ 
  	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
***************
*** 6314,6319 **** StartupXLOG(void)
--- 6357,6363 ----
  		 * pg_control with any minimum recovery stop point obtained from a
  		 * backup history file.
  		 */
+ 		save_state = ControlFile->state;
  		if (InArchiveRecovery)
  			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
  		else
***************
*** 6334,6345 **** StartupXLOG(void)
  		}
  
  		/*
! 		 * set backupStartPoint if we're starting recovery from a base backup
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
--- 6378,6411 ----
  		}
  
  		/*
! 		 * Set backupStartPoint if we're starting recovery from a base backup.
! 		 *
! 		 * Set backupEndPoint if we're starting recovery from a base backup
! 		 * which was taken from the server in recovery mode. We confirm
! 		 * that minRecoveryPoint can be used as the backup end location by
! 		 * checking whether the database system status in pg_control indicates
! 		 * DB_IN_ARCHIVE_RECOVERY. If minRecoveryPoint is not available,
! 		 * there is no way to know the backup end location, so we cannot
! 		 * advance recovery any more. In this case, we have to cancel recovery
! 		 * before changing the database system status in pg_control to
! 		 * DB_IN_ARCHIVE_RECOVERY because otherwise subsequent
! 		 * restarted recovery would go through this check wrongly.
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
+ 
+ 			if (backupDuringRecovery)
+ 			{
+ 				if (save_state != DB_IN_ARCHIVE_RECOVERY)
+ 					ereport(FATAL,
+ 							(errmsg("database system status mismatches between "
+ 									"pg_control and backup_label"),
+ 							 errhint("This means that the backup is corrupted and you will "
+ 									 "have to use another backup for recovery.")));
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 			}
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
***************
*** 6625,6630 **** StartupXLOG(void)
--- 6691,6718 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 					XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 				{
+ 					/*
+ 					 * We have reached the end of base backup, the point where
+ 					 * the minimum recovery point in pg_control which was
+ 					 * backed up just before pg_stop_backup() indicates.
+ 					 * The data on disk is now consistent. Reset backupStartPoint
+ 					 * and backupEndPoint.
+ 					 */
+ 					elog(DEBUG1, "end of backup reached");
+ 
+ 					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 					MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 					MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 					ControlFile->backupEndRequired = false;
+ 					UpdateControlFile();
+ 
+ 					LWLockRelease(ControlFileLock);
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 6824,6829 **** StartupXLOG(void)
--- 6912,6933 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * Update full_page_writes in shared memory and write an
+ 	 * XLOG_FPW_CHANGE record before resource manager writes cleanup
+ 	 * WAL records or checkpoint record is written.
+ 	 *
+ 	 * Note that full_page_writes in shared memory is initialized with
+ 	 * lastFullPageWrites so that UpdateFullPageWrites() can check whether
+ 	 * it's equal to full_page_writes specified in postgresql.conf (i.e., whether
+ 	 * full_page_writes has been changed since last shutdown or crash) and
+ 	 * then skip writing an XLOG_FPW_CHANGE record if not.
+ 	 */
+ 	Insert->fullPageWrites = lastFullPageWrites;
+ 	LocalSetXLogInsertAllowed();
+ 	UpdateFullPageWrites();
+ 	LocalXLogInsertAllowed = -1;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 7681,7686 **** CreateCheckPoint(int flags)
--- 7785,7791 ----
  		LocalSetXLogInsertAllowed();
  
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = Insert->fullPageWrites;
  
  	/*
  	 * Compute new REDO record ptr = location of next XLOG record.
***************
*** 8382,8387 **** XLogReportParameters(void)
--- 8487,8534 ----
  }
  
  /*
+  * Update full_page_writes in shared memory, and write an
+  * XLOG_FPW_CHANGE record if necessary.
+  */
+ void
+ UpdateFullPageWrites(void)
+ {
+ 	XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 
+ 	/*
+ 	 * Do nothing if full_page_writes has not been changed.
+ 	 *
+ 	 * It's safe to check the shared full_page_writes without the lock,
+ 	 * because we can guarantee that there is no concurrently running
+ 	 * process which can update it.
+ 	 */
+ 	if (fullPageWrites == Insert->fullPageWrites)
+ 		return;
+ 
+ 	/*
+ 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
+ 	 * track of full_page_writes during archive recovery, if required.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 	{
+ 		XLogRecData	rdata;
+ 
+ 		rdata.data = (char *) (&fullPageWrites);
+ 		rdata.len = sizeof(bool);
+ 		rdata.buffer = InvalidBuffer;
+ 		rdata.next = NULL;
+ 
+ 		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
+ 	}
+ 	else
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		Insert->fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ }
+ 
+ /*
   * XLOG resource manager's routines
   *
   * Definitions of info values are in include/catalog/pg_control.h, though
***************
*** 8425,8431 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8572,8579 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8594,8599 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 8742,8771 ----
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, XLogRecGetData(record), sizeof(bool));
+ 
+ 		/*
+ 		 * Update the LSN of the last replayed XLOG_FPW_CHANGE record
+ 		 * so that pg_start_backup() and pg_stop_backup() can check
+ 		 * whether full_page_writes has been disabled during online backup.
+ 		 */
+ 		if (!fpw)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			if (XLByteLT(xlogctl->lastFpwDisableRecPtr, ReadRecPtr))
+ 				xlogctl->lastFpwDisableRecPtr = ReadRecPtr;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
+ 		/* Keep track of full_page_writes */
+ 		lastFullPageWrites = fpw;
+ 	}
  }
  
  void
***************
*** 8607,8616 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
--- 8779,8789 ----
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
+ 						 checkpoint->fullPageWrites ? "true" : "false",
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
***************
*** 8675,8680 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 8848,8860 ----
  						 xlrec.max_locks_per_xact,
  						 wal_level_str);
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, rec, sizeof(bool));
+ 		appendStringInfo(buf, "full_page_writes: %s", fpw ? "true" : "false");
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
***************
*** 8888,8893 **** XLogRecPtr
--- 9068,9074 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8899,8916 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
! 
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9080,9099 ----
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
+ 	recovery_in_progress = RecoveryInProgress();
+ 
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level. Because the fact that
! 	 * we are now executing pg_start_backup() during recovery means that
! 	 * wal_level is set to hot_standby on the master, i.e., WAL level is sufficient
! 	 * for making an online backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8932,8939 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 9115,9127 ----
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
+ 	 *
+ 	 * During recovery, we skip forcing XLOG file switch, which means that
+ 	 * the backup taken during recovery is not available for the special recovery
+ 	 * case described above.
  	 */
! 	if (!recovery_in_progress)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8949,8954 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9137,9145 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
+ 	 * Note that forcePageWrites has no effect during an online backup from
+ 	 * the server in recovery mode.
+ 	 *
  	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
***************
*** 8977,8988 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
--- 9168,9189 ----
  
  		do
  		{
+ 			bool		checkpointfpw;
+ 
  			/*
! 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
+ 			 * During recovery, establish a restartpoint if possible. We use the last
+ 			 * restartpoint as the backup starting checkpoint. This means that two
+ 			 * successive backup runs can have same checkpoint positions.
+ 			 *
+ 			 * Since the fact that we are executing pg_start_backup() during
+ 			 * recovery means that bgwriter is running, we can use
+ 			 * RequestCheckpoint() to establish a restartpoint.
+ 			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
***************
*** 8998,9005 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9199,9238 ----
  			LWLockAcquire(ControlFileLock, LW_SHARED);
  			checkpointloc = ControlFile->checkPoint;
  			startpoint = ControlFile->checkPointCopy.redo;
+ 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
  			LWLockRelease(ControlFileLock);
  
+ 			if (recovery_in_progress)
+ 			{
+ 				/* use volatile pointer to prevent code rearrangement */
+ 				volatile XLogCtlData *xlogctl = XLogCtl;
+ 				XLogRecPtr		recptr;
+ 
+ 				/*
+ 				 * Check to see if all WAL replayed during online backup (i.e.,
+ 				 * since last restartpoint used as backup starting checkpoint)
+ 				 * contain full-page writes.
+ 				 */
+ 				SpinLockAcquire(&xlogctl->info_lck);
+ 				recptr = xlogctl->lastFpwDisableRecPtr;
+ 				SpinLockRelease(&xlogctl->info_lck);
+ 
+ 				if (!checkpointfpw || XLByteLE(startpoint, recptr))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 							 errmsg("WAL generated with full_page_writes=off was replayed "
+ 									"since last restartpoint")));
+ 
+ 				/*
+ 				 * During recovery, since we don't use the end-of-backup WAL
+ 				 * record and don't write the backup history file, the starting WAL
+ 				 * location doesn't need to be unique. This means that two base
+ 				 * backups started at the same time might use the same checkpoint
+ 				 * as starting locations.
+ 				 */
+ 				gotUniqueStartpoint = true;
+ 			}
+ 
  			/*
  			 * If two base backups are started at the same time (in WAL sender
  			 * processes), we need to make sure that they use different
***************
*** 9039,9044 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9272,9279 ----
  						 checkpointloc.xlogid, checkpointloc.xrecoff);
  		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  						 exclusive ? "pg_start_backup" : "streamed");
+ 		appendStringInfo(&labelfbuf, "SYSTEM STATUS: %s\n",
+ 						 recovery_in_progress ? "recovery" : "in production");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
***************
*** 9133,9138 **** pg_start_backup_callback(int code, Datum arg)
--- 9368,9375 ----
   * history file at the beginning of archive recovery, but we now use the WAL
   * record for that and the file is for informational and debug purposes only.
   *
+  * During recovery, we only remove the backup label file.
+  *
   * Note: different from CancelBackup which just cancels online backup mode.
   */
  Datum
***************
*** 9159,9164 **** XLogRecPtr
--- 9396,9402 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9169,9174 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9407,9413 ----
  	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
+ 	char		systemstatus[20];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
***************
*** 9178,9196 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	if (RecoveryInProgress())
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("recovery is in progress"),
! 				 errhint("WAL control functions cannot be executed during recovery.")));
! 
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9417,9438 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	char	   *ptr;
+ 
+ 	recovery_in_progress = RecoveryInProgress();
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level. Because the fact that
! 	 * we are now executing pg_stop_backup() means that wal_level is set to
! 	 * hot_standby on the master, i.e., WAL level is sufficient for making an online
! 	 * backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9281,9286 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9523,9599 ----
  	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
  
  	/*
+ 	 * Parse the SYSTEM STATUS line, and check that database system
+ 	 * status matches between pg_start_backup() and pg_stop_backup().
+ 	 */
+ 	ptr = strstr(remaining, "SYSTEM STATUS:");
+ 	if (ptr == NULL || sscanf(ptr, "SYSTEM STATUS: %19s\n", systemstatus) != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	if (strcmp(systemstatus, "recovery") == 0 && !recovery_in_progress)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("pg_stop_backup() was executed during normal processing "
+ 						"though pg_start_backup() was executed during recovery"),
+ 				 errhint("The database backup will not be usable.")));
+ 
+ 	/*
+ 	 * During recovery, we don't write an end-of-backup record. We can
+ 	 * assume that pg_control was backed up just before pg_stop_backup()
+ 	 * and its minimum recovery point can be available as the backup end
+ 	 * location. Without an end-of-backup record, we can check correctly
+ 	 * whether we've reached the end of backup when starting recovery
+ 	 * from this backup.
+ 	 *
+ 	 * We don't force a switch to new WAL file and wait for all the required
+ 	 * files to be archived. This is okay if we use the backup to start
+ 	 * the standby. But, if it's for an archive recovery, to ensure all the
+ 	 * required files are available, a user should wait for them to be archived,
+ 	 * or include them into the backup after pg_stop_backup().
+ 	 *
+ 	 * We return the current minimum recovery point as the backup end
+ 	 * location. Note that it would be bigger than the exact backup end
+ 	 * location if the minimum recovery point is updated since the backup
+ 	 * of pg_control. The return value of pg_stop_backup() is often used
+ 	 * for a user to calculate the required files. Returning approximate
+ 	 * location is harmless for that use because it's guaranteed not to be
+ 	 * smaller than the exact backup end location.
+ 	 *
+ 	 * XXX currently a backup history file is for informational and debug
+ 	 * purposes only. It's not essential for an online backup. Furthermore,
+ 	 * even if it's created, it will not be archived during recovery because
+ 	 * an archiver is not invoked. So it doesn't seem worthwhile to write
+ 	 * a backup history file during recovery.
+ 	 */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		XLogRecPtr	recptr;
+ 
+ 		/*
+ 		 * Check to see if all WAL replayed during online backup contain
+ 		 * full-page writes.
+ 		 */
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		recptr = xlogctl->lastFpwDisableRecPtr;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, recptr))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("WAL generated with full_page_writes=off was replayed "
+ 							"during online backup")));
+ 
+ 		LWLockAcquire(ControlFileLock, LW_SHARED);
+ 		stoppoint = ControlFile->minRecoveryPoint;
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		return stoppoint;
+ 	}
+ 
+ 	/*
  	 * Write the backup-end xlog record
  	 */
  	rdata.data = (char *) (&startpoint);
***************
*** 9797,9814 **** pg_xlogfile_name(PG_FUNCTION_ARGS)
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
  	/*
  	 * See if label file is present
--- 10110,10131 ----
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE. If this backup_label
!  * was created during recovery, *backupDuringRecovery is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
! 				  bool *backupDuringRecovery)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		systemstatus[20];
  
  	*backupEndRequired = false;
+ 	*backupDuringRecovery = false;
  
  	/*
  	 * See if label file is present
***************
*** 9842,9857 **** read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD line is new in 9.1. We can't restore from an older backup
! 	 * anyway, but since the information on it is not strictly required, don't
! 	 * error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
--- 10159,10180 ----
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD and SYSTEM STATUS lines are new in 9.2. We can't
! 	 * restore from an older backup anyway, but since the information on it
! 	 * is not strictly required, don't error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
+ 	if (fscanf(lfp, "SYSTEM STATUS: %19s\n", systemstatus) == 1)
+ 	{
+ 		if (strcmp(systemstatus, "recovery") == 0)
+ 			*backupDuringRecovery = true;
+ 	}
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 289,294 **** typedef enum
--- 289,296 ----
  static PMState pmState = PM_INIT;
  
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ static bool OnlineBackupAllowed = false;		/* T if we've reached PM_RUN or
+ 												 * PM_HOT_STANDBY */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
***************
*** 2119,2136 **** pmdie(SIGNAL_ARGS)
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 
! 				/*
! 				 * If we're in recovery, we can't kill the startup process
! 				 * right away, because at present doing so does not release
! 				 * its locks.  We might want to change this in a future
! 				 * release.  For the time being, the PM_WAIT_READONLY state
! 				 * indicates that we're waiting for the regular (read only)
! 				 * backends to die off; once they do, we'll kill the startup
! 				 * and walreceiver processes.
! 				 */
! 				pmState = (pmState == PM_RUN) ?
! 					PM_WAIT_BACKUP : PM_WAIT_READONLY;
  			}
  
  			/*
--- 2121,2127 ----
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 				pmState = PM_WAIT_BACKUP;
  			}
  
  			/*
***************
*** 2313,2318 **** reaper(SIGNAL_ARGS)
--- 2304,2310 ----
  			 */
  			FatalError = false;
  			ReachedNormalRunning = true;
+ 			OnlineBackupAllowed = true;
  			pmState = PM_RUN;
  
  			/*
***************
*** 2854,2862 **** PostmasterStateMachine(void)
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 			pmState = PM_WAIT_BACKENDS;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
--- 2846,2862 ----
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
+ 		 *
+ 		 * If we're in recovery, we can't kill the startup process right away,
+ 		 * because at present doing so does not release its locks.  We might
+ 		 * want to change this in a future release.  For the time being,
+ 		 * the PM_WAIT_READONLY state indicates that we're waiting for
+ 		 * the regular (read only) backends to die off; once they do,
+ 		 * we'll kill the startup and walreceiver processes.
  		 */
  		if (!BackupInProgress())
! 			pmState = ReachedNormalRunning ?
! 				PM_WAIT_BACKENDS : PM_WAIT_READONLY;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
***************
*** 3025,3037 **** PostmasterStateMachine(void)
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
--- 3025,3037 ----
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running and hot standby, it only makes sense to do this
! 			 * if we reached normal running or hot standby. If we have not
! 			 * reached a consistent recovery state yet, the backup file is
! 			 * one we're recovering *from*, and we must keep it around
! 			 * so that recovery restarts from the right place.
  			 */
! 			if (OnlineBackupAllowed)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
***************
*** 4188,4193 **** sigusr1_handler(SIGNAL_ARGS)
--- 4188,4194 ----
  		ereport(LOG,
  		(errmsg("database system is ready to accept read only connections")));
  
+ 		OnlineBackupAllowed = true;
  		pmState = PM_HOT_STANDBY;
  	}
  
*** a/src/backend/postmaster/walwriter.c
--- b/src/backend/postmaster/walwriter.c
***************
*** 216,221 **** WalWriterMain(void)
--- 216,228 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * There is a race condition: full_page_writes might have been changed
+ 	 * since the startup process had updated it in shared memory. To handle
+ 	 * this case, we always update shared full_page_writes here.
+ 	 */
+ 	UpdateFullPageWrites();
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 **** WalWriterMain(void)
--- 243,254 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * If full_page_writes has been changed by SIGHUP, we update it
+ 			 * in shared memory and write an XLOG_FPW_CHANGE record.
+ 			 */
+ 			UpdateFullPageWrites();
  		}
  		if (shutdown_requested)
  		{
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 130,136 **** extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
  extern bool synchronize_seqscans;
- extern bool fullPageWrites;
  extern int	ssl_renegotiation_limit;
  extern char *SSLCipherSuites;
  
--- 130,135 ----
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 209,214 **** main(int argc, char *argv[])
--- 209,216 ----
  		   ControlFile.checkPointCopy.redo.xrecoff);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 232,237 **** main(int argc, char *argv[])
--- 234,242 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
  	printf(_("Current wal_level setting:            %s\n"),
*** a/src/bin/pg_ctl/pg_ctl.c
--- b/src/bin/pg_ctl/pg_ctl.c
***************
*** 885,899 **** do_stop(void)
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 885,902 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connection has not been
! 		 * allowed yet, an online backup mode must not be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
***************
*** 973,987 **** do_restart(void)
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 976,993 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connection has not been
! 		 * allowed yet, an online backup mode must not be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 489,494 **** GuessControlValues(void)
--- 489,495 ----
  	ControlFile.checkPointCopy.redo.xlogid = 0;
  	ControlFile.checkPointCopy.redo.xrecoff = SizeOfXLogLongPHD;
  	ControlFile.checkPointCopy.ThisTimeLineID = 1;
+ 	ControlFile.checkPointCopy.fullPageWrites = false;
  	ControlFile.checkPointCopy.nextXidEpoch = 0;
  	ControlFile.checkPointCopy.nextXid = FirstNormalTransactionId;
  	ControlFile.checkPointCopy.nextOid = FirstBootstrapObjectId;
***************
*** 503,509 **** GuessControlValues(void)
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 504,510 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 569,574 **** PrintControlValues(bool guessed)
--- 570,577 ----
  		   sysident_str);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 637,642 **** RewriteControlFile(void)
--- 640,647 ----
  	ControlFile.minRecoveryPoint.xrecoff = 0;
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
  
  	/*
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 197,202 **** extern int	XLogArchiveTimeout;
--- 197,203 ----
  extern bool XLogArchiveMode;
  extern char *XLogArchiveCommand;
  extern bool EnableHotStandby;
+ extern bool fullPageWrites;
  extern bool log_checkpoints;
  
  /* WAL levels */
***************
*** 306,311 **** extern void CreateCheckPoint(int flags);
--- 307,313 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void UpdateFullPageWrites(void);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 21,27 ****
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	921
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
--- 21,27 ----
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	922
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
***************
*** 33,38 **** typedef struct CheckPoint
--- 33,39 ----
  	XLogRecPtr	redo;			/* next RecPtr available when we began to
  								 * create CheckPoint (i.e. REDO start point) */
  	TimeLineID	ThisTimeLineID; /* current TLI */
+ 	bool			fullPageWrites;	/* current full_page_writes */
  	uint32		nextXidEpoch;	/* higher-order bits of nextXid */
  	TransactionId nextXid;		/* next free XID */
  	Oid			nextOid;		/* next free OID */
***************
*** 60,65 **** typedef struct CheckPoint
--- 61,67 ----
  #define XLOG_BACKUP_END					0x50
  #define XLOG_PARAMETER_CHANGE			0x60
  #define XLOG_RESTORE_POINT				0x70
+ #define XLOG_FPW_CHANGE				0x80
  
  
  /*
***************
*** 138,143 **** typedef struct ControlFileData
--- 140,152 ----
  	 * record, to make sure the end-of-backup record corresponds the base
  	 * backup we're recovering from.
  	 *
+ 	 * backupEndPoint is the backup end location, if we are recovering from
+ 	 * an online backup which was taken from the server in recovery mode
+ 	 * and haven't reached the end of backup yet. It is initialized to
+ 	 * the minimum recovery point in pg_control which was backed up just
+ 	 * before pg_stop_backup(). It is reset to zero when the end of backup
+ 	 * is reached, and we mustn't start up before that.
+ 	 *
  	 * If backupEndRequired is true, we know for sure that we're restoring
  	 * from a backup, and must see a backup-end record before we can safely
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
***************
*** 146,151 **** typedef struct ControlFileData
--- 155,161 ----
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
+ 	XLogRecPtr	backupEndPoint;
  	bool		backupEndRequired;
  
  	/*

#54

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 14 years ago

In reply to: Fujii Masao (#53)

Re: Online base backup from the hot-standby

On 24.10.2011 15:29, Fujii Masao wrote:

+    <listitem>
+     <para>
+      Copy the pg_control file from the cluster directory to the global
+      sub-directory of the backup. For example:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+     </para>
+    </listitem>

Why is this step required? The control file is overwritten by
information from the backup_label anyway, no?

+    <listitem>
+     <para>
+      Again connect to the database as a superuser, and execute
+      <function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+     </para>
+    </listitem>

How do you ensure that all the required WAL segments have been archived,
then?

+   </orderedlist>
+    </para>
+
+    <para>
+     You cannot use the <application>pg_basebackup</> tool to take the backup
+     from the standby.
+    </para>

Why not? We have cascading replication now.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#55

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 14 years ago

In reply to: Fujii Masao (#53)

Re: Online base backup from the hot-standby

On 24.10.2011 15:29, Fujii Masao wrote:

In your patch, FPW is always WAL-logged at startup even when FPW has
not been changed since last shutdown. I don't think that's required.
I changed the recovery code so that it keeps track of last FPW indicated
by WAL record. Then, at end of startup, if that FPW is equal to FPW
specified in postgresql.conf (which means that FPW has not been changed
since last shutdown or crash), WAL-logging of FPW is skipped. This change
prevents unnecessary WAL-logging. Thoughts?

One problem with this whole FPW-tracking is that pg_lesslog makes it
fail. I'm not sure what we need to do about that - maybe just add a
warning to the docs. But it leaves a bit of a bad taste in my mouth.
Usually we try to make features work orthogonally, without dependencies
on other settings. Now this feature requires that full_page_writes is
turned on in the master, and also that you don't use pg_lesslog to
compress the WAL segments or your base backup might be corrupt. The
procedure to take a backup from the standby seems more complicated than
taking it on the master - there are more steps to follow.
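
The full_page_writes dependency described above corresponds to a check the patch adds to do_pg_start_backup() during recovery. The following is only a minimal sketch of that condition: the `Lsn` struct and the names `lsn_le` and `standby_backup_allowed` are hypothetical stand-ins for XLogRecPtr, XLByteLE() and the inline test in the patch, not code taken from it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two-part XLogRecPtr used in the patch. */
typedef struct
{
    uint32_t xlogid;   /* high half of the WAL location */
    uint32_t xrecoff;  /* low half of the WAL location */
} Lsn;

/* True if a <= b; this mirrors what XLByteLE() tests in the patch. */
static bool
lsn_le(Lsn a, Lsn b)
{
    return a.xlogid < b.xlogid ||
        (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * Sketch of the standby-side precondition. The backup is refused when the
 * starting restartpoint was taken with full_page_writes off, or when a
 * record that disabled full_page_writes was replayed at or after the
 * backup start location.
 */
static bool
standby_backup_allowed(bool checkpoint_fpw, Lsn start, Lsn last_fpw_disable)
{
    if (!checkpoint_fpw)
        return false;
    if (lsn_le(start, last_fpw_disable))
        return false;
    return true;
}
```

Note that a check of this kind can only see what was written to WAL, so it cannot notice full-page images removed afterwards by a tool such as pg_lesslog, which is the gap raised above.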

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#56

Robert Haas

robertmhaas@gmail.com

about 14 years ago

In reply to: Heikki Linnakangas (#55)

Re: Online base backup from the hot-standby

On Mon, Oct 24, 2011 at 11:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 24.10.2011 15:29, Fujii Masao wrote:

In your patch, FPW is always WAL-logged at startup even when FPW has
not been changed since last shutdown. I don't think that's required.
I changed the recovery code so that it keeps track of last FPW indicated
by WAL record. Then, at end of startup, if that FPW is equal to FPW
specified in postgresql.conf (which means that FPW has not been changed
since last shutdown or crash), WAL-logging of FPW is skipped. This change
prevents unnecessary WAL-logging. Thoughts?

One problem with this whole FPW-tracking is that pg_lesslog makes it fail.
I'm not sure what we need to do about that - maybe just add a warning to the
docs. But it leaves a bit of a bad taste in my mouth. Usually we try to make
features work orthogonally, without dependencies on other settings. Now this
feature requires that full_page_writes is turned on in the master, and also
that you don't use pg_lesslog to compress the WAL segments or your base
backup might be corrupt. The procedure to take a backup from the standby
seems more complicated than taking it on the master - there are more steps
to follow.

Doing it on the master isn't as easy as I'd like it to be, either.

But it's not really clear how to make it simpler.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#57

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Heikki Linnakangas (#54)

Re: Online base backup from the hot-standby

Thanks for the review!

On Tue, Oct 25, 2011 at 12:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 24.10.2011 15:29, Fujii Masao wrote:
+    <listitem>
+     <para>
+      Copy the pg_control file from the cluster directory to the global
+      sub-directory of the backup. For example:
+ <programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+ </programlisting>
+     </para>
+    </listitem>
Why is this step required? The control file is overwritten by information
from the backup_label anyway, no?

Yes, when recovery starts, the control file is overwritten. But before that,
we retrieve the minimum recovery point from the control file. Then it's used
as the backup end location.

During recovery, pg_stop_backup() cannot write an end-of-backup record.
So, in a standby-only backup, another way to obtain the backup end location
(instead of an end-of-backup record) is required. Ishiduka-san used the
control file for that, following your suggestion ;)
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01405.php
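
As a rough illustration of why the copied control file matters, the sketch below reads the "Minimum recovery ending location" line as printed by pg_controldata and checks that it is not 0/0. The helper names are hypothetical, and the code assumes untranslated (English) pg_controldata output with one line passed in at a time.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Parse one line of pg_controldata output. Returns true and fills the two
 * halves of the location if the line is the minimum recovery point line.
 */
static bool
parse_min_recovery_line(const char *line, uint32_t *xlogid, uint32_t *xrecoff)
{
    static const char label[] = "Minimum recovery ending location:";
    size_t      label_len = sizeof(label) - 1;
    unsigned int hi;
    unsigned int lo;

    if (strncmp(line, label, label_len) != 0)
        return false;
    /* The leading space in the format skips the padding after the label. */
    if (sscanf(line + label_len, " %X/%X", &hi, &lo) != 2)
        return false;

    *xlogid = hi;
    *xrecoff = lo;
    return true;
}

/* A location of 0/0 means there is no usable backup end location. */
static bool
backup_end_is_known(uint32_t xlogid, uint32_t xrecoff)
{
    return xlogid != 0 || xrecoff != 0;
}
```

A backup script could run this over the control file copied into the backup and reject the backup when the location is 0/0, which is what one would expect to see if the standby was promoted before the control file was copied.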

+    <listitem>
+     <para>
+      Again connect to the database as a superuser, and execute
+      <function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+     </para>
+    </listitem>

How do you ensure that all the required WAL segments have been archived,
then?

The patch doesn't provide any way to ensure that; in other words, it assumes
that this is the user's responsibility. A user who wants to ensure it needs to
calculate the backup start and end WAL files from the results of
pg_start_backup() and pg_stop_backup() respectively, and then wait until those
files have appeared in the archive. Also, if a required WAL file has not been
archived yet, the user might need to execute pg_switch_xlog() on the master.
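
The calculation mentioned here could look roughly like the following sketch. It assumes the default 16 MB WAL segment size and the 24-character segment file naming used by this era of the code (timeline, log id, segment number, each as eight hex digits). The timeline ID has to be supplied by the caller because it is not part of the location string, and both function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

#define WAL_SEG_SIZE      0x1000000u  /* 16 MB, the default segment size (assumption) */
#define WAL_FILE_NAME_LEN 24

/* Parse a location such as "0/2000020". Returns 0 on success, -1 otherwise. */
static int
parse_lsn(const char *text, uint32_t *xlogid, uint32_t *xrecoff)
{
    unsigned int hi;
    unsigned int lo;
    char        extra;

    /* Exactly two fields must match; a third match means trailing junk. */
    if (sscanf(text, "%X/%X%c", &hi, &lo, &extra) != 2)
        return -1;

    *xlogid = hi;
    *xrecoff = lo;
    return 0;
}

/* Build the name of the WAL segment that contains the given location. */
static void
wal_file_name(char out[WAL_FILE_NAME_LEN + 1], uint32_t tli,
              uint32_t xlogid, uint32_t xrecoff)
{
    snprintf(out, WAL_FILE_NAME_LEN + 1, "%08X%08X%08X",
             (unsigned int) tli, (unsigned int) xlogid,
             (unsigned int) (xrecoff / WAL_SEG_SIZE));
}
```

With the names computed for the start and end locations, a script can poll the archive directory until every segment in that range is present. A location that falls exactly on a segment boundary may need the previous segment instead, so treat the result as a starting point rather than an exact rule.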

If we change pg_stop_backup() so that, even during recovery, it waits until
all required WAL files have been archived, we would need to WAL-log
the completion of WAL archiving on the master. This would enable the standby to
check whether the specified WAL files have been archived. Should we change
the patch in this way? Even if we do, you might still need to execute
pg_switch_xlog() on the master, and pg_stop_backup() might wait
indefinitely if the master is not active.

+   </orderedlist>
+    </para>
+
+    <para>
+     You cannot use the <application>pg_basebackup</> tool to take the backup
+     from the standby.
+    </para>

Why not? We have cascading replication now.

Because no one has implemented that feature.

Yeah, we have cascading replication, but without adopting the standby-only
backup patch, pg_basebackup cannot execute do_pg_start_backup() and
do_pg_stop_backup() during recovery. So we can regard the patch that
Ishiduka-san proposed as the first step toward extending pg_basebackup so
that it can take a backup from the standby.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#58

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Heikki Linnakangas (#55)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 12:33 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

One problem with this whole FPW-tracking is that pg_lesslog makes it fail.
I'm not sure what we need to do about that - maybe just add a warning to the
docs. But it leaves a bit bad feeling in my mouth. Usually we try to make
features work orthogonally, without dependencies to other settings. Now this
feature requires that full_page_writes is turned on in the master, and also
that you don't use pg_lesslog to compress the WAL segments or your base
backup might be corrupt.

Right, pg_lesslog users cannot use the documented procedure. They need to
follow a more complex one:

1. Execute pg_start_backup() in the master, and save its return value.
2. Wait until the backup starting checkpoint record has been replayed
in the standby. You can do this by comparing the return value of
pg_start_backup() with pg_last_xlog_replay_location().
3. Do the documented standby-only backup procedure.
4. Execute pg_stop_backup() in the master.

This is complicated, but I'm not sure how we can simplify it. Anyway, we can
document this procedure for pg_lesslog users. Should we?
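
For step 2, the two locations have to be compared numerically, not as text
(as strings, "0/A000000" would sort after "0/10000000"). A rough sketch of
such a comparison follows; parse_lsn() and replay_reached() are hypothetical
helper names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: parse a WAL location in "X/Y" text form into a single
 * 64-bit value so that two locations can be compared numerically.
 * Returns 0 on success, -1 on parse failure.
 */
static int
parse_lsn(const char *text, uint64_t *lsn)
{
    unsigned int hi;
    unsigned int lo;

    if (sscanf(text, "%X/%X", &hi, &lo) != 2)
        return -1;
    *lsn = ((uint64_t) hi << 32) | lo;
    return 0;
}

/*
 * Returns 1 if the standby's last replayed location has reached the location
 * returned by pg_start_backup() on the master, 0 if not yet, -1 on bad input.
 */
static int
replay_reached(const char *start_lsn, const char *replay_lsn)
{
    uint64_t start;
    uint64_t replay;

    if (parse_lsn(start_lsn, &start) != 0 ||
        parse_lsn(replay_lsn, &replay) != 0)
        return -1;
    return replay >= start ? 1 : 0;
}
```

A backup script would poll the standby and move on to step 3 only once
replay_reached() returns 1.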

The procedure to take a backup from the standby
seems more complicated than taking it on the master - there are more steps
to follow.

Extending pg_basebackup so that it can take a backup from the standby would
make the procedure simpler to a certain extent, I think. Though a user still
needs to enable FPW in the master and must not use pg_lesslog.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#59

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 14 years ago

In reply to: Fujii Masao (#57)

Re: Online base backup from the hot-standby

On 25.10.2011 08:12, Fujii Masao wrote:

On Tue, Oct 25, 2011 at 12:24 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
On 24.10.2011 15:29, Fujii Masao wrote:
+<listitem>
+<para>
+      Copy the pg_control file from the cluster directory to the global
+      sub-directory of the backup. For example:
+<programlisting>
+ cp $PGDATA/global/pg_control /mnt/server/backupdir/global
+</programlisting>
+</para>
+</listitem>
Why is this step required? The control file is overwritten by information
from the backup_label anyway, no?
Yes, when recovery starts, the control file is overwritten. But before that,
we retrieve the minimum recovery point from the control file. Then it's used
as the backup end location.

During recovery, pg_stop_backup() cannot write an end-of-backup record.
So, in standby-only backup, other way to retrieve the backup end location
(instead of an end-of-backup record) is required. Ishiduka-san used the
control file as that, according to your suggestion ;)
http://archives.postgresql.org/pgsql-hackers/2011-05/msg01405.php

Oh :-)

+<para>
+      Again connect to the database as a superuser, and execute
+<function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+</para>
+</listitem>
How do you ensure that all the required WAL segments have been archived,
then?
The patch doesn't provide any capability to ensure that, IOW assumes that's
a user responsibility. If a user wants to ensure that, he/she needs to calculate
the backup start and end WAL files from the result of pg_start_backup()
and pg_stop_backup() respectively, and needs to wait until those files have
appeared in the archive. Also if the required WAL file has not been archived
yet, a user might need to execute pg_switch_xlog() in the master.

Frankly, I think this whole thing is too fragile. The procedure is
superficially similar to what you do on master: run pg_start_backup(),
rsync data directory, run pg_stop_backup(), but is actually subtly
different and more complicated. If you don't know that, and don't follow
the full procedure, you get a corrupt backup. And the backup might look
ok, and might even sometimes work, which means that you won't notice in
quick testing. That's a *huge* foot-gun.

I think we need to step back and find a way to make this:
a) less complicated, or at least
b) more robust, so that if you don't follow the procedure, you get an error.

With pg_basebackup, we have a fighting chance of getting this right,
because we have more control over how the backup is made. For example,
we can co-operate with the buffer manager to avoid torn-pages,
eliminating the need for full_page_writes=on, and we can include a
control file with the correct end-of-backup location automatically,
without requiring user intervention. pg_basebackup is less flexible than
the pg_start/stop_backup method, and unfortunately you're more likely to
need the flexibility in a more complicated setup with a hot standby
server and all, but making the generic pg_start/stop_backup method work
seems infeasible at the moment.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#60

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Heikki Linnakangas (#59)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 3:44 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

+<para>
+      Again connect to the database as a superuser, and execute
+<function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+</para>
+</listitem>
How do you ensure that all the required WAL segments have been archived,
then?
The patch doesn't provide any capability to ensure that, IOW assumes that's
a user responsibility. If a user wants to ensure that, he/she needs to calculate
the backup start and end WAL files from the result of pg_start_backup()
and pg_stop_backup() respectively, and needs to wait until those files have
appeared in the archive. Also if the required WAL file has not been archived
yet, a user might need to execute pg_switch_xlog() in the master.
Frankly, I think this whole thing is too fragile. The procedure is
superficially similar to what you do on master: run pg_start_backup(), rsync
data directory, run pg_stop_backup(), but is actually subtly different and
more complicated. If you don't know that, and don't follow the full
procedure, you get a corrupt backup. And the backup might look ok, and might
even sometimes work, which means that you won't notice in quick testing.
That's a *huge* foot-gun.

I think we need to step back and find a way to make this:
a) less complicated, or at least
b) more robust, so that if you don't follow the procedure, you get an error.

One idea to make this more robust is to change PostgreSQL so that
it writes buffer pages to a temporary space instead of the database files
during a backup. This means that there are no torn pages in the database files
of the backup. After the backup, the data blocks are written back to the
database files over time. When recovery starts from that backup (i.e.,
backup_label is found), it clears the temporary space in the backup first and
continues recovery by using the database files, which contain no torn pages.
OTOH, in crash recovery (i.e., backup_label is not found), recovery is performed
by using both the database files and the temporary space. This whole approach
would make the standby-only backup available even if FPW is disabled in the
master, and you would not need to care about the order in which the control
file is backed up.

But this idea looks like overkill. It seems very complicated to implement, and
likely to invite other bugs. I don't have any other good and simple
idea for now.

With pg_basebackup, we have a fighting chance of getting this right, because
we have more control over how the backup is made. For example, we can
co-operate with the buffer manager to avoid torn-pages, eliminating the need
for full_page_writes=on, and we can include a control file with the correct
end-of-backup location automatically, without requiring user intervention.
pg_basebackup is less flexible than the pg_start/stop_backup method, and
unfortunately you're more likely to need the flexibility in a more
complicated setup with a hot standby server and all, but making the generic
pg_start/stop_backup method work seems infeasible at the moment.

Yes, so should we give up supporting the manual procedure, and first extend
pg_basebackup for the standby-only backup? I can live with this.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#61

Magnus Hagander

magnus@hagander.net

about 14 years ago

In reply to: Fujii Masao (#60)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 10:50, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Oct 25, 2011 at 3:44 PM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:
+<para>
+      Again connect to the database as a superuser, and execute
+<function>pg_stop_backup</>. This terminates the backup mode, but does not
+      perform a switch to the next WAL segment, create a backup history file and
+      wait for all required WAL segments to be archived,
+      unlike that during normal processing.
+</para>
+</listitem>
How do you ensure that all the required WAL segments have been archived,
then?
The patch doesn't provide any capability to ensure that, IOW assumes that's
a user responsibility. If a user wants to ensure that, he/she needs to calculate
the backup start and end WAL files from the result of pg_start_backup()
and pg_stop_backup() respectively, and needs to wait until those files have
appeared in the archive. Also if the required WAL file has not been archived
yet, a user might need to execute pg_switch_xlog() in the master.
Frankly, I think this whole thing is too fragile. The procedure is
superficially similar to what you do on master: run pg_start_backup(), rsync
data directory, run pg_stop_backup(), but is actually subtly different and
more complicated. If you don't know that, and don't follow the full
procedure, you get a corrupt backup. And the backup might look ok, and might
even sometimes work, which means that you won't notice in quick testing.
That's a *huge* foot-gun.

I think we need to step back and find a way to make this:
a) less complicated, or at least
b) more robust, so that if you don't follow the procedure, you get an error.
One idea to make the way more robust is to change the PostgreSQL so that
it writes the buffer page to a temporary space instead of database file
during a backup. This means that there is no torn-pages in the database files
of the backup. After backup, the data blocks are written back to the database
files over time. When recovery starts from that backup(i.e., backup_label is
found), it clears the temporary space in the backup first and continues recovery
by using the database files which contain no torn-pages. OTOH,
in crash recovery (i.e., backup_label is not found), recovery is performed by
using both database files and temporary space. This whole approach would
make the standby-only backup available even if FPW is disabled in the master
and you don't care about the order to backup the control file.

But this idea looks overkill. It seems very complicated to implement that, and
likely to invite other bugs. I don't have any other good and simple
idea for now.

With pg_basebackup, we have a fighting chance of getting this right, because
we have more control over how the backup is made. For example, we can
co-operate with the buffer manager to avoid torn-pages, eliminating the need
for full_page_writes=on, and we can include a control file with the correct
end-of-backup location automatically, without requiring user intervention.
pg_basebackup is less flexible than the pg_start/stop_backup method, and
unfortunately you're more likely to need the flexibility in a more
complicated setup with a hot standby server and all, but making the generic
pg_start/stop_backup method work seems infeasible at the moment.

Yes, so we should give up supporting manual procedure? And extend
pg_basebackup for the standby-only backup, first? I can live with this.

I don't think we should necessarily give up completely. But doing a
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to "get right", given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#62

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Magnus Hagander (#61)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

I don't think we should necessarily give up completely. But doing a
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to "get right", given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

Agreed. The question is: how far should we change pg_basebackup to
"get it right"? I think it's not difficult to change it so that it backs up
the control file at the end. But eliminating the need for full_page_writes=on
does not seem easy. No? So I'm not inclined to do that, at least in the first
commit. Otherwise, I'm afraid the patch would become huge.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#63

Magnus Hagander

magnus@hagander.net

about 14 years ago

In reply to: Fujii Masao (#62)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 13:54, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

I don't think we should necessarily give up completely. But doing a
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to "get right", given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

Agreed. The question is -- how far should we change pg_basebackup to
"get right"? I think it's not difficult to change it so that it backs up
the control file at the end. But eliminating the need for full_page_writes=on
seems not easy. No? So I'm not inclined to do that in at least first commit.
Otherwise, I'm afraid the patch would become huge.

It's more server side of base backups than the actual pg_basebackup
tool of course, but I'm sure that's what we're all referring to here.

Personally, I'd see the fpw stuff as part of the infrastructure
needed. Meaning that the fpw stuff should go in *first*, and the
pg_basebackup stuff later.

If we want something to go in early, that could be as simple as a
version of pg_basebackup that runs against the slave but only if
full_page_writes=on on the master. If it's not, it throws an error.
Then we can improve upon that by adding handling of fpw=off, first by
infrastructure, then by tool.
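
As a rough illustration of what such a check might look like: the sketch
below assumes the standby knows whether the backup-starting checkpoint was
written with full_page_writes enabled, and the location of the last replayed
record that disabled it (similar in spirit to the fullPageWrites and
lastFpwDisableRecPtr fields discussed in this thread). WalLocation and
standby_backup_is_safe() are hypothetical names, not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL location (log id plus byte offset). */
typedef struct
{
    uint32_t xlogid;
    uint32_t xrecoff;
} WalLocation;

/* True if a <= b. */
static int
loc_le(WalLocation a, WalLocation b)
{
    return a.xlogid < b.xlogid ||
        (a.xlogid == b.xlogid && a.xrecoff <= b.xrecoff);
}

/*
 * Returns 1 if a backup taken on the standby can be trusted with respect to
 * torn pages, 0 if an error should be raised instead.
 *
 * fpw_at_start:     full_page_writes as recorded at the backup start
 * backup_start:     location where the backup started
 * last_fpw_disable: location of the last replayed record that turned
 *                   full_page_writes off (0/0 if none was seen)
 */
static int
standby_backup_is_safe(int fpw_at_start, WalLocation backup_start,
                       WalLocation last_fpw_disable)
{
    if (!fpw_at_start)
        return 0;           /* WAL lacked full pages from the beginning */
    if (loc_le(backup_start, last_fpw_disable))
        return 0;           /* full_page_writes was disabled mid-backup */
    return 1;
}
```

The second test is the one that matters for a change made while the backup
is running; a disable record replayed before the backup started is harmless
as long as full_page_writes was back on at the starting checkpoint.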

Doing it piece by piece like that is probably a good idea, since as
you say, all at once will be pretty huge.

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

#64

Steve Singer

ssinger_pg@sympatico.ca

about 14 years ago

In reply to: Heikki Linnakangas (#59)

Re: Online base backup from the hot-standby

On 11-10-25 02:44 AM, Heikki Linnakangas wrote:

With pg_basebackup, we have a fighting chance of getting this right,
because we have more control over how the backup is made. For example,
we can co-operate with the buffer manager to avoid torn-pages,
eliminating the need for full_page_writes=on, and we can include a
control file with the correct end-of-backup location automatically,
without requiring user intervention. pg_basebackup is less flexible
than the pg_start/stop_backup method, and unfortunately you're more
likely to need the flexibility in a more complicated setup with a hot
standby server and all, but making the generic pg_start/stop_backup
method work seems infeasible at the moment.

Would pg_basebackup be able to work with the buffer manager on the slave
to avoid full_page_writes=on needing to be set on the master? (the
point of this is to be able to take the base backup without having the
backup program contact the master). If so could pg_start_backup() not
just put the buffer manager into the same state?

#65

Heikki Linnakangas

heikki.linnakangas@enterprisedb.com

about 14 years ago

In reply to: Steve Singer (#64)

Re: Online base backup from the hot-standby

On 25.10.2011 15:56, Steve Singer wrote:

On 11-10-25 02:44 AM, Heikki Linnakangas wrote:

With pg_basebackup, we have a fighting chance of getting this right,
because we have more control over how the backup is made. For example,
we can co-operate with the buffer manager to avoid torn-pages,
eliminating the need for full_page_writes=on, and we can include a
control file with the correct end-of-backup location automatically,
without requiring user intervention. pg_basebackup is less flexible
than the pg_start/stop_backup method, and unfortunately you're more
likely to need the flexibility in a more complicated setup with a hot
standby server and all, but making the generic pg_start/stop_backup
method work seems infeasible at the moment.

Would pg_basebackup be able to work with the buffer manager on the slave
to avoid full_page_writes=on needing to be set on the master? (the point
of this is to be able to take the base backup without having the backup
program contact the master).

In theory, yes. I'm not sure how difficult it would be in practice.
Currently, the walsender process just scans and copies everything in the
data directory, at the filesystem level. It would have to go through the
buffer manager instead, to avoid reading a page at the same time that
the buffer manager is writing it out.

If so could pg_start_backup() not just put the buffer manager into the same state?

No. The trick that pg_basebackup (= walsender) can do is to co-operate
with the buffer manager when reading each page. An external program
cannot do that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#66

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Magnus Hagander (#63)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 9:03 PM, Magnus Hagander <magnus@hagander.net> wrote:

On Tue, Oct 25, 2011 at 13:54, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

I don't think we should necessarily give up completely. But doing a
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to "get right", given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

Agreed. The question is -- how far should we change pg_basebackup to
"get right"? I think it's not difficult to change it so that it backs up
the control file at the end. But eliminating the need for full_page_writes=on
seems not easy. No? So I'm not inclined to do that in at least first commit.
Otherwise, I'm afraid the patch would become huge.

It's more server side of base backups than the actual pg_basebackup
tool of course, but I'm sure that's what we're all referring to here.

Personally, I'd see the fpw stuff as part of the infrastructure
needed. Meaning that the fpw stuff should go in *first*, and the
pg_basebackup stuff later.

Agreed. I'll extract FPW stuff from the patch that I submitted, and revise it
as the infrastructure patch.

The changes to pg_start_backup() etc. that Ishiduka-san made are also
server-side infrastructure. I will extract them as another infrastructure patch.

Ishiduka-san, if you have time, feel free to try the above, barring objection.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#67

Jun Ishiduka

ishizuka.jun@po.ntts.co.jp

about 14 years ago

In reply to: Fujii Masao (#66)

1 attachment(s)

Re: Online base backup from the hot-standby

On Tue, Oct 25, 2011 at 13:54, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Oct 25, 2011 at 7:19 PM, Magnus Hagander <magnus@hagander.net> wrote:

I don't think we should necessarily give up completely. But doing a
pg_basebackup way *first* seems reasonable - because it's going to be
the easiest one to "get right", given that we have more control there.
Doesn't mean we shouldn't extend it in the future...

Agreed. The question is -- how far should we change pg_basebackup to
"get right"? I think it's not difficult to change it so that it backs up
the control file at the end. But eliminating the need for full_page_writes=on
seems not easy. No? So I'm not inclined to do that in at least first commit.
Otherwise, I'm afraid the patch would become huge.

It's more server side of base backups than the actual pg_basebackup
tool of course, but I'm sure that's what we're all referring to here.

Personally, I'd see the fpw stuff as part of the infrastructure
needed. Meaning that the fpw stuff should go in *first*, and the
pg_basebackup stuff later.

Agreed. I'll extract FPW stuff from the patch that I submitted, and revise it
as the infrastructure patch.

The changes of pg_start_backup() etc that Ishiduka-san did are also
a server-side infrastructure. I will extract them as another infrastructure one.

Ishiduka-san, if you have time, feel free to try the above, barring objection.

Done.
I changed the name of the patch.

<Modifications>
Because the patch is now positioned as infrastructure:
* Removed the documentation.
* Changed pg_start/stop_backup() to raise an error when run on the standby.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

Attachments:

standby_online_backup_infra_11.patch (application/octet-stream)

diff -rcN postgresql/src/backend/access/transam/xlog.c postgresql_with_patch/src/backend/access/transam/xlog.c
*** postgresql/src/backend/access/transam/xlog.c	2011-10-28 04:42:33.000000000 +0900
--- postgresql_with_patch/src/backend/access/transam/xlog.c	2011-10-31 02:02:06.000000000 +0900
***************
*** 158,163 ****
--- 158,174 ----
  static XLogRecPtr LastRec;
  
  /*
+  * During recovery, lastFullPageWrites keeps track of full_page_writes that
+  * the replayed WAL records indicate. It's initialized with full_page_writes
+  * that the recovery starting checkpoint record indicates, and then updated
+  * each time an XLOG_FPW_CHANGE record is replayed. At the end of startup,
+  * if it equals full_page_writes in postgresql.conf, then full_page_writes
+  * has not been changed since the last shutdown or crash, so we skip
+  * writing an XLOG_FPW_CHANGE record.
+  */
+ static bool lastFullPageWrites;
+ 
+ /*
   * Local copy of SharedRecoveryInProgress variable. True actually means "not
   * known, need to check the shared state".
   */
***************
*** 356,361 ****
--- 367,382 ----
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
  	/*
+ 	 * fullPageWrites is shared-memory copy of walwriter's or startup
+ 	 * process' full_page_writes. All backends use this flag to determine
+ 	 * whether to write full-page to WAL, instead of using process-local
+ 	 * one. This is required because, when full_page_writes is changed
+ 	 * by SIGHUP, we must WAL-log it before it actually affects
+ 	 * WAL-logging by backends.
+ 	 */
+ 	bool		fullPageWrites;
+ 
+ 	/*
  	 * exclusiveBackup is true if a backup started with pg_start_backup() is
  	 * in progress, and nonExclusiveBackups is a counter indicating the number
  	 * of streaming base backups currently in progress. forcePageWrites is set
***************
*** 453,458 ****
--- 474,485 ----
  	/* Are we requested to pause recovery? */
  	bool		recoveryPause;
  
+ 	/*
+ 	 * lastFpwDisableRecPtr points to the start of the last replayed
+ 	 * XLOG_FPW_CHANGE record that instructs full_page_writes is disabled.
+ 	 */
+ 	XLogRecPtr	lastFpwDisableRecPtr;
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 665,671 ****
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 692,698 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, bool *backupDuringRecovery);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 710,715 ****
--- 737,743 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		fpwChange = (rmid == RM_XLOG_ID && info == XLOG_FPW_CHANGE);
  
  	/* cross-check on whether we should be here or not */
  	if (!XLogInsertAllowed())
***************
*** 761,770 ****
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
--- 789,798 ----
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, fullPageWrites and forcePageWrites
! 	 * could change under us, but we'll recheck them once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	INIT_CRC32(rdata_crc);
  	len = 0;
***************
*** 905,916 ****
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
--- 933,944 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data */
  		LWLockRelease(WALInsertLock);
***************
*** 1224,1229 ****
--- 1252,1266 ----
  		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
+ 	/*
+ 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
+ 	 * in shared memory before releasing WALInsertLock. This ensures that
+ 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
+ 	 * by this parameter change.
+ 	 */
+ 	if (fpwChange)
+ 		Insert->fullPageWrites = fullPageWrites;
+ 
  	LWLockRelease(WALInsertLock);
  
  	if (updrqst)
***************
*** 5155,5160 ****
--- 5192,5198 ----
  	checkPoint.redo.xlogid = 0;
  	checkPoint.redo.xrecoff = XLogSegSize + SizeOfXLogLongPHD;
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = fullPageWrites;
  	checkPoint.nextXidEpoch = 0;
  	checkPoint.nextXid = FirstNormalTransactionId;
  	checkPoint.nextOid = FirstBootstrapObjectId;
***************
*** 6025,6030 ****
--- 6063,6070 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	bool		backupDuringRecovery = false;
+ 	DBState	save_state;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6158,6164 ****
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6198,6205 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
! 						  &backupDuringRecovery))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6274,6279 ****
--- 6315,6322 ----
  	 */
  	ThisTimeLineID = checkPoint.ThisTimeLineID;
  
+ 	lastFullPageWrites = checkPoint.fullPageWrites;
+ 
  	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
***************
*** 6314,6319 ****
--- 6357,6363 ----
  		 * pg_control with any minimum recovery stop point obtained from a
  		 * backup history file.
  		 */
+ 		save_state = ControlFile->state;
  		if (InArchiveRecovery)
  			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
  		else
***************
*** 6334,6345 ****
  		}
  
  		/*
! 		 * set backupStartPoint if we're starting recovery from a base backup
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
--- 6378,6411 ----
  		}
  
  		/*
! 		 * Set backupStartPoint if we're starting recovery from a base backup.
! 		 *
! 		 * Set backupEndPoint if we're starting recovery from a base backup
! 		 * which was taken from the server in recovery mode. We confirm
! 		 * that minRecoveryPoint can be used as the backup end location by
! 		 * checking whether the database system status in pg_control indicates
! 		 * DB_IN_ARCHIVE_RECOVERY. If minRecoveryPoint is not available,
! 		 * there is no way to know the backup end location, so we cannot
! 		 * advance recovery any more. In this case, we have to cancel recovery
! 		 * before changing the database system status in pg_control to
! 		 * DB_IN_ARCHIVE_RECOVERY because otherwise subsequent
! 		 * restarted recovery would go through this check wrongly.
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
+ 
+ 			if (backupDuringRecovery)
+ 			{
+ 				if (save_state != DB_IN_ARCHIVE_RECOVERY)
+ 					ereport(FATAL,
+ 							(errmsg("database system status mismatches between "
+ 									"pg_control and backup_label"),
+ 							 errhint("This means that the backup is corrupted and you will "
+ 									 "have to use another backup for recovery.")));
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 			}
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
***************
*** 6625,6630 ****
--- 6691,6718 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 					XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 				{
+ 					/*
+ 					 * We have reached the end of the base backup, that is, the
+ 					 * point indicated by the minimum recovery point in the
+ 					 * pg_control that was backed up just before pg_stop_backup().
+ 					 * The data on disk is now consistent. Reset backupStartPoint
+ 					 * and backupEndPoint.
+ 					 */
+ 					elog(DEBUG1, "end of backup reached");
+ 
+ 					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 					MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 					MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 					ControlFile->backupEndRequired = false;
+ 					UpdateControlFile();
+ 
+ 					LWLockRelease(ControlFileLock);
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 6824,6829 ****
--- 6912,6933 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * Update full_page_writes in shared memory and write an
+ 	 * XLOG_FPW_CHANGE record before any resource manager writes cleanup
+ 	 * WAL records or a checkpoint record is written.
+ 	 *
+ 	 * Note that full_page_writes in shared memory is initialized with
+ 	 * lastFullPageWrites so that UpdateFullPageWrites() can check whether
+ 	 * it's equal to full_page_writes specified in postgresql.conf (i.e., whether
+ 	 * full_page_writes has been changed since last shutdown or crash) and
+ 	 * then skip writing an XLOG_FPW_CHANGE record if not.
+ 	 */
+ 	Insert->fullPageWrites = lastFullPageWrites;
+ 	LocalSetXLogInsertAllowed();
+ 	UpdateFullPageWrites();
+ 	LocalXLogInsertAllowed = -1;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 7681,7686 ****
--- 7785,7791 ----
  		LocalSetXLogInsertAllowed();
  
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = Insert->fullPageWrites;
  
  	/*
  	 * Compute new REDO record ptr = location of next XLOG record.
***************
*** 8382,8387 ****
--- 8487,8534 ----
  }
  
  /*
+  * Update full_page_writes in shared memory, and write an
+  * XLOG_FPW_CHANGE record if necessary.
+  */
+ void
+ UpdateFullPageWrites(void)
+ {
+ 	XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 
+ 	/*
+ 	 * Do nothing if full_page_writes has not been changed.
+ 	 *
+ 	 * It's safe to check the shared full_page_writes without the lock,
+ 	 * because we can guarantee that there is no concurrently running
+ 	 * process which can update it.
+ 	 */
+ 	if (fullPageWrites == Insert->fullPageWrites)
+ 		return;
+ 
+ 	/*
+ 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
+ 	 * track of full_page_writes during archive recovery, if required.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 	{
+ 		XLogRecData	rdata;
+ 
+ 		rdata.data = (char *) (&fullPageWrites);
+ 		rdata.len = sizeof(bool);
+ 		rdata.buffer = InvalidBuffer;
+ 		rdata.next = NULL;
+ 
+ 		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
+ 	}
+ 	else
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		Insert->fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ }
+ 
+ /*
   * XLOG resource manager's routines
   *
   * Definitions of info values are in include/catalog/pg_control.h, though
***************
*** 8425,8431 ****
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8572,8579 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8594,8599 ****
--- 8742,8771 ----
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, XLogRecGetData(record), sizeof(bool));
+ 
+ 		/*
+ 		 * Update the LSN of the last replayed XLOG_FPW_CHANGE record
+ 		 * so that pg_start_backup() and pg_stop_backup() can check
+ 		 * whether full_page_writes has been disabled during online backup.
+ 		 */
+ 		if (!fpw)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			if (XLByteLT(xlogctl->lastFpwDisableRecPtr, ReadRecPtr))
+ 				xlogctl->lastFpwDisableRecPtr = ReadRecPtr;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
+ 		/* Keep track of full_page_writes */
+ 		lastFullPageWrites = fpw;
+ 	}
  }
  
  void
***************
*** 8607,8616 ****
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
--- 8779,8789 ----
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
+ 						 checkpoint->fullPageWrites ? "true" : "false",
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
***************
*** 8675,8680 ****
--- 8848,8860 ----
  						 xlrec.max_locks_per_xact,
  						 wal_level_str);
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, rec, sizeof(bool));
+ 		appendStringInfo(buf, "full_page_writes: %s", fpw ? "true" : "false");
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
***************
*** 8888,8893 ****
--- 9068,9074 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8899,8916 ****
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	if (RecoveryInProgress())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9080,9109 ----
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
+ 	recovery_in_progress = RecoveryInProgress();
+ 
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	/*
! 	 * During recovery, pg_start_backup() cannot be executed. A backup taken
! 	 * with pg_basebackup (non-exclusive) skips this error and proceeds.
! 	 */
! 	if (recovery_in_progress && exclusive)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level, because the fact that
! 	 * we are now executing pg_start_backup() during recovery means that
! 	 * wal_level is set to hot_standby on the master, i.e., WAL level is sufficient
! 	 * for making an online backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8932,8939 ****
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
  	 */
! 	RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
--- 9125,9137 ----
  	 * we won't have a history file covering the old timeline if pg_xlog
  	 * directory was not included in the base backup and the WAL archive was
  	 * cleared too before starting the backup.
+ 	 *
+ 	 * During recovery, we skip forcing XLOG file switch, which means that
+ 	 * the backup taken during recovery is not available for the special recovery
+ 	 * case described above.
  	 */
! 	if (!recovery_in_progress)
! 		RequestXLogSwitch();
  
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
***************
*** 8949,8954 ****
--- 9147,9155 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
+ 	 * Note that forcePageWrites has no effect during an online backup from
+ 	 * the server in recovery mode.
+ 	 *
  	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
***************
*** 8977,8988 ****
  
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
--- 9178,9199 ----
  
  		do
  		{
+ 			bool		checkpointfpw;
+ 
  			/*
! 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
+ 			 * During recovery, establish a restartpoint if possible. We use the last
+ 			 * restartpoint as the backup starting checkpoint. This means that two
+ 			 * successive backup runs can have the same checkpoint position.
+ 			 *
+ 			 * Since the fact that we are executing pg_start_backup() during
+ 			 * recovery means that bgwriter is running, we can use
+ 			 * RequestCheckpoint() to establish a restartpoint.
+ 			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
***************
*** 8998,9005 ****
--- 9209,9248 ----
  			LWLockAcquire(ControlFileLock, LW_SHARED);
  			checkpointloc = ControlFile->checkPoint;
  			startpoint = ControlFile->checkPointCopy.redo;
+ 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
  			LWLockRelease(ControlFileLock);
  
+ 			if (recovery_in_progress)
+ 			{
+ 				/* use volatile pointer to prevent code rearrangement */
+ 				volatile XLogCtlData *xlogctl = XLogCtl;
+ 				XLogRecPtr		recptr;
+ 
+ 				/*
+ 				 * Check to see if all WAL replayed during online backup (i.e.,
+ 				 * since last restartpoint used as backup starting checkpoint)
+ 				 * contain full-page writes.
+ 				 */
+ 				SpinLockAcquire(&xlogctl->info_lck);
+ 				recptr = xlogctl->lastFpwDisableRecPtr;
+ 				SpinLockRelease(&xlogctl->info_lck);
+ 
+ 				if (!checkpointfpw || XLByteLE(startpoint, recptr))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 							 errmsg("WAL generated with full_page_writes=off was replayed "
+ 									"since last restartpoint")));
+ 
+ 				/*
+ 				 * During recovery, since we don't use the end-of-backup WAL
+ 				 * record and don't write the backup history file, the starting WAL
+ 				 * location doesn't need to be unique. This means that two base
+ 				 * backups started at the same time might use the same checkpoint
+ 				 * as their starting location.
+ 				 */
+ 				gotUniqueStartpoint = true;
+ 			}
+ 
  			/*
  			 * If two base backups are started at the same time (in WAL sender
  			 * processes), we need to make sure that they use different
***************
*** 9039,9044 ****
--- 9282,9289 ----
  						 checkpointloc.xlogid, checkpointloc.xrecoff);
  		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  						 exclusive ? "pg_start_backup" : "streamed");
+ 		appendStringInfo(&labelfbuf, "SYSTEM STATUS: %s\n",
+ 						 recovery_in_progress ? "recovery" : "in production");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
***************
*** 9133,9138 ****
--- 9378,9385 ----
   * history file at the beginning of archive recovery, but we now use the WAL
   * record for that and the file is for informational and debug purposes only.
   *
+  * During recovery, we only remove the backup label file.
+  *
   * Note: different from CancelBackup which just cancels online backup mode.
   */
  Datum
***************
*** 9159,9164 ****
--- 9406,9412 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9169,9174 ****
--- 9417,9423 ----
  	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
+ 	char		systemstatus[20];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
***************
*** 9178,9196 ****
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	if (RecoveryInProgress())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9427,9458 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	char	   *ptr;
+ 
+ 	recovery_in_progress = RecoveryInProgress();
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	/*
! 	 * During recovery, pg_stop_backup() cannot be executed. A backup taken
! 	 * with pg_basebackup (non-exclusive) skips this error and proceeds.
! 	 */
! 	if (recovery_in_progress && exclusive)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level, because the fact that
! 	 * we are now executing pg_stop_backup() means that wal_level is set to
! 	 * hot_standby on the master, i.e., WAL level is sufficient for making an online
! 	 * backup.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9281,9286 ****
--- 9543,9619 ----
  	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
  
  	/*
+ 	 * Parse the SYSTEM STATUS line, and check that database system
+ 	 * status matches between pg_start_backup() and pg_stop_backup().
+ 	 */
+ 	ptr = strstr(remaining, "SYSTEM STATUS:");
+ 	if (ptr == NULL || sscanf(ptr, "SYSTEM STATUS: %19s\n", systemstatus) != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	if (strcmp(systemstatus, "recovery") == 0 && !recovery_in_progress)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("pg_stop_backup() was executed during normal processing "
+ 						"though pg_start_backup() was executed during recovery"),
+ 				 errhint("The database backup will not be usable.")));
+ 
+ 	/*
+ 	 * During recovery, we don't write an end-of-backup record. We
+ 	 * assume that pg_control was backed up just before pg_stop_backup(),
+ 	 * so its minimum recovery point can serve as the backup end
+ 	 * location. Even without an end-of-backup record, we can thus check
+ 	 * whether we've reached the end of backup when starting recovery
+ 	 * from this backup.
+ 	 *
+ 	 * We neither force a switch to a new WAL file nor wait for all the required
+ 	 * files to be archived. This is okay if we use the backup to start
+ 	 * the standby. But, if it's for an archive recovery, to ensure all the
+ 	 * required files are available, a user should wait for them to be archived,
+ 	 * or include them in the backup after pg_stop_backup().
+ 	 *
+ 	 * We return the current minimum recovery point as the backup end
+ 	 * location. Note that it can be larger than the exact backup end
+ 	 * location if the minimum recovery point has been updated since the
+ 	 * backup of pg_control. The return value of pg_stop_backup() is often
+ 	 * used by a user to calculate the required files. Returning an approximate
+ 	 * location is harmless for that use because it's guaranteed not to be
+ 	 * smaller than the exact backup end location.
+ 	 *
+ 	 * XXX currently a backup history file is for informational and debug
+ 	 * purposes only. It's not essential for an online backup. Furthermore,
+ 	 * even if it's created, it will not be archived during recovery because
+ 	 * an archiver is not invoked. So it doesn't seem worthwhile to write
+ 	 * a backup history file during recovery.
+ 	 */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		XLogRecPtr	recptr;
+ 
+ 		/*
+ 		 * Check to see if all WAL replayed during online backup contain
+ 		 * full-page writes.
+ 		 */
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		recptr = xlogctl->lastFpwDisableRecPtr;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, recptr))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("WAL generated with full_page_writes=off was replayed "
+ 							"during online backup")));
+ 
+ 		LWLockAcquire(ControlFileLock, LW_SHARED);
+ 		stoppoint = ControlFile->minRecoveryPoint;
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		return stoppoint;
+ 	}
+ 
+ 	/*
  	 * Write the backup-end xlog record
  	 */
  	rdata.data = (char *) (&startpoint);
***************
*** 9797,9814 ****
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
  	/*
  	 * See if label file is present
--- 10130,10151 ----
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE. If this backup_label
!  * was created during recovery, *backupDuringRecovery is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
! 				  bool *backupDuringRecovery)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		systemstatus[20];
  
  	*backupEndRequired = false;
+ 	*backupDuringRecovery = false;
  
  	/*
  	 * See if label file is present
***************
*** 9842,9857 ****
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD line is new in 9.1. We can't restore from an older backup
! 	 * anyway, but since the information on it is not strictly required, don't
! 	 * error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
--- 10179,10200 ----
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD and SYSTEM STATUS lines are new in 9.2. We can't
! 	 * restore from an older backup anyway, but since the information on it
! 	 * is not strictly required, don't error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
+ 	if (fscanf(lfp, "SYSTEM STATUS: %19s\n", systemstatus) == 1)
+ 	{
+ 		if (strcmp(systemstatus, "recovery") == 0)
+ 			*backupDuringRecovery = true;
+ 	}
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
diff -rcN postgresql/src/backend/postmaster/postmaster.c postgresql_with_patch/src/backend/postmaster/postmaster.c
*** postgresql/src/backend/postmaster/postmaster.c	2011-10-28 04:42:33.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/postmaster.c	2011-10-31 01:48:34.000000000 +0900
***************
*** 289,294 ****
--- 289,296 ----
  static PMState pmState = PM_INIT;
  
  static bool ReachedNormalRunning = false;		/* T if we've reached PM_RUN */
+ static bool OnlineBackupAllowed = false;		/* T if we've reached PM_RUN or
+ 												 * PM_HOT_STANDBY */
  
  bool		ClientAuthInProgress = false;		/* T during new-client
  												 * authentication */
***************
*** 2119,2136 ****
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 
! 				/*
! 				 * If we're in recovery, we can't kill the startup process
! 				 * right away, because at present doing so does not release
! 				 * its locks.  We might want to change this in a future
! 				 * release.  For the time being, the PM_WAIT_READONLY state
! 				 * indicates that we're waiting for the regular (read only)
! 				 * backends to die off; once they do, we'll kill the startup
! 				 * and walreceiver processes.
! 				 */
! 				pmState = (pmState == PM_RUN) ?
! 					PM_WAIT_BACKUP : PM_WAIT_READONLY;
  			}
  
  			/*
--- 2121,2127 ----
  				/* and the walwriter too */
  				if (WalWriterPID != 0)
  					signal_child(WalWriterPID, SIGTERM);
! 				pmState = PM_WAIT_BACKUP;
  			}
  
  			/*
***************
*** 2313,2318 ****
--- 2304,2310 ----
  			 */
  			FatalError = false;
  			ReachedNormalRunning = true;
+ 			OnlineBackupAllowed = true;
  			pmState = PM_RUN;
  
  			/*
***************
*** 2854,2862 ****
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
  		 */
  		if (!BackupInProgress())
! 			pmState = PM_WAIT_BACKENDS;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
--- 2846,2862 ----
  	{
  		/*
  		 * PM_WAIT_BACKUP state ends when online backup mode is not active.
+ 		 *
+ 		 * If we're in recovery, we can't kill the startup process right away,
+ 		 * because at present doing so does not release its locks.  We might
+ 		 * want to change this in a future release.  For the time being,
+ 		 * the PM_WAIT_READONLY state indicates that we're waiting for
+ 		 * the regular (read only) backends to die off; once they do,
+ 		 * we'll kill the startup and walreceiver processes.
  		 */
  		if (!BackupInProgress())
! 			pmState = ReachedNormalRunning ?
! 				PM_WAIT_BACKENDS : PM_WAIT_READONLY;
  	}
  
  	if (pmState == PM_WAIT_READONLY)
***************
*** 3025,3037 ****
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running (and not, for example, while running under Hot Standby)
! 			 * it only makes sense to do this if we reached normal running. If
! 			 * we're still in recovery, the backup file is one we're
! 			 * recovering *from*, and we must keep it around so that recovery
! 			 * restarts from the right place.
  			 */
! 			if (ReachedNormalRunning)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
--- 3025,3037 ----
  			/*
  			 * Terminate backup mode to avoid recovery after a clean fast
  			 * shutdown.  Since a backup can only be taken during normal
! 			 * running and hot standby, it only makes sense to do this
! 			 * if we reached normal running or hot standby. If we have not
! 			 * reached a consistent recovery state yet, the backup file is
! 			 * one we're recovering *from*, and we must keep it around
! 			 * so that recovery restarts from the right place.
  			 */
! 			if (OnlineBackupAllowed)
  				CancelBackup();
  
  			/* Normal exit from the postmaster is here */
***************
*** 4188,4193 ****
--- 4188,4194 ----
  		ereport(LOG,
  		(errmsg("database system is ready to accept read only connections")));
  
+ 		OnlineBackupAllowed = true;
  		pmState = PM_HOT_STANDBY;
  	}
  
diff -rcN postgresql/src/backend/postmaster/walwriter.c postgresql_with_patch/src/backend/postmaster/walwriter.c
*** postgresql/src/backend/postmaster/walwriter.c	2011-10-28 04:42:33.000000000 +0900
--- postgresql_with_patch/src/backend/postmaster/walwriter.c	2011-10-31 01:48:34.000000000 +0900
***************
*** 216,221 ****
--- 216,228 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * There is a race condition: full_page_writes might have been changed
+ 	 * since the startup process updated it in shared memory. To handle
+ 	 * this case, we always update shared full_page_writes here.
+ 	 */
+ 	UpdateFullPageWrites();
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 236,241 ****
--- 243,254 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * If full_page_writes has been changed by SIGHUP, we update it
+ 			 * in shared memory and write an XLOG_FPW_CHANGE record.
+ 			 */
+ 			UpdateFullPageWrites();
  		}
  		if (shutdown_requested)
  		{
diff -rcN postgresql/src/backend/utils/misc/guc.c postgresql_with_patch/src/backend/utils/misc/guc.c
*** postgresql/src/backend/utils/misc/guc.c	2011-10-28 04:42:34.000000000 +0900
--- postgresql_with_patch/src/backend/utils/misc/guc.c	2011-10-31 01:48:34.000000000 +0900
***************
*** 130,136 ****
  extern char *default_tablespace;
  extern char *temp_tablespaces;
  extern bool synchronize_seqscans;
- extern bool fullPageWrites;
  extern int	ssl_renegotiation_limit;
  extern char *SSLCipherSuites;
  
--- 130,135 ----
diff -rcN postgresql/src/bin/pg_controldata/pg_controldata.c postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c
*** postgresql/src/bin/pg_controldata/pg_controldata.c	2011-10-28 04:42:34.000000000 +0900
--- postgresql_with_patch/src/bin/pg_controldata/pg_controldata.c	2011-10-31 01:48:33.000000000 +0900
***************
*** 209,214 ****
--- 209,216 ----
  		   ControlFile.checkPointCopy.redo.xrecoff);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 232,237 ****
--- 234,242 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
  	printf(_("Current wal_level setting:            %s\n"),
diff -rcN postgresql/src/bin/pg_ctl/pg_ctl.c postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c
*** postgresql/src/bin/pg_ctl/pg_ctl.c	2011-10-28 04:42:34.000000000 +0900
--- postgresql_with_patch/src/bin/pg_ctl/pg_ctl.c	2011-10-31 01:48:33.000000000 +0900
***************
*** 885,899 ****
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 885,902 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connections have not been
! 		 * allowed yet, online backup mode cannot be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
***************
*** 973,987 ****
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present, we're recovering from an online
! 		 * backup instead of performing one.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0 &&
! 			stat(recovery_file, &statbuf) != 0)
  		{
! 			print_msg(_("WARNING: online backup mode is active\n"
! 						"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
--- 976,993 ----
  		/*
  		 * If backup_label exists, an online backup is running. Warn the user
  		 * that smart shutdown will wait for it to finish. However, if
! 		 * recovery.conf is also present and new connections have not been
! 		 * allowed yet, online backup mode cannot be active.
  		 */
  		if (shutdown_mode == SMART_MODE &&
! 			stat(backup_file, &statbuf) == 0)
  		{
! 			if (stat(recovery_file, &statbuf) != 0)
! 				print_msg(_("WARNING: online backup mode is active\n"
! 							"Shutdown will not complete until pg_stop_backup() is called.\n\n"));
! 			else
! 				print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
! 							"If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));
  		}
  
  		print_msg(_("waiting for server to shut down..."));
diff -rcN postgresql/src/bin/pg_resetxlog/pg_resetxlog.c postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c
*** postgresql/src/bin/pg_resetxlog/pg_resetxlog.c	2011-10-28 04:42:34.000000000 +0900
--- postgresql_with_patch/src/bin/pg_resetxlog/pg_resetxlog.c	2011-10-31 01:48:33.000000000 +0900
***************
*** 489,494 ****
--- 489,495 ----
  	ControlFile.checkPointCopy.redo.xlogid = 0;
  	ControlFile.checkPointCopy.redo.xrecoff = SizeOfXLogLongPHD;
  	ControlFile.checkPointCopy.ThisTimeLineID = 1;
+ 	ControlFile.checkPointCopy.fullPageWrites = false;
  	ControlFile.checkPointCopy.nextXidEpoch = 0;
  	ControlFile.checkPointCopy.nextXid = FirstNormalTransactionId;
  	ControlFile.checkPointCopy.nextOid = FirstBootstrapObjectId;
***************
*** 503,509 ****
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 504,510 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 569,574 ****
--- 570,577 ----
  		   sysident_str);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes:       %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 637,642 ****
--- 640,647 ----
  	ControlFile.minRecoveryPoint.xrecoff = 0;
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
  
  	/*
diff -rcN postgresql/src/include/access/xlog.h postgresql_with_patch/src/include/access/xlog.h
*** postgresql/src/include/access/xlog.h	2011-10-28 04:42:34.000000000 +0900
--- postgresql_with_patch/src/include/access/xlog.h	2011-10-31 01:48:33.000000000 +0900
***************
*** 197,202 ****
--- 197,203 ----
  extern bool XLogArchiveMode;
  extern char *XLogArchiveCommand;
  extern bool EnableHotStandby;
+ extern bool fullPageWrites;
  extern bool log_checkpoints;
  
  /* WAL levels */
***************
*** 306,311 ****
--- 307,313 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void UpdateFullPageWrites(void);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
diff -rcN postgresql/src/include/catalog/pg_control.h postgresql_with_patch/src/include/catalog/pg_control.h
*** postgresql/src/include/catalog/pg_control.h	2011-10-28 04:42:34.000000000 +0900
--- postgresql_with_patch/src/include/catalog/pg_control.h	2011-10-31 01:48:33.000000000 +0900
***************
*** 21,27 ****
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	921
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
--- 21,27 ----
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	922
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
***************
*** 33,38 ****
--- 33,39 ----
  	XLogRecPtr	redo;			/* next RecPtr available when we began to
  								 * create CheckPoint (i.e. REDO start point) */
  	TimeLineID	ThisTimeLineID; /* current TLI */
+ 	bool			fullPageWrites;	/* current full_page_writes */
  	uint32		nextXidEpoch;	/* higher-order bits of nextXid */
  	TransactionId nextXid;		/* next free XID */
  	Oid			nextOid;		/* next free OID */
***************
*** 60,65 ****
--- 61,67 ----
  #define XLOG_BACKUP_END					0x50
  #define XLOG_PARAMETER_CHANGE			0x60
  #define XLOG_RESTORE_POINT				0x70
+ #define XLOG_FPW_CHANGE				0x80
  
  
  /*
***************
*** 138,143 ****
--- 140,152 ----
  	 * record, to make sure the end-of-backup record corresponds the base
  	 * backup we're recovering from.
  	 *
+ 	 * backupEndPoint is the backup end location, if we are recovering from
+ 	 * an online backup which was taken from the server in recovery mode
+ 	 * and haven't reached the end of backup yet. It is initialized to
+ 	 * the minimum recovery point in pg_control which was backed up just
+ 	 * before pg_stop_backup(). It is reset to zero when the end of backup
+ 	 * is reached, and we mustn't start up before that.
+ 	 *
  	 * If backupEndRequired is true, we know for sure that we're restoring
  	 * from a backup, and must see a backup-end record before we can safely
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
***************
*** 146,151 ****
--- 155,161 ----
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
+ 	XLogRecPtr	backupEndPoint;
  	bool		backupEndRequired;
  
  	/*

#68

Josh Berkus

josh@agliodbs.com

about 14 years ago

In reply to: Magnus Hagander (#63)

Re: Online base backup from the hot-standby

On 10/25/11 5:03 AM, Magnus Hagander wrote:

If we want something to go in early, that could be as simple as a
version of pg_basebackup that runs against the slave but only if
full_page_writes=on on the master. If it's not, it throws an error.
Then we can improve upon that by adding handling of fpw=off, first by
infrastructure, then by tool.

Just to be clear, the idea is to require full_page_writes to do backup
from the standby in 9.2, but to remove the requirement later?

--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com

#69

Fujii Masao

masao.fujii@gmail.com

about 14 years ago

In reply to: Josh Berkus (#68)

Re: Online base backup from the hot-standby

On Fri, Nov 4, 2011 at 8:06 AM, Josh Berkus <josh@agliodbs.com> wrote:

On 10/25/11 5:03 AM, Magnus Hagander wrote:

If we want something to go in early, that could be as simple as a
version of pg_basebackup that runs against the slave but only if
full_page_writes=on on the master. If it's not, it throws an error.
Then we can improve upon that by adding handling of fpw=off, first by
infrastructure, then by tool.

Just to be clear, the idea is to require full_page_writes to do backup
from the standby in 9.2, but to remove the requirement later?

Yes unless I'm missing something. Not sure if we can remove that in 9.2, though.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#70

Steve Singer

ssinger_pg@sympatico.ca

about 14 years ago

In reply to: Jun Ishiduka (#67)

Re: Online base backup from the hot-standby

On 11-10-31 12:11 AM, Jun Ishiduka wrote:

Agreed. I'll extract FPW stuff from the patch that I submitted, and revise it
as the infrastructure patch.

The changes to pg_start_backup() etc. that Ishiduka-san made are also
server-side infrastructure. I will extract them as a separate infrastructure patch.

Ishiduka-san, if you have time, feel free to try the above, barring objection.

Done.
Changed the name of the patch.

<Modifications>
Since the patch is now positioned as infrastructure:
* Removed the documentation.
* Changed pg_start/stop_backup() to raise an error when run on the standby.

Here is my stab at reviewing this version of the patch.

Submission
-------------------
The purpose of this version of the patch is to provide some
infrastructure needed for backups from the slave without having to solve
some of the usability issues raised in previous versions of the patch.

This patch applied fine to earlier versions of HEAD, but it doesn't today.
Simon moved some of the code touched by this patch as part of the xlog
refactoring. Please post an updated/rebased version of the patch.

I think the purpose of this patch is to provide

a) The code changes to record changes to fpw state of the master in WAL.
b) Track the state of FPW while in recovery mode

This version of the patch is NOT intended to allow SQL calls to
pg_start_backup() on slaves to work. This patch lays the infrastructure
for another patch (which I haven't seen) to allow pg_basebackup to do a
base backup from a slave assuming fpw=on has been set on the master (my
understanding of this patch is that it puts into place all of the pieces
required for the pg_basebackup patch to detect if fpw!=on and abort).
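
The two pieces listed above (recording fpw changes in WAL, and tracking them during recovery) can be illustrated with a minimal standalone sketch. This is not the patch's code: `wal_pos`, `fpw_tracker`, and the three functions are hypothetical names, loosely modeled on the `lastFullPageWrites` and `lastFpwDisableRecPtr` variables in the patch. The final check (reject the backup if a "disable" record was replayed at or after the backup start) is an assumption about how a later tool could use this state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a WAL location; the patch itself uses XLogRecPtr. */
typedef uint64_t wal_pos;

/* Hypothetical replay-side state: the last fpw value seen, and the position
 * of the last replayed record that turned full_page_writes off (0 = none). */
typedef struct
{
	bool		last_fpw;
	wal_pos		last_fpw_disable_pos;
} fpw_tracker;

/* Initialize from the fpw value stored in the starting checkpoint record. */
void
fpw_tracker_init(fpw_tracker *t, bool checkpoint_fpw)
{
	t->last_fpw = checkpoint_fpw;
	t->last_fpw_disable_pos = 0;
}

/* Replay one fpw-change record found at record_pos. */
void
fpw_tracker_replay_change(fpw_tracker *t, wal_pos record_pos, bool fpw)
{
	assert(record_pos > 0);		/* 0 is reserved for "never seen" */

	/* Remember only the latest position at which fpw was disabled. */
	if (!fpw && t->last_fpw_disable_pos < record_pos)
		t->last_fpw_disable_pos = record_pos;

	t->last_fpw = fpw;
}

/* Assumed check: a standby backup is usable only if fpw was on when it
 * started and no "disable" record was replayed at or after backup_start. */
bool
backup_window_is_safe(const fpw_tracker *t, bool fpw_at_start,
					  wal_pos backup_start)
{
	return fpw_at_start && t->last_fpw_disable_pos < backup_start;
}
```

A caller would initialize the tracker from the checkpoint, feed it every fpw-change record during replay, and consult `backup_window_is_safe()` when the backup ends.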

The consensus upthread was to get this infrastructure in and figure out
a safe+usable way of doing a slave backup without pg_basebackup later.

The patch seems to do what I expect of it.

I don't see any issues with most of the code changes in this patch.
However I admit that even after reviewing many versions of this patch I
still am not familiar enough with the recovery code to comment on a lot
of the details.

One thing I did see:

In pg_ctl.c

!             if (stat(recovery_file, &statbuf) != 0)
!                 print_msg(_("WARNING: online backup mode is active\n"
!                             "Shutdown will not complete until pg_stop_backup() is called.\n\n"));
!             else
!                 print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
!                             "If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));

I am having difficulty understanding what this error message is trying
to tell me. I think it is telling me (based on the code comments) that
if I can't connect to the server because the server is not yet accepting
connections then I shouldn't worry about anything. However if the server
is accepting connections then I need to login and call pg_stop_backup().

Maybe
"WARNING: online backup mode is active. If your server is accepting
connections then you must connect as superuser and run pg_stop_backup()
before shutdown will complete"
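
As a rough sketch of how the suggested wording could slot into the two-branch logic quoted above: the function below is hypothetical (`smart_shutdown_warning` is not a pg_ctl function), and it takes the two file-existence results as booleans instead of calling stat(), so it is only an illustration of the decision and not a drop-in replacement.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: pick the smart-shutdown warning, if any.
 * backup_label_exists / recovery_conf_exists stand in for the stat() calls. */
const char *
smart_shutdown_warning(bool backup_label_exists, bool recovery_conf_exists)
{
	/* No backup_label: no online backup can be active, so no warning. */
	if (!backup_label_exists)
		return NULL;

	/* backup_label without recovery.conf: a backup is definitely running. */
	if (!recovery_conf_exists)
		return "WARNING: online backup mode is active\n"
			"Shutdown will not complete until pg_stop_backup() is called.\n\n";

	/* Both files present: a backup is active only if connections are allowed. */
	return "WARNING: online backup mode is active. If your server is accepting\n"
		"connections then you must connect as superuser and run pg_stop_backup()\n"
		"before shutdown will complete\n\n";
}
```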

I will wait on attempting to test the patch until you have sent a
version that applies against the current HEAD.

Regards.

--------------------------------------------
Jun Ishizuka
NTT Software Corporation
TEL：045-317-7018
E-Mail: ishizuka.jun@po.ntts.co.jp
--------------------------------------------

#71

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Steve Singer (#70)

1 attachment(s)

Re: Online base backup from the hot-standby

Sorry for the delay.

2011/11/15 Steve Singer <ssinger_pg@sympatico.ca>:

Here is my stab at reviewing this version of the patch.

Thanks for the review!

This version of the patch is NOT intended to allow SQL calls to
pg_start_backup() on slaves to work. This patch lays the infrastructure
for another patch (which I haven't seen) to allow pg_basebackup to do a base
backup from a slave assuming fpw=on has been set on the master (my
understanding of this patch is that it puts into place all of the pieces
required for the pg_basebackup patch to detect if fpw!=on and abort).

The amount of code change needed to allow pg_basebackup to make a backup from
the standby turned out to be small, so I ended up merging those changes with the
infrastructure patch. WIP patch attached. I'd be happy to split the patch again
if you want.

In pg_ctl.c

!             if (stat(recovery_file, &statbuf) != 0)
!                 print_msg(_("WARNING: online backup mode is active\n"
!                             "Shutdown will not complete until pg_stop_backup() is called.\n\n"));
!             else
!                 print_msg(_("WARNING: online backup mode is active if you can connect as a superuser to server\n"
!                             "If so, shutdown will not complete until pg_stop_backup() is called.\n\n"));

I am having difficulty understanding what this error message is trying to
tell me.   I think it is telling me (based on the code comments) that if I
can't connect to the server because the server is not yet accepting
connections then I shouldn't worry about anything.   However if the server
is accepting connections then I need to login and call pg_stop_backup().

Maybe
"WARNING: online backup mode is active. If your server is accepting
connections then you must connect as superuser and run pg_stop_backup()
before shutdown will complete"

The above change to pg_ctl.c was required because a standby-only backup could
create a new backup_label during recovery. But we have since decided to again
disallow calling pg_start_backup() and pg_stop_backup() during recovery, and to
allow only pg_basebackup to make a base backup from the standby, which means
that backup_label will not be created during recovery. So the above change to
pg_ctl.c is no longer required, and I excluded it from the patch.
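
The rule described here (no exclusive backup during recovery, only the non-exclusive kind that pg_basebackup uses) can be sketched as a small decision function. The enum and function names are hypothetical; the two conditions mirror the checks the attached patch adds at the top of do_pg_start_backup(), with `wal_level_sufficient` standing in for XLogIsNeeded().

```c
#include <stdbool.h>

/* Hypothetical result codes for the sketch below. */
typedef enum
{
	BACKUP_OK,
	BACKUP_ERR_RECOVERY_IN_PROGRESS,	/* exclusive backup during recovery */
	BACKUP_ERR_WAL_LEVEL				/* WAL level too low on a primary */
} backup_check;

/* Decide whether a base backup may start.
 * exclusive = started via pg_start_backup(); non-exclusive = pg_basebackup. */
backup_check
check_backup_allowed(bool recovery_in_progress, bool exclusive,
					 bool wal_level_sufficient)
{
	/* Only non-exclusive backups can be taken during recovery. */
	if (recovery_in_progress && exclusive)
		return BACKUP_ERR_RECOVERY_IN_PROGRESS;

	/* The WAL level is checked only outside recovery. */
	if (!recovery_in_progress && !wal_level_sufficient)
		return BACKUP_ERR_WAL_LEVEL;

	return BACKUP_OK;
}
```

With this shape, pg_start_backup() on a standby still fails as before, while a pg_basebackup connection to the standby passes the check.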

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

standby_online_backup_v12.patchtext/x-diff; charset=US-ASCII; name=standby_online_backup_v12.patchDownload

*** a/doc/src/sgml/ref/pg_basebackup.sgml
--- b/doc/src/sgml/ref/pg_basebackup.sgml
***************
*** 63,68 **** PostgreSQL documentation
--- 63,77 ----
     better from a performance point of view to take only one backup, and copy
     the result.
    </para>
+ 
+   <para>
+    <application>pg_basebackup</application> can make a base backup from
+    not only the master but also the standby. To take a backup from the standby,
+    set up the standby so that it can accept replication connections (that is, set
+    <varname>max_wal_senders</> and <xref linkend="guc-hot-standby">,
+    and configure <link linkend="auth-pg-hba-conf">host-based authentication</link>).
+    You will also need to enable <xref linkend="guc-full-page-writes"> on the master.
+   </para>
   </refsect1>
  
   <refsect1>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 157,162 **** HotStandbyState standbyState = STANDBY_DISABLED;
--- 157,170 ----
  static XLogRecPtr LastRec;
  
  /*
+  * During recovery, lastFullPageWrites keeps track of full_page_writes that
+  * the replayed WAL records indicate. It's initialized with full_page_writes
+  * that the recovery starting checkpoint record indicates, and then updated
+  * each time XLOG_FPW_CHANGE record is replayed.
+  */
+ static bool lastFullPageWrites;
+ 
+ /*
   * Local copy of SharedRecoveryInProgress variable. True actually means "not
   * known, need to check the shared state".
   */
***************
*** 355,360 **** typedef struct XLogCtlInsert
--- 363,378 ----
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
  	/*
+ 	 * fullPageWrites is shared-memory copy of walwriter's or startup
+ 	 * process' full_page_writes. All backends use this flag to determine
+ 	 * whether to write full-page to WAL, instead of using process-local
+ 	 * one. This is required because, when full_page_writes is changed
+ 	 * by SIGHUP, we must WAL-log it before it actually affects
+ 	 * WAL-logging by backends.
+ 	 */
+ 	bool		fullPageWrites;
+ 
+ 	/*
  	 * exclusiveBackup is true if a backup started with pg_start_backup() is
  	 * in progress, and nonExclusiveBackups is a counter indicating the number
  	 * of streaming base backups currently in progress. forcePageWrites is set
***************
*** 460,465 **** typedef struct XLogCtlData
--- 478,489 ----
  	/* Are we requested to pause recovery? */
  	bool		recoveryPause;
  
+ 	/*
+ 	 * lastFpwDisableRecPtr points to the start of the last replayed
+ 	 * XLOG_FPW_CHANGE record that instructs full_page_writes is disabled.
+ 	 */
+ 	XLogRecPtr	lastFpwDisableRecPtr;
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 663,669 **** static void xlog_outrec(StringInfo buf, XLogRecord *record);
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 687,693 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, bool *backupFromStandby);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 709,714 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 733,739 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		fpwChange = (rmid == RM_XLOG_ID && info == XLOG_FPW_CHANGE);
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
***************
*** 756,765 **** begin:;
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
--- 781,790 ----
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, fullPageWrites and forcePageWrites
! 	 * could change under us, but we'll recheck them once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
***************
*** 939,950 **** begin:;
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data. */
  		LWLockRelease(WALInsertLock);
--- 964,975 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data. */
  		LWLockRelease(WALInsertLock);
***************
*** 1189,1194 **** begin:;
--- 1214,1228 ----
  		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
+ 	/*
+ 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
+ 	 * in shared memory before releasing WALInsertLock. This ensures that
+ 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
+ 	 * by this change of full_page_writes.
+ 	 */
+ 	if (fpwChange)
+ 		Insert->fullPageWrites = fullPageWrites;
+ 
  	LWLockRelease(WALInsertLock);
  
  	if (updrqst)
***************
*** 5147,5152 **** BootStrapXLOG(void)
--- 5181,5187 ----
  	checkPoint.redo.xlogid = 0;
  	checkPoint.redo.xrecoff = XLogSegSize + SizeOfXLogLongPHD;
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = fullPageWrites;
  	checkPoint.nextXidEpoch = 0;
  	checkPoint.nextXid = FirstNormalTransactionId;
  	checkPoint.nextOid = FirstBootstrapObjectId;
***************
*** 5961,5966 **** StartupXLOG(void)
--- 5996,6003 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	bool		backupFromStandby = false;
+ 	DBState	save_state;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6094,6100 **** StartupXLOG(void)
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6131,6138 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
! 						  &backupFromStandby))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6210,6215 **** StartupXLOG(void)
--- 6248,6255 ----
  	 */
  	ThisTimeLineID = checkPoint.ThisTimeLineID;
  
+ 	lastFullPageWrites = checkPoint.fullPageWrites;
+ 
  	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
***************
*** 6250,6255 **** StartupXLOG(void)
--- 6290,6296 ----
  		 * pg_control with any minimum recovery stop point obtained from a
  		 * backup history file.
  		 */
+ 		save_state = ControlFile->state;
  		if (InArchiveRecovery)
  			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
  		else
***************
*** 6270,6281 **** StartupXLOG(void)
  		}
  
  		/*
! 		 * set backupStartPoint if we're starting recovery from a base backup
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
--- 6311,6339 ----
  		}
  
  		/*
! 		 * Set backupStartPoint if we're starting recovery from a base backup.
! 		 *
! 		 * Set backupEndPoint and use minRecoveryPoint as the backup end location
! 		 * if we're starting recovery from a base backup which was taken from
! 		 * the standby. In this case, the database system status in pg_control must
! 		 * indicate DB_IN_ARCHIVE_RECOVERY. If not, which means that backup
! 		 * is corrupted, so we cancel recovery.
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
+ 
+ 			if (backupFromStandby)
+ 			{
+ 				if (save_state != DB_IN_ARCHIVE_RECOVERY)
+ 					ereport(FATAL,
+ 							(errmsg("database system status mismatches between "
+ 									"pg_control and backup_label"),
+ 							 errhint("This means that the backup is corrupted and you will "
+ 									 "have to use another backup for recovery.")));
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 			}
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
***************
*** 6564,6569 **** StartupXLOG(void)
--- 6622,6648 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 					XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 				{
+ 					/*
+ 					 * We have reached the end of base backup, the point where
+ 					 * the minimum recovery point in pg_control indicates.
+ 					 * The data on disk is now consistent. Reset backupStartPoint
+ 					 * and backupEndPoint.
+ 					 */
+ 					elog(DEBUG1, "end of backup reached");
+ 
+ 					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 					MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 					MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 					ControlFile->backupEndRequired = false;
+ 					UpdateControlFile();
+ 
+ 					LWLockRelease(ControlFileLock);
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 6763,6768 **** StartupXLOG(void)
--- 6842,6857 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * Update full_page_writes in shared memory and write an
+ 	 * XLOG_FPW_CHANGE record before resource manager writes cleanup
+ 	 * WAL records or checkpoint record is written.
+ 	 */
+ 	Insert->fullPageWrites = lastFullPageWrites;
+ 	LocalSetXLogInsertAllowed();
+ 	UpdateFullPageWrites();
+ 	LocalXLogInsertAllowed = -1;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 7644,7649 **** CreateCheckPoint(int flags)
--- 7733,7739 ----
  		LocalSetXLogInsertAllowed();
  
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = Insert->fullPageWrites;
  
  	/*
  	 * Compute new REDO record ptr = location of next XLOG record.
***************
*** 8359,8364 **** XLogReportParameters(void)
--- 8449,8496 ----
  }
  
  /*
+  * Update full_page_writes in shared memory, and write an
+  * XLOG_FPW_CHANGE record if necessary.
+  */
+ void
+ UpdateFullPageWrites(void)
+ {
+ 	XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 
+ 	/*
+ 	 * Do nothing if full_page_writes has not been changed.
+ 	 *
+ 	 * It's safe to check the shared full_page_writes without the lock,
+ 	 * because we can guarantee that there is no concurrently running
+ 	 * process which can update it.
+ 	 */
+ 	if (fullPageWrites == Insert->fullPageWrites)
+ 		return;
+ 
+ 	/*
+ 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
+ 	 * track of full_page_writes during archive recovery, if required.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 	{
+ 		XLogRecData	rdata;
+ 
+ 		rdata.data = (char *) (&fullPageWrites);
+ 		rdata.len = sizeof(bool);
+ 		rdata.buffer = InvalidBuffer;
+ 		rdata.next = NULL;
+ 
+ 		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
+ 	}
+ 	else
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		Insert->fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ }
+ 
+ /*
   * XLOG resource manager's routines
   *
   * Definitions of info values are in include/catalog/pg_control.h, though
***************
*** 8402,8408 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8534,8541 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8571,8576 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 8704,8733 ----
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, XLogRecGetData(record), sizeof(bool));
+ 
+ 		/*
+ 		 * Update the LSN of the last replayed XLOG_FPW_CHANGE record
+ 		 * so that do_pg_start_backup() and do_pg_stop_backup() can check
+ 		 * whether full_page_writes has been disabled during online backup.
+ 		 */
+ 		if (!fpw)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			if (XLByteLT(xlogctl->lastFpwDisableRecPtr, ReadRecPtr))
+ 				xlogctl->lastFpwDisableRecPtr = ReadRecPtr;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
+ 		/* Keep track of full_page_writes */
+ 		lastFullPageWrites = fpw;
+ 	}
  }
  
  void
***************
*** 8584,8593 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
--- 8741,8751 ----
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
+ 						 checkpoint->fullPageWrites ? "true" : "false",
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
***************
*** 8652,8657 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 8810,8822 ----
  						 xlrec.max_locks_per_xact,
  						 wal_level_str);
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, rec, sizeof(bool));
+ 		appendStringInfo(buf, "full_page_writes: %s", fpw ? "true" : "false");
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
***************
*** 8837,8842 **** XLogRecPtr
--- 9002,9008 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8848,8865 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	if (RecoveryInProgress())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9014,9040 ----
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
+ 	recovery_in_progress = RecoveryInProgress();
+ 
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	/*
! 	 * Currently only non-exclusive backup can be taken during recovery.
! 	 */
! 	if (recovery_in_progress && exclusive)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level. Because, if WAL level
! 	 * is not sufficient, it's impossible to get here during recovery.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8885,8890 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9060,9068 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
+ 	 * Note that forcePageWrites has no effect during an online backup from
+ 	 * the standby.
+ 	 *
  	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
***************
*** 8927,8943 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  		 * Therefore, if a WAL archiver (such as pglesslog) is trying to
  		 * compress out removable backup blocks, it won't remove any that
  		 * occur after this point.
  		 */
! 		RequestXLogSwitch();
  
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
--- 9105,9136 ----
  		 * Therefore, if a WAL archiver (such as pglesslog) is trying to
  		 * compress out removable backup blocks, it won't remove any that
  		 * occur after this point.
+ 		 *
+ 		 * During recovery, we skip forcing XLOG file switch, which means that
+ 		 * the backup taken during recovery is not available for the special
+ 		 * recovery case described above.
  		 */
! 		if (!recovery_in_progress)
! 			RequestXLogSwitch();
  
  		do
  		{
+ 			bool		checkpointfpw;
+ 
  			/*
! 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
+ 			 * During recovery, establish a restartpoint if possible. We use the last
+ 			 * restartpoint as the backup starting checkpoint. This means that two
+ 			 * successive backup runs can have same checkpoint positions.
+ 			 *
+ 			 * Since the fact that we are executing do_pg_start_backup() during
+ 			 * recovery means that checkpointer is running, we can use
+ 			 * RequestCheckpoint() to establish a restartpoint.
+ 			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
***************
*** 8953,8960 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9146,9187 ----
  			LWLockAcquire(ControlFileLock, LW_SHARED);
  			checkpointloc = ControlFile->checkPoint;
  			startpoint = ControlFile->checkPointCopy.redo;
+ 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
  			LWLockRelease(ControlFileLock);
  
+ 			if (recovery_in_progress)
+ 			{
+ 				/* use volatile pointer to prevent code rearrangement */
+ 				volatile XLogCtlData *xlogctl = XLogCtl;
+ 				XLogRecPtr		recptr;
+ 
+ 				/*
+ 				 * Check to see if all WAL replayed during online backup (i.e.,
+ 				 * since last restartpoint used as backup starting checkpoint)
+ 				 * contain full-page writes.
+ 				 */
+ 				SpinLockAcquire(&xlogctl->info_lck);
+ 				recptr = xlogctl->lastFpwDisableRecPtr;
+ 				SpinLockRelease(&xlogctl->info_lck);
+ 
+ 				if (!checkpointfpw || XLByteLE(startpoint, recptr))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 							 errmsg("WAL generated with full_page_writes=off was replayed "
+ 									"since last restartpoint"),
+ 							 errhint("Enable full_page_writes and run CHECKPOINT on the master, "
+ 									 "and then try an online backup again.")));
+ 
+ 				/*
+ 				 * During recovery, since we don't use the end-of-backup WAL
+ 				 * record and don't write the backup history file, the starting WAL
+ 				 * location doesn't need to be unique. This means that two base
+ 				 * backups started at the same time might use the same checkpoint
+ 				 * as starting locations.
+ 				 */
+ 				gotUniqueStartpoint = true;
+ 			}
+ 
  			/*
  			 * If two base backups are started at the same time (in WAL sender
  			 * processes), we need to make sure that they use different
***************
*** 8994,8999 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9221,9228 ----
  						 checkpointloc.xlogid, checkpointloc.xrecoff);
  		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  						 exclusive ? "pg_start_backup" : "streamed");
+ 		appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
+ 						 recovery_in_progress ? "standby" : "master");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
***************
*** 9088,9093 **** XLogRecPtr
--- 9317,9323 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9098,9103 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9328,9334 ----
  	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
+ 	char		backupfrom[20];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
***************
*** 9107,9125 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	if (RecoveryInProgress())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9338,9366 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	char	   *ptr;
+ 
+ 	recovery_in_progress = RecoveryInProgress();
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	/*
! 	 * Currently only non-exclusive backup can be taken during recovery.
! 	 */
! 	if (recovery_in_progress && exclusive)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	/*
! 	 * During recovery, we don't need to check the WAL level, because if the
! 	 * WAL level is not sufficient, it's impossible to get here during recovery.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9210,9215 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9451,9526 ----
  	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
  
  	/*
+ 	 * Parse the BACKUP FROM line. If we are taking an online backup from
+ 	 * the standby, we confirm that the standby has not been promoted
+ 	 * during the backup.
+ 	 */
+ 	ptr = strstr(remaining, "BACKUP FROM:");
+ 	if (!ptr || sscanf(ptr, "BACKUP FROM: %19s\n", backupfrom) != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	if (strcmp(backupfrom, "standby") == 0 && !recovery_in_progress)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("online backup from the standby was canceled because "
+ 						"the standby had been promoted during the backup"),
+ 				 errhint("The database backup will not be usable.")));
+ 
+ 	/*
+ 	 * During recovery, we don't write an end-of-backup record. We can
+ 	 * assume that pg_control was backed up last, so its minimum recovery
+ 	 * point can be used as the backup end location. Even without an
+ 	 * end-of-backup record, we can then check correctly whether we've
+ 	 * reached the end of backup when starting recovery from this backup.
+ 	 *
+ 	 * We don't force a switch to new WAL file and wait for all the required
+ 	 * files to be archived. This is okay if we use the backup to start
+ 	 * the standby. But, if it's for an archive recovery, to ensure all the
+ 	 * required files are available, a user should wait for them to be archived,
+ 	 * or include them into the backup.
+ 	 *
+ 	 * We return the current minimum recovery point as the backup end
+ 	 * location. Note that it would be bigger than the exact backup end
+ 	 * location if the minimum recovery point has been updated since the
+ 	 * backup of pg_control. This is harmless for current uses.
+ 	 *
+ 	 * XXX currently a backup history file is for informational and debug
+ 	 * purposes only. It's not essential for an online backup. Furthermore,
+ 	 * even if it's created, it will not be archived during recovery because
+ 	 * an archiver is not invoked. So it doesn't seem worthwhile to write
+ 	 * a backup history file during recovery.
+ 	 */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		XLogRecPtr	recptr;
+ 
+ 		/*
+ 		 * Check to see if all WAL replayed during online backup contain
+ 		 * full-page writes.
+ 		 */
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		recptr = xlogctl->lastFpwDisableRecPtr;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, recptr))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("WAL generated with full_page_writes=off was replayed "
+ 							"during online backup"),
+ 					 errhint("Enable full_page_writes and run CHECKPOINT on the master, "
+ 							 "and then try an online backup again.")));
+ 
+ 		LWLockAcquire(ControlFileLock, LW_SHARED);
+ 		stoppoint = ControlFile->minRecoveryPoint;
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		return stoppoint;
+ 	}
+ 
+ 	/*
  	 * Write the backup-end xlog record
  	 */
  	rdata.data = (char *) (&startpoint);
***************
*** 9454,9471 **** GetXLogWriteRecPtr(void)
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
  	/*
  	 * See if label file is present
--- 9765,9786 ----
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE. If this backup_label
!  * was created during recovery, *backupFromStandby is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
! 				  bool *backupFromStandby)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		backupfrom[20];
  
  	*backupEndRequired = false;
+ 	*backupFromStandby = false;
  
  	/*
  	 * See if label file is present
***************
*** 9499,9514 **** read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD line is new in 9.1. We can't restore from an older backup
! 	 * anyway, but since the information on it is not strictly required, don't
! 	 * error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
--- 9814,9835 ----
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD and BACKUP FROM lines are new in 9.2. We can't
! 	 * restore from an older backup anyway, but since the information on it
! 	 * is not strictly required, don't error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
+ 	if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)
+ 	{
+ 		if (strcmp(backupfrom, "standby") == 0)
+ 			*backupFromStandby = true;
+ 	}
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 3067,3074 **** PostmasterStateMachine(void)
  		else
  		{
  			/*
! 			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since a backup can only be taken during normal
  			 * running (and not, for example, while running under Hot Standby)
  			 * it only makes sense to do this if we reached normal running. If
  			 * we're still in recovery, the backup file is one we're
--- 3067,3074 ----
  		else
  		{
  			/*
! 			 * Terminate exclusive backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since an exclusive backup can only be taken during normal
  			 * running (and not, for example, while running under Hot Standby)
  			 * it only makes sense to do this if we reached normal running. If
  			 * we're still in recovery, the backup file is one we're
*** a/src/backend/postmaster/walwriter.c
--- b/src/backend/postmaster/walwriter.c
***************
*** 218,223 **** WalWriterMain(void)
--- 218,230 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * There is a race condition: full_page_writes might have been changed
+ 	 * by SIGHUP since the startup process had updated it in shared memory.
+ 	 * To handle this case, we always update shared full_page_writes here.
+ 	 */
+ 	UpdateFullPageWrites();
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 238,243 **** WalWriterMain(void)
--- 245,256 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * If full_page_writes has been changed by SIGHUP, we update it
+ 			 * in shared memory and write an XLOG_FPW_CHANGE record.
+ 			 */
+ 			UpdateFullPageWrites();
  		}
  		if (shutdown_requested)
  		{
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 180,185 **** perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 180,201 ----
  					ti->path == NULL ? 1 : strlen(ti->path),
  					false);
  
+ 			/* In the main tar, include pg_control last. */
+ 			if (ti->path == NULL)
+ 			{
+ 				struct stat statbuf;
+ 
+ 				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
+ 				{
+ 					ereport(ERROR,
+ 							(errcode_for_file_access(),
+ 							 errmsg("could not stat control file \"%s\": %m",
+ 									XLOG_CONTROL_FILE)));
+ 				}
+ 
+ 				sendFile(XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf);
+ 			}
+ 
  			/*
  			 * If we're including WAL, and this is the main data directory we
  			 * don't terminate the tar stream here. Instead, we will append
***************
*** 361,371 **** SendBaseBackup(BaseBackupCmd *cmd)
  	MemoryContext old_context;
  	basebackup_options opt;
  
- 	if (am_cascading_walsender)
- 		ereport(FATAL,
- 				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
- 				 errmsg("recovery is still in progress, can't accept WAL streaming connections for backup")));
- 
  	parse_basebackup_options(cmd->options, &opt);
  
  	backup_context = AllocSetContextCreate(CurrentMemoryContext,
--- 377,382 ----
***************
*** 609,614 **** sendDir(char *path, int basepathlen, bool sizeonly)
--- 620,633 ----
  			strcmp(pathbuf, "./postmaster.opts") == 0)
  			continue;
  
+ 		/* Skip recovery.conf in the data directory */
+ 		if (strcmp(pathbuf, "./recovery.conf") == 0)
+ 			continue;
+ 
+ 		/* Skip pg_control here to back up it last */
+ 		if (strcmp(pathbuf, "./global/pg_control") == 0)
+ 			continue;
+ 
  		if (lstat(pathbuf, &statbuf) != 0)
  		{
  			if (errno != ENOENT)
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 130,136 **** extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
  extern bool synchronize_seqscans;
- extern bool fullPageWrites;
  extern int	ssl_renegotiation_limit;
  extern char *SSLCipherSuites;
  
--- 130,135 ----
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 209,214 **** main(int argc, char *argv[])
--- 209,216 ----
  		   ControlFile.checkPointCopy.redo.xrecoff);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 232,237 **** main(int argc, char *argv[])
--- 234,242 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
  	printf(_("Current wal_level setting:            %s\n"),
*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 489,494 **** GuessControlValues(void)
--- 489,495 ----
  	ControlFile.checkPointCopy.redo.xlogid = 0;
  	ControlFile.checkPointCopy.redo.xrecoff = SizeOfXLogLongPHD;
  	ControlFile.checkPointCopy.ThisTimeLineID = 1;
+ 	ControlFile.checkPointCopy.fullPageWrites = false;
  	ControlFile.checkPointCopy.nextXidEpoch = 0;
  	ControlFile.checkPointCopy.nextXid = FirstNormalTransactionId;
  	ControlFile.checkPointCopy.nextOid = FirstBootstrapObjectId;
***************
*** 503,509 **** GuessControlValues(void)
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 504,510 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 569,574 **** PrintControlValues(bool guessed)
--- 570,577 ----
  		   sysident_str);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 637,642 **** RewriteControlFile(void)
--- 640,647 ----
  	ControlFile.minRecoveryPoint.xrecoff = 0;
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
  
  	/*
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 192,197 **** extern int	XLogArchiveTimeout;
--- 192,198 ----
  extern bool XLogArchiveMode;
  extern char *XLogArchiveCommand;
  extern bool EnableHotStandby;
+ extern bool fullPageWrites;
  extern bool log_checkpoints;
  
  /* WAL levels */
***************
*** 307,312 **** extern void CreateCheckPoint(int flags);
--- 308,314 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void UpdateFullPageWrites(void);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 21,27 ****
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	921
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
--- 21,27 ----
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	922
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
***************
*** 33,38 **** typedef struct CheckPoint
--- 33,39 ----
  	XLogRecPtr	redo;			/* next RecPtr available when we began to
  								 * create CheckPoint (i.e. REDO start point) */
  	TimeLineID	ThisTimeLineID; /* current TLI */
+ 	bool			fullPageWrites;	/* current full_page_writes */
  	uint32		nextXidEpoch;	/* higher-order bits of nextXid */
  	TransactionId nextXid;		/* next free XID */
  	Oid			nextOid;		/* next free OID */
***************
*** 60,65 **** typedef struct CheckPoint
--- 61,67 ----
  #define XLOG_BACKUP_END					0x50
  #define XLOG_PARAMETER_CHANGE			0x60
  #define XLOG_RESTORE_POINT				0x70
+ #define XLOG_FPW_CHANGE				0x80
  
  
  /*
***************
*** 138,143 **** typedef struct ControlFileData
--- 140,151 ----
  	 * record, to make sure the end-of-backup record corresponds the base
  	 * backup we're recovering from.
  	 *
+ 	 * backupEndPoint is the backup end location, if we are recovering from
+ 	 * an online backup which was taken from the standby and haven't reached
+ 	 * the end of backup yet. It is initialized to the minimum recovery point
+ 	 * in pg_control which was backed up last. It is reset to zero when
+ 	 * the end of backup is reached, and we mustn't start up before that.
+ 	 *
  	 * If backupEndRequired is true, we know for sure that we're restoring
  	 * from a backup, and must see a backup-end record before we can safely
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
***************
*** 146,151 **** typedef struct ControlFileData
--- 154,160 ----
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
+ 	XLogRecPtr	backupEndPoint;
  	bool		backupEndRequired;
  
  	/*

#72

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Fujii Masao (#71)

1 attachment(s)

Re: Online base backup from the hot-standby

On Fri, Jan 13, 2012 at 5:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

The amount of code changes needed to allow pg_basebackup to make a backup from
the standby seems to be small. So I ended up merging those changes and the
infrastructure patch. WIP patch attached. But I'd be happy to split the patch
again if you want.

Attached is the updated version of the patch. I documented the limitations of
standby-only backup and changed the error messages.
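
One of the documented limitations is that the backup fails if the standby is promoted while the backup is running. The patch detects this by reading the BACKUP FROM line of the backup label at stop time and comparing it with the current recovery state. The following is only a minimal standalone sketch of that idea, not code from the patch: parse_backup_from() and backup_still_valid() are hypothetical helper names, and the sketch assumes the label text is already in memory. It also checks the result of strstr() before calling sscanf(), so a label without a BACKUP FROM line is reported as invalid rather than dereferencing NULL.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): look for the BACKUP FROM
 * line in a backup label and report whether the backup was taken from a
 * standby. Returns false if the line is missing or malformed.
 */
static bool
parse_backup_from(const char *label, bool *from_standby)
{
	char		backupfrom[20];
	const char *ptr = strstr(label, "BACKUP FROM:");

	/* Check for a missing line before handing the pointer to sscanf(). */
	if (ptr == NULL || sscanf(ptr, "BACKUP FROM: %19s", backupfrom) != 1)
		return false;

	*from_standby = (strcmp(backupfrom, "standby") == 0);
	return true;
}

/*
 * Hypothetical helper mirroring the stop-time check: a backup started on
 * a standby is unusable if the server has left recovery (was promoted)
 * by the time the backup stops.
 */
static bool
backup_still_valid(bool from_standby, bool recovery_in_progress)
{
	return !(from_standby && !recovery_in_progress);
}
```

A caller would pass the label contents to parse_backup_from(), treat a false return as "invalid data in backup_label", and then call backup_still_valid() with the current recovery state to decide whether to report the backup as canceled.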

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

standby_online_backup_v13.patch (text/x-diff; charset=US-ASCII)

*** a/doc/src/sgml/ref/pg_basebackup.sgml
--- b/doc/src/sgml/ref/pg_basebackup.sgml
***************
*** 64,69 **** PostgreSQL documentation
--- 64,111 ----
     better from a performance point of view to take only one backup, and copy
     the result.
    </para>
+ 
+   <para>
+    <application>pg_basebackup</application> can make a base backup from
+    not only the master but also the standby. To take a backup from the standby,
+    set up the standby so that it can accept replication connections (that is, set
+    <varname>max_wal_senders</> and <xref linkend="guc-hot-standby">,
+    and configure <link linkend="auth-pg-hba-conf">host-based authentication</link>).
+    You will also need to enable <xref linkend="guc-full-page-writes"> on the master.
+   </para>
+ 
+   <para>
+    Note that there are some limitations in an online backup from the standby:
+ 
+    <itemizedlist>
+     <listitem>
+      <para>
+       The backup history file is not created in the database cluster backed up.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       There is no guarantee that all WAL files required for the backup are archived
+       at the end of backup. If you are planning to use the backup for an archive
+       recovery and want to ensure that all required files are available at that moment,
+       you need to include them into the backup by using <literal>-x</> option.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       If the standby is promoted to the master during online backup, the backup fails.
+      </para>
+     </listitem>
+     <listitem>
+      <para>
+       All WAL records required for the backup must contain sufficient full-page writes,
+       which requires you to enable <varname>full_page_writes</> on the master and
+       not to use a tool such as <application>pg_compresslog</> as
+       <varname>archive_command</> to remove full-page writes from WAL files.
+      </para>
+     </listitem>
+    </itemizedlist>
+   </para>
   </refsect1>
  
   <refsect1>
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 157,162 **** HotStandbyState standbyState = STANDBY_DISABLED;
--- 157,170 ----
  static XLogRecPtr LastRec;
  
  /*
+  * During recovery, lastFullPageWrites keeps track of full_page_writes that
+  * the replayed WAL records indicate. It's initialized with full_page_writes
+  * that the recovery starting checkpoint record indicates, and then updated
+  * each time XLOG_FPW_CHANGE record is replayed.
+  */
+ static bool lastFullPageWrites;
+ 
+ /*
   * Local copy of SharedRecoveryInProgress variable. True actually means "not
   * known, need to check the shared state".
   */
***************
*** 355,360 **** typedef struct XLogCtlInsert
--- 363,378 ----
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  
  	/*
+ 	 * fullPageWrites is shared-memory copy of walwriter's or startup
+ 	 * process' full_page_writes. All backends use this flag to determine
+ 	 * whether to write full-page to WAL, instead of using process-local
+ 	 * one. This is required because, when full_page_writes is changed
+ 	 * by SIGHUP, we must WAL-log it before it actually affects
+ 	 * WAL-logging by backends.
+ 	 */
+ 	bool		fullPageWrites;
+ 
+ 	/*
  	 * exclusiveBackup is true if a backup started with pg_start_backup() is
  	 * in progress, and nonExclusiveBackups is a counter indicating the number
  	 * of streaming base backups currently in progress. forcePageWrites is set
***************
*** 460,465 **** typedef struct XLogCtlData
--- 478,489 ----
  	/* Are we requested to pause recovery? */
  	bool		recoveryPause;
  
+ 	/*
+ 	 * lastFpwDisableRecPtr points to the start of the last replayed
+ 	 * XLOG_FPW_CHANGE record that indicates full_page_writes is disabled.
+ 	 */
+ 	XLogRecPtr	lastFpwDisableRecPtr;
+ 
  	slock_t		info_lck;		/* locks shared variables shown above */
  } XLogCtlData;
  
***************
*** 663,669 **** static void xlog_outrec(StringInfo buf, XLogRecord *record);
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
--- 687,693 ----
  #endif
  static void pg_start_backup_callback(int code, Datum arg);
  static bool read_backup_label(XLogRecPtr *checkPointLoc,
! 				  bool *backupEndRequired, bool *backupFromStandby);
  static void rm_redo_error_callback(void *arg);
  static int	get_sync_bit(int method);
  
***************
*** 709,714 **** XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
--- 733,739 ----
  	bool		updrqst;
  	bool		doPageWrites;
  	bool		isLogSwitch = (rmid == RM_XLOG_ID && info == XLOG_SWITCH);
+ 	bool		fpwChange = (rmid == RM_XLOG_ID && info == XLOG_FPW_CHANGE);
  	uint8		info_orig = info;
  
  	/* cross-check on whether we should be here or not */
***************
*** 756,765 **** begin:;
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, forcePageWrites could change under us,
! 	 * but we'll recheck it once we have the lock.
  	 */
! 	doPageWrites = fullPageWrites || Insert->forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
--- 781,790 ----
  	/*
  	 * Decide if we need to do full-page writes in this XLOG record: true if
  	 * full_page_writes is on or we have a PITR request for it.  Since we
! 	 * don't yet have the insert lock, fullPageWrites and forcePageWrites
! 	 * could change under us, but we'll recheck them once we have the lock.
  	 */
! 	doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
  
  	len = 0;
  	for (rdt = rdata;;)
***************
*** 939,950 **** begin:;
  	}
  
  	/*
! 	 * Also check to see if forcePageWrites was just turned on; if we weren't
! 	 * already doing full-page writes then go back and recompute. (If it was
! 	 * just turned off, we could recompute the record without full pages, but
! 	 * we choose not to bother.)
  	 */
! 	if (Insert->forcePageWrites && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data. */
  		LWLockRelease(WALInsertLock);
--- 964,975 ----
  	}
  
  	/*
! 	 * Also check to see if fullPageWrites or forcePageWrites was just turned on;
! 	 * if we weren't already doing full-page writes then go back and recompute.
! 	 * (If it was just turned off, we could recompute the record without full pages,
! 	 * but we choose not to bother.)
  	 */
! 	if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
  	{
  		/* Oops, must redo it with full-page data. */
  		LWLockRelease(WALInsertLock);
***************
*** 1189,1194 **** begin:;
--- 1214,1228 ----
  		WriteRqst = XLogCtl->xlblocks[curridx];
  	}
  
+ 	/*
+ 	 * If the record is an XLOG_FPW_CHANGE, we update full_page_writes
+ 	 * in shared memory before releasing WALInsertLock. This ensures that
+ 	 * an XLOG_FPW_CHANGE record precedes any WAL record affected
+ 	 * by this change of full_page_writes.
+ 	 */
+ 	if (fpwChange)
+ 		Insert->fullPageWrites = fullPageWrites;
+ 
  	LWLockRelease(WALInsertLock);
  
  	if (updrqst)
***************
*** 5147,5152 **** BootStrapXLOG(void)
--- 5181,5187 ----
  	checkPoint.redo.xlogid = 0;
  	checkPoint.redo.xrecoff = XLogSegSize + SizeOfXLogLongPHD;
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = fullPageWrites;
  	checkPoint.nextXidEpoch = 0;
  	checkPoint.nextXid = FirstNormalTransactionId;
  	checkPoint.nextOid = FirstBootstrapObjectId;
***************
*** 5961,5966 **** StartupXLOG(void)
--- 5996,6003 ----
  	uint32		freespace;
  	TransactionId oldestActiveXID;
  	bool		backupEndRequired = false;
+ 	bool		backupFromStandby = false;
+ 	DBState	save_state;
  
  	/*
  	 * Read control file and check XLOG status looks valid.
***************
*** 6094,6100 **** StartupXLOG(void)
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
--- 6131,6138 ----
  	if (StandbyMode)
  		OwnLatch(&XLogCtl->recoveryWakeupLatch);
  
! 	if (read_backup_label(&checkPointLoc, &backupEndRequired,
! 						  &backupFromStandby))
  	{
  		/*
  		 * When a backup_label file is present, we want to roll forward from
***************
*** 6210,6215 **** StartupXLOG(void)
--- 6248,6255 ----
  	 */
  	ThisTimeLineID = checkPoint.ThisTimeLineID;
  
+ 	lastFullPageWrites = checkPoint.fullPageWrites;
+ 
  	RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
  
  	if (XLByteLT(RecPtr, checkPoint.redo))
***************
*** 6250,6255 **** StartupXLOG(void)
--- 6290,6296 ----
  		 * pg_control with any minimum recovery stop point obtained from a
  		 * backup history file.
  		 */
+ 		save_state = ControlFile->state;
  		if (InArchiveRecovery)
  			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
  		else
***************
*** 6270,6281 **** StartupXLOG(void)
  		}
  
  		/*
! 		 * set backupStartPoint if we're starting recovery from a base backup
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
--- 6311,6338 ----
  		}
  
  		/*
! 		 * Set backupStartPoint if we're starting recovery from a base backup.
! 		 *
! 		 * Set backupEndPoint and use minRecoveryPoint as the backup end location
! 		 * if we're starting recovery from a base backup which was taken from
! 		 * the standby. In this case, the database system status in pg_control must
! 		 * indicate DB_IN_ARCHIVE_RECOVERY. If it does not, the backup
! 		 * is corrupted, so we cancel recovery.
  		 */
  		if (haveBackupLabel)
  		{
  			ControlFile->backupStartPoint = checkPoint.redo;
  			ControlFile->backupEndRequired = backupEndRequired;
+ 
+ 			if (backupFromStandby)
+ 			{
+ 				if (save_state != DB_IN_ARCHIVE_RECOVERY)
+ 					ereport(FATAL,
+ 							(errmsg("backup_label contains inconsistent data with control file"),
+ 							 errhint("This means that the backup is corrupted and you will "
+ 									 "have to use another backup for recovery.")));
+ 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
+ 			}
  		}
  		ControlFile->time = (pg_time_t) time(NULL);
  		/* No need to hold ControlFileLock yet, we aren't up far enough */
***************
*** 6564,6569 **** StartupXLOG(void)
--- 6621,6647 ----
  				/* Pop the error context stack */
  				error_context_stack = errcontext.previous;
  
+ 				if (!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
+ 					XLByteLE(ControlFile->backupEndPoint, EndRecPtr))
+ 				{
+ 					/*
+ 					 * We have reached the end of the base backup, the point that
+ 					 * the minimum recovery point in pg_control indicates.
+ 					 * The data on disk is now consistent. Reset backupStartPoint
+ 					 * and backupEndPoint.
+ 					 */
+ 					elog(DEBUG1, "end of backup reached");
+ 
+ 					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ 
+ 					MemSet(&ControlFile->backupStartPoint, 0, sizeof(XLogRecPtr));
+ 					MemSet(&ControlFile->backupEndPoint, 0, sizeof(XLogRecPtr));
+ 					ControlFile->backupEndRequired = false;
+ 					UpdateControlFile();
+ 
+ 					LWLockRelease(ControlFileLock);
+ 				}
+ 
  				/*
  				 * Update shared recoveryLastRecPtr after this record has been
  				 * replayed.
***************
*** 6763,6768 **** StartupXLOG(void)
--- 6841,6856 ----
  	/* Pre-scan prepared transactions to find out the range of XIDs present */
  	oldestActiveXID = PrescanPreparedTransactions(NULL, NULL);
  
+ 	/*
+ 	 * Update full_page_writes in shared memory and write an
+ 	 * XLOG_FPW_CHANGE record before any resource manager writes cleanup
+ 	 * WAL records or the checkpoint record is written.
+ 	 */
+ 	Insert->fullPageWrites = lastFullPageWrites;
+ 	LocalSetXLogInsertAllowed();
+ 	UpdateFullPageWrites();
+ 	LocalXLogInsertAllowed = -1;
+ 
  	if (InRecovery)
  	{
  		int			rmid;
***************
*** 7644,7649 **** CreateCheckPoint(int flags)
--- 7732,7738 ----
  		LocalSetXLogInsertAllowed();
  
  	checkPoint.ThisTimeLineID = ThisTimeLineID;
+ 	checkPoint.fullPageWrites = Insert->fullPageWrites;
  
  	/*
  	 * Compute new REDO record ptr = location of next XLOG record.
***************
*** 8359,8364 **** XLogReportParameters(void)
--- 8448,8495 ----
  }
  
  /*
+  * Update full_page_writes in shared memory, and write an
+  * XLOG_FPW_CHANGE record if necessary.
+  */
+ void
+ UpdateFullPageWrites(void)
+ {
+ 	XLogCtlInsert *Insert = &XLogCtl->Insert;
+ 
+ 	/*
+ 	 * Do nothing if full_page_writes has not been changed.
+ 	 *
+ 	 * It's safe to check the shared full_page_writes without the lock,
+ 	 * because we can guarantee that there is no concurrently running
+ 	 * process which can update it.
+ 	 */
+ 	if (fullPageWrites == Insert->fullPageWrites)
+ 		return;
+ 
+ 	/*
+ 	 * Write an XLOG_FPW_CHANGE record. This allows us to keep
+ 	 * track of full_page_writes during archive recovery, if required.
+ 	 */
+ 	if (XLogStandbyInfoActive())
+ 	{
+ 		XLogRecData	rdata;
+ 
+ 		rdata.data = (char *) (&fullPageWrites);
+ 		rdata.len = sizeof(bool);
+ 		rdata.buffer = InvalidBuffer;
+ 		rdata.next = NULL;
+ 
+ 		XLogInsert(RM_XLOG_ID, XLOG_FPW_CHANGE, &rdata);
+ 	}
+ 	else
+ 	{
+ 		LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+ 		Insert->fullPageWrites = fullPageWrites;
+ 		LWLockRelease(WALInsertLock);
+ 	}
+ }
+ 
+ /*
   * XLOG resource manager's routines
   *
   * Definitions of info values are in include/catalog/pg_control.h, though
***************
*** 8402,8408 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
--- 8533,8540 ----
  		 * never arrive.
  		 */
  		if (InArchiveRecovery &&
! 			!XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
! 			XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
  			ereport(ERROR,
  					(errmsg("online backup was canceled, recovery cannot continue")));
  
***************
*** 8571,8576 **** xlog_redo(XLogRecPtr lsn, XLogRecord *record)
--- 8703,8732 ----
  		/* Check to see if any changes to max_connections give problems */
  		CheckRequiredParameterValues();
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, XLogRecGetData(record), sizeof(bool));
+ 
+ 		/*
+ 		 * Update the LSN of the last replayed XLOG_FPW_CHANGE record
+ 		 * so that do_pg_start_backup() and do_pg_stop_backup() can check
+ 		 * whether full_page_writes has been disabled during online backup.
+ 		 */
+ 		if (!fpw)
+ 		{
+ 			SpinLockAcquire(&xlogctl->info_lck);
+ 			if (XLByteLT(xlogctl->lastFpwDisableRecPtr, ReadRecPtr))
+ 				xlogctl->lastFpwDisableRecPtr = ReadRecPtr;
+ 			SpinLockRelease(&xlogctl->info_lck);
+ 		}
+ 
+ 		/* Keep track of full_page_writes */
+ 		lastFullPageWrites = fpw;
+ 	}
  }
  
  void
***************
*** 8584,8593 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
--- 8740,8750 ----
  		CheckPoint *checkpoint = (CheckPoint *) rec;
  
  		appendStringInfo(buf, "checkpoint: redo %X/%X; "
! 						 "tli %u; fpw %s; xid %u/%u; oid %u; multi %u; offset %u; "
  						 "oldest xid %u in DB %u; oldest running xid %u; %s",
  						 checkpoint->redo.xlogid, checkpoint->redo.xrecoff,
  						 checkpoint->ThisTimeLineID,
+ 						 checkpoint->fullPageWrites ? "true" : "false",
  						 checkpoint->nextXidEpoch, checkpoint->nextXid,
  						 checkpoint->nextOid,
  						 checkpoint->nextMulti,
***************
*** 8652,8657 **** xlog_desc(StringInfo buf, uint8 xl_info, char *rec)
--- 8809,8821 ----
  						 xlrec.max_locks_per_xact,
  						 wal_level_str);
  	}
+ 	else if (info == XLOG_FPW_CHANGE)
+ 	{
+ 		bool		fpw;
+ 
+ 		memcpy(&fpw, rec, sizeof(bool));
+ 		appendStringInfo(buf, "full_page_writes: %s", fpw ? "true" : "false");
+ 	}
  	else
  		appendStringInfo(buf, "UNKNOWN");
  }
***************
*** 8837,8842 **** XLogRecPtr
--- 9001,9007 ----
  do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8848,8865 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	if (RecoveryInProgress())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9013,9039 ----
  	FILE	   *fp;
  	StringInfoData labelfbuf;
  
+ 	recovery_in_progress = RecoveryInProgress();
+ 
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		   errmsg("must be superuser or replication role to run a backup")));
  
! 	/*
! 	 * Currently only non-exclusive backup can be taken during recovery.
! 	 */
! 	if (recovery_in_progress && exclusive)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level, because if WAL level
! 	 * is not sufficient, it's impossible to get here during recovery.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 8885,8890 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9059,9067 ----
  	 * since we expect that any pages not modified during the backup interval
  	 * must have been correctly captured by the backup.)
  	 *
+ 	 * Note that forcePageWrites has no effect during an online backup from
+ 	 * the standby.
+ 	 *
  	 * We must hold WALInsertLock to change the value of forcePageWrites, to
  	 * ensure adequate interlocking against XLogInsert().
  	 */
***************
*** 8927,8943 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  		 * Therefore, if a WAL archiver (such as pglesslog) is trying to
  		 * compress out removable backup blocks, it won't remove any that
  		 * occur after this point.
  		 */
! 		RequestXLogSwitch();
  
  		do
  		{
  			/*
! 			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
--- 9104,9135 ----
  		 * Therefore, if a WAL archiver (such as pglesslog) is trying to
  		 * compress out removable backup blocks, it won't remove any that
  		 * occur after this point.
+ 		 *
+ 		 * During recovery, we skip forcing XLOG file switch, which means that
+ 		 * the backup taken during recovery is not available for the special
+ 		 * recovery case described above.
  		 */
! 		if (!recovery_in_progress)
! 			RequestXLogSwitch();
  
  		do
  		{
+ 			bool		checkpointfpw;
+ 
  			/*
! 			 * Force a CHECKPOINT.  Aside from being necessary to prevent torn
  			 * page problems, this guarantees that two successive backup runs
  			 * will have different checkpoint positions and hence different
  			 * history file names, even if nothing happened in between.
  			 *
+ 			 * During recovery, establish a restartpoint if possible. We use the last
+ 			 * restartpoint as the backup starting checkpoint. This means that two
+ 			 * successive backup runs can have the same checkpoint position.
+ 			 *
+ 			 * Since the fact that we are executing do_pg_start_backup() during
+ 			 * recovery means that checkpointer is running, we can use
+ 			 * RequestCheckpoint() to establish a restartpoint.
+ 			 *
  			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via
  			 * passing fast = true).  Otherwise this can take awhile.
  			 */
***************
*** 8953,8960 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9145,9187 ----
  			LWLockAcquire(ControlFileLock, LW_SHARED);
  			checkpointloc = ControlFile->checkPoint;
  			startpoint = ControlFile->checkPointCopy.redo;
+ 			checkpointfpw = ControlFile->checkPointCopy.fullPageWrites;
  			LWLockRelease(ControlFileLock);
  
+ 			if (recovery_in_progress)
+ 			{
+ 				/* use volatile pointer to prevent code rearrangement */
+ 				volatile XLogCtlData *xlogctl = XLogCtl;
+ 				XLogRecPtr		recptr;
+ 
+ 				/*
+ 				 * Check to see if all WAL replayed during online backup (i.e.,
+ 				 * since last restartpoint used as backup starting checkpoint)
+ 				 * contain full-page writes.
+ 				 */
+ 				SpinLockAcquire(&xlogctl->info_lck);
+ 				recptr = xlogctl->lastFpwDisableRecPtr;
+ 				SpinLockRelease(&xlogctl->info_lck);
+ 
+ 				if (!checkpointfpw || XLByteLE(startpoint, recptr))
+ 					ereport(ERROR,
+ 							(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 							 errmsg("WAL generated with full_page_writes=off was replayed "
+ 									"since last restartpoint"),
+ 							 errhint("This means that the backup being taken has gotten corrupted. "
+ 									 "Enable full_page_writes and run CHECKPOINT on the master, "
+ 									 "and then try an online backup again.")));
+ 
+ 				/*
+ 				 * During recovery, since we don't use the end-of-backup WAL
+ 				 * record and don't write the backup history file, the starting WAL
+ 				 * location doesn't need to be unique. This means that two base
+ 				 * backups started at the same time might use the same checkpoint
+ 				 * as starting locations.
+ 				 */
+ 				gotUniqueStartpoint = true;
+ 			}
+ 
  			/*
  			 * If two base backups are started at the same time (in WAL sender
  			 * processes), we need to make sure that they use different
***************
*** 8994,8999 **** do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
--- 9221,9228 ----
  						 checkpointloc.xlogid, checkpointloc.xrecoff);
  		appendStringInfo(&labelfbuf, "BACKUP METHOD: %s\n",
  						 exclusive ? "pg_start_backup" : "streamed");
+ 		appendStringInfo(&labelfbuf, "BACKUP FROM: %s\n",
+ 						 recovery_in_progress ? "standby" : "master");
  		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
  		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
***************
*** 9088,9093 **** XLogRecPtr
--- 9317,9323 ----
  do_pg_stop_backup(char *labelfile, bool waitforarchive)
  {
  	bool		exclusive = (labelfile == NULL);
+ 	bool		recovery_in_progress = false;
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 9098,9103 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9328,9334 ----
  	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
+ 	char		backupfrom[20];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
***************
*** 9107,9125 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	if (RecoveryInProgress())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	if (!XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
--- 9338,9366 ----
  	int			waits = 0;
  	bool		reported_waiting = false;
  	char	   *remaining;
+ 	char	   *ptr;
+ 
+ 	recovery_in_progress = RecoveryInProgress();
  
  	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
  		 (errmsg("must be superuser or replication role to run a backup"))));
  
! 	/*
! 	 * Currently only non-exclusive backup can be taken during recovery.
! 	 */
! 	if (recovery_in_progress && exclusive)
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("recovery is in progress"),
  				 errhint("WAL control functions cannot be executed during recovery.")));
  
! 	/*
! 	 * During recovery, we don't need to check WAL level, because if WAL level
! 	 * is not sufficient, it's impossible to get here during recovery.
! 	 */
! 	if (!recovery_in_progress && !XLogIsNeeded())
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  			  errmsg("WAL level not sufficient for making an online backup"),
***************
*** 9210,9215 **** do_pg_stop_backup(char *labelfile, bool waitforarchive)
--- 9451,9527 ----
  	remaining = strchr(labelfile, '\n') + 1;	/* %n is not portable enough */
  
  	/*
+ 	 * Parse the BACKUP FROM line. If we are taking an online backup from
+ 	 * the standby, we confirm that the standby has not been promoted
+ 	 * during the backup.
+ 	 */
+ 	ptr = strstr(remaining, "BACKUP FROM:");
+ 	if (!ptr || sscanf(ptr, "BACKUP FROM: %19s\n", backupfrom) != 1)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	if (strcmp(backupfrom, "standby") == 0 && !recovery_in_progress)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 				 errmsg("the standby was promoted during online backup"),
+ 				 errhint("This means that the backup being taken has gotten corrupted. "
+ 						 "Try an online backup again.")));
+ 
+ 	/*
+ 	 * During recovery, we don't write an end-of-backup record. We
+ 	 * assume that pg_control was backed up last, so its minimum recovery
+ 	 * point can be used as the backup end location. Even without an
+ 	 * end-of-backup record, we can then check correctly whether we've
+ 	 * reached the end of backup when starting recovery from this backup.
+ 	 *
+ 	 * We don't force a switch to a new WAL file or wait for all the required
+ 	 * files to be archived. This is okay if we use the backup to start
+ 	 * the standby. But if it's for an archive recovery, to ensure all the
+ 	 * required files are available, a user should wait for them to be archived,
+ 	 * or include them in the backup.
+ 	 *
+ 	 * We return the current minimum recovery point as the backup end
+ 	 * location. Note that it would be bigger than the exact backup end
+ 	 * location if the minimum recovery point has been updated since the
+ 	 * backup of pg_control. This is harmless for current uses.
+ 	 *
+ 	 * XXX currently a backup history file is for informational and debug
+ 	 * purposes only. It's not essential for an online backup. Furthermore,
+ 	 * even if it's created, it will not be archived during recovery because
+ 	 * an archiver is not invoked. So it doesn't seem worthwhile to write
+ 	 * a backup history file during recovery.
+ 	 */
+ 	if (recovery_in_progress)
+ 	{
+ 		/* use volatile pointer to prevent code rearrangement */
+ 		volatile XLogCtlData *xlogctl = XLogCtl;
+ 		XLogRecPtr	recptr;
+ 
+ 		/*
+ 		 * Check to see if all WAL replayed during online backup contain
+ 		 * full-page writes.
+ 		 */
+ 		SpinLockAcquire(&xlogctl->info_lck);
+ 		recptr = xlogctl->lastFpwDisableRecPtr;
+ 		SpinLockRelease(&xlogctl->info_lck);
+ 
+ 		if (XLByteLE(startpoint, recptr))
+ 			ereport(ERROR,
+ 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ 					 errmsg("WAL generated with full_page_writes=off was replayed "
+ 							"during online backup"),
+ 					 errhint("This means that the backup being taken has gotten corrupted. "
+ 							 "Enable full_page_writes and run CHECKPOINT on the master, "
+ 							 "and then try an online backup again.")));
+ 
+ 		LWLockAcquire(ControlFileLock, LW_SHARED);
+ 		stoppoint = ControlFile->minRecoveryPoint;
+ 		LWLockRelease(ControlFileLock);
+ 
+ 		return stoppoint;
+ 	}
+ 
+ 	/*
  	 * Write the backup-end xlog record
  	 */
  	rdata.data = (char *) (&startpoint);
***************
*** 9454,9471 **** GetXLogWriteRecPtr(void)
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
  
  	*backupEndRequired = false;
  
  	/*
  	 * See if label file is present
--- 9766,9787 ----
   * Returns TRUE if a backup_label was found (and fills the checkpoint
   * location and its REDO location into *checkPointLoc and RedoStartLSN,
   * respectively); returns FALSE if not. If this backup_label came from a
!  * streamed backup, *backupEndRequired is set to TRUE. If this backup_label
!  * was created during recovery, *backupFromStandby is set to TRUE.
   */
  static bool
! read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired,
! 				  bool *backupFromStandby)
  {
  	char		startxlogfilename[MAXFNAMELEN];
  	TimeLineID	tli;
  	FILE	   *lfp;
  	char		ch;
  	char		backuptype[20];
+ 	char		backupfrom[20];
  
  	*backupEndRequired = false;
+ 	*backupFromStandby = false;
  
  	/*
  	 * See if label file is present
***************
*** 9499,9514 **** read_backup_label(XLogRecPtr *checkPointLoc, bool *backupEndRequired)
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD line is new in 9.1. We can't restore from an older backup
! 	 * anyway, but since the information on it is not strictly required, don't
! 	 * error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
--- 9815,9836 ----
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  	/*
! 	 * BACKUP METHOD and BACKUP FROM lines are new in 9.2. We can't
! 	 * restore from an older backup anyway, but since the information on it
! 	 * is not strictly required, don't error out if it's missing for some reason.
  	 */
! 	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
  	{
  		if (strcmp(backuptype, "streamed") == 0)
  			*backupEndRequired = true;
  	}
  
+ 	if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)
+ 	{
+ 		if (strcmp(backupfrom, "standby") == 0)
+ 			*backupFromStandby = true;
+ 	}
+ 
  	if (ferror(lfp) || FreeFile(lfp))
  		ereport(FATAL,
  				(errcode_for_file_access(),
*** a/src/backend/postmaster/postmaster.c
--- b/src/backend/postmaster/postmaster.c
***************
*** 3067,3074 **** PostmasterStateMachine(void)
  		else
  		{
  			/*
! 			 * Terminate backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since a backup can only be taken during normal
  			 * running (and not, for example, while running under Hot Standby)
  			 * it only makes sense to do this if we reached normal running. If
  			 * we're still in recovery, the backup file is one we're
--- 3067,3074 ----
  		else
  		{
  			/*
! 			 * Terminate exclusive backup mode to avoid recovery after a clean fast
! 			 * shutdown.  Since an exclusive backup can only be taken during normal
  			 * running (and not, for example, while running under Hot Standby)
  			 * it only makes sense to do this if we reached normal running. If
  			 * we're still in recovery, the backup file is one we're
*** a/src/backend/postmaster/walwriter.c
--- b/src/backend/postmaster/walwriter.c
***************
*** 218,223 **** WalWriterMain(void)
--- 218,230 ----
  	PG_SETMASK(&UnBlockSig);
  
  	/*
+ 	 * There is a race condition: full_page_writes might have been changed
+ 	 * by SIGHUP since the startup process updated it in shared memory.
+ 	 * To handle this case, we always update shared full_page_writes here.
+ 	 */
+ 	UpdateFullPageWrites();
+ 
+ 	/*
  	 * Loop forever
  	 */
  	for (;;)
***************
*** 238,243 **** WalWriterMain(void)
--- 245,256 ----
  		{
  			got_SIGHUP = false;
  			ProcessConfigFile(PGC_SIGHUP);
+ 
+ 			/*
+ 			 * If full_page_writes has been changed by SIGHUP, we update it
+ 			 * in shared memory and write an XLOG_FPW_CHANGE record.
+ 			 */
+ 			UpdateFullPageWrites();
  		}
  		if (shutdown_requested)
  		{
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 180,185 **** perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
--- 180,201 ----
  					ti->path == NULL ? 1 : strlen(ti->path),
  					false);
  
+ 			/* In the main tar, include pg_control last. */
+ 			if (ti->path == NULL)
+ 			{
+ 				struct stat statbuf;
+ 
+ 				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
+ 				{
+ 					ereport(ERROR,
+ 							(errcode_for_file_access(),
+ 							 errmsg("could not stat control file \"%s\": %m",
+ 									XLOG_CONTROL_FILE)));
+ 				}
+ 
+ 				sendFile(XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf);
+ 			}
+ 
  			/*
  			 * If we're including WAL, and this is the main data directory we
  			 * don't terminate the tar stream here. Instead, we will append
***************
*** 361,371 **** SendBaseBackup(BaseBackupCmd *cmd)
  	MemoryContext old_context;
  	basebackup_options opt;
  
- 	if (am_cascading_walsender)
- 		ereport(FATAL,
- 				(errcode(ERRCODE_CANNOT_CONNECT_NOW),
- 				 errmsg("recovery is still in progress, can't accept WAL streaming connections for backup")));
- 
  	parse_basebackup_options(cmd->options, &opt);
  
  	backup_context = AllocSetContextCreate(CurrentMemoryContext,
--- 377,382 ----
***************
*** 609,614 **** sendDir(char *path, int basepathlen, bool sizeonly)
--- 620,629 ----
  			strcmp(pathbuf, "./postmaster.opts") == 0)
  			continue;
  
+ 		/* Skip pg_control here to back it up last */
+ 		if (strcmp(pathbuf, "./global/pg_control") == 0)
+ 			continue;
+ 
  		if (lstat(pathbuf, &statbuf) != 0)
  		{
  			if (errno != ENOENT)
*** a/src/backend/utils/misc/guc.c
--- b/src/backend/utils/misc/guc.c
***************
*** 130,136 **** extern int	CommitSiblings;
  extern char *default_tablespace;
  extern char *temp_tablespaces;
  extern bool synchronize_seqscans;
- extern bool fullPageWrites;
  extern int	ssl_renegotiation_limit;
  extern char *SSLCipherSuites;
  
--- 130,135 ----
*** a/src/bin/pg_controldata/pg_controldata.c
--- b/src/bin/pg_controldata/pg_controldata.c
***************
*** 209,214 **** main(int argc, char *argv[])
--- 209,216 ----
  		   ControlFile.checkPointCopy.redo.xrecoff);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 232,237 **** main(int argc, char *argv[])
--- 234,242 ----
  	printf(_("Backup start location:                %X/%X\n"),
  		   ControlFile.backupStartPoint.xlogid,
  		   ControlFile.backupStartPoint.xrecoff);
+ 	printf(_("Backup end location:                  %X/%X\n"),
+ 		   ControlFile.backupEndPoint.xlogid,
+ 		   ControlFile.backupEndPoint.xrecoff);
  	printf(_("End-of-backup record required:        %s\n"),
  		   ControlFile.backupEndRequired ? _("yes") : _("no"));
  	printf(_("Current wal_level setting:            %s\n"),
*** a/src/bin/pg_resetxlog/pg_resetxlog.c
--- b/src/bin/pg_resetxlog/pg_resetxlog.c
***************
*** 489,494 **** GuessControlValues(void)
--- 489,495 ----
  	ControlFile.checkPointCopy.redo.xlogid = 0;
  	ControlFile.checkPointCopy.redo.xrecoff = SizeOfXLogLongPHD;
  	ControlFile.checkPointCopy.ThisTimeLineID = 1;
+ 	ControlFile.checkPointCopy.fullPageWrites = false;
  	ControlFile.checkPointCopy.nextXidEpoch = 0;
  	ControlFile.checkPointCopy.nextXid = FirstNormalTransactionId;
  	ControlFile.checkPointCopy.nextOid = FirstBootstrapObjectId;
***************
*** 503,509 **** GuessControlValues(void)
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint and backupStartPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
--- 504,510 ----
  	ControlFile.time = (pg_time_t) time(NULL);
  	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
  
! 	/* minRecoveryPoint, backupStartPoint and backupEndPoint can be left zero */
  
  	ControlFile.wal_level = WAL_LEVEL_MINIMAL;
  	ControlFile.MaxConnections = 100;
***************
*** 569,574 **** PrintControlValues(bool guessed)
--- 570,577 ----
  		   sysident_str);
  	printf(_("Latest checkpoint's TimeLineID:       %u\n"),
  		   ControlFile.checkPointCopy.ThisTimeLineID);
+ 	printf(_("Latest checkpoint's full_page_writes: %s\n"),
+ 		   ControlFile.checkPointCopy.fullPageWrites ? _("yes") : _("no"));
  	printf(_("Latest checkpoint's NextXID:          %u/%u\n"),
  		   ControlFile.checkPointCopy.nextXidEpoch,
  		   ControlFile.checkPointCopy.nextXid);
***************
*** 637,642 **** RewriteControlFile(void)
--- 640,647 ----
  	ControlFile.minRecoveryPoint.xrecoff = 0;
  	ControlFile.backupStartPoint.xlogid = 0;
  	ControlFile.backupStartPoint.xrecoff = 0;
+ 	ControlFile.backupEndPoint.xlogid = 0;
+ 	ControlFile.backupEndPoint.xrecoff = 0;
  	ControlFile.backupEndRequired = false;
  
  	/*
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 192,197 **** extern int	XLogArchiveTimeout;
--- 192,198 ----
  extern bool XLogArchiveMode;
  extern char *XLogArchiveCommand;
  extern bool EnableHotStandby;
+ extern bool fullPageWrites;
  extern bool log_checkpoints;
  
  /* WAL levels */
***************
*** 307,312 **** extern void CreateCheckPoint(int flags);
--- 308,314 ----
  extern bool CreateRestartPoint(int flags);
  extern void XLogPutNextOid(Oid nextOid);
  extern XLogRecPtr XLogRestorePoint(const char *rpName);
+ extern void UpdateFullPageWrites(void);
  extern XLogRecPtr GetRedoRecPtr(void);
  extern XLogRecPtr GetInsertRecPtr(void);
  extern XLogRecPtr GetFlushRecPtr(void);
*** a/src/include/catalog/pg_control.h
--- b/src/include/catalog/pg_control.h
***************
*** 21,27 ****
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	921
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
--- 21,27 ----
  
  
  /* Version identifier for this pg_control format */
! #define PG_CONTROL_VERSION	922
  
  /*
   * Body of CheckPoint XLOG records.  This is declared here because we keep
***************
*** 33,38 **** typedef struct CheckPoint
--- 33,39 ----
  	XLogRecPtr	redo;			/* next RecPtr available when we began to
  								 * create CheckPoint (i.e. REDO start point) */
  	TimeLineID	ThisTimeLineID; /* current TLI */
+ 	bool			fullPageWrites;	/* current full_page_writes */
  	uint32		nextXidEpoch;	/* higher-order bits of nextXid */
  	TransactionId nextXid;		/* next free XID */
  	Oid			nextOid;		/* next free OID */
***************
*** 60,65 **** typedef struct CheckPoint
--- 61,67 ----
  #define XLOG_BACKUP_END					0x50
  #define XLOG_PARAMETER_CHANGE			0x60
  #define XLOG_RESTORE_POINT				0x70
+ #define XLOG_FPW_CHANGE				0x80
  
  
  /*
***************
*** 138,143 **** typedef struct ControlFileData
--- 140,151 ----
  	 * record, to make sure the end-of-backup record corresponds the base
  	 * backup we're recovering from.
  	 *
+ 	 * backupEndPoint is the backup end location, if we are recovering from
+ 	 * an online backup which was taken from the standby and haven't reached
+ 	 * the end of backup yet. It is initialized to the minimum recovery point
+ 	 * in pg_control which was backed up last. It is reset to zero when
+ 	 * the end of backup is reached, and we mustn't start up before that.
+ 	 *
  	 * If backupEndRequired is true, we know for sure that we're restoring
  	 * from a backup, and must see a backup-end record before we can safely
  	 * start up. If it's false, but backupStartPoint is set, a backup_label
***************
*** 146,151 **** typedef struct ControlFileData
--- 154,160 ----
  	 */
  	XLogRecPtr	minRecoveryPoint;
  	XLogRecPtr	backupStartPoint;
+ 	XLogRecPtr	backupEndPoint;
  	bool		backupEndRequired;
  
  	/*

#73

Steve Singer

ssinger_pg@sympatico.ca

almost 14 years ago

In reply to: Fujii Masao (#72)

Re: Online base backup from the hot-standby

On 12-01-17 05:38 AM, Fujii Masao wrote:

On Fri, Jan 13, 2012 at 5:02 PM, Fujii Masao<masao.fujii@gmail.com> wrote:

The amount of code changes to allow pg_basebackup to make a backup from
the standby seems to be small. So I ended up merging those changes and the
infrastructure patch. WIP patch attached. But I'd be happy to split the patch again
if you want.

Attached is the updated version of the patch. I wrote the limitations of
standby-only backup in the document and changed the error messages.

Here is my review of this version of the patch. I think this patch has
been in every CF for 9.2 and I feel it is getting close to being
committed. The only issue of significance is a crash I encountered
while testing, see below.

I am fine with including the pg_basebackup changes in the patch; it also
makes testing some of the other changes possible.

The documentation updates you have are good.

I don't see any issues looking at the code.

Testing Review
--------------------------------

I encountered this on my first replica (the one based on the master). I
am not sure if it is related to this patch, it happened after the
pg_basebackup against the replica finished.

TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File:
"twophase.c", Line: 1238)
LOG: startup process (PID 12222) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

A little earlier this postmaster had printed.

LOG: restored log file "00000001000000000000001F" from archive
LOG: restored log file "000000010000000000000020" from archive
cp: cannot stat
`/usr/local/pgsql92git/archive/000000010000000000000021': No such file
or directory
LOG: unexpected pageaddr 0/19000000 in log file 0, segment 33, offset 0
cp: cannot stat
`/usr/local/pgsql92git/archive/000000010000000000000021': No such file
or directory

I have NOT been able to replicate this error and I am not sure exactly
what I had done in my testing prior to that point.

In another test run I had

- set full_page_writes=off and did a checkpoint
- Started the pg_basebackup
- set full_page_writes=on and did a HUP + some database activity that
might have forced a checkpoint.

I got this message from pg_basebackup.
./pg_basebackup -D ../data3 -l foo -h localhost -p 5438
pg_basebackup: could not get WAL end position from server

I point this out because the message is different from the normal "could
not initiate base backup: FATAL: WAL generated with
full_page_writes=off" that I normally see. We might want to add
PQerrorMessage(conn) to pg_basebackup to print the error details.
Since this patch didn't actually change pg_basebackup I don't think you're
required to improve the error messages in it. I am just mentioning this
because it came up in testing.

The rest of the tests I did involving changing full_page_writes
with/without checkpoints and sighups and promoting the replica seemed to
work as expected.

Regards,

#74

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Steve Singer (#73)

1 attachment(s)

Re: Online base backup from the hot-standby

On Fri, Jan 20, 2012 at 1:01 PM, Steve Singer <ssinger_pg@sympatico.ca> wrote:

Here is my review of this version of the patch. I think this patch has been
in every CF for 9.2 and I feel it is getting close to being committed.

Thanks for the review!

Testing Review
--------------------------------

I encountered this on my first replica (the one based on the master). I am
not sure if it is related to this patch, it happened after the pg_basebackup
against the replica finished.

TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File:
"twophase.c", Line: 1238)
LOG: startup process (PID 12222) was terminated by signal 6: Aborted

I spent an hour trying to reproduce that issue, but in the end I was not able
to :(
For now I have no idea what causes it. But the patch doesn't
touch any code related to that issue, so I'm guessing that it's a problem in
HEAD rather than in the patch...

I will spend more time diagnosing the issue. If you notice anything, please
let me know.

- set full page writes=off and did a checkpoint
- Started the pg_basebackup
- set full_page_writes=on and did a HUP + some database activity that might
have forced a checkpoint.

I got this message from pg_basebackup.
./pg_basebackup -D ../data3 -l foo -h localhost -p 5438
pg_basebackup: could not get WAL end position from server

I point this out because the message is different from the normal "could not
initiate base backup: FATAL: WAL generated with full_page_writes=off" that I
normally see.

I guess that's because you started pg_basebackup before the checkpoint record
with full_page_writes=off had been replicated to and replayed on the standby.
In this case, when you start pg_basebackup, it uses the previous checkpoint
record, presumably with full_page_writes=on, as the backup starting checkpoint, so
pg_basebackup passes the full_page_writes check at the start of the backup.
Then it fails the check at the end of the backup, so you got that error message.
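
The sequence described above can be sketched as a small model. This is only an illustration under stated assumptions: the backup is checked once against the full_page_writes value of the starting checkpoint, and once more against every value replayed on the standby while the backup ran. BackupResult and check_backup_fpw are hypothetical names, not identifiers from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes for this sketch; not PostgreSQL identifiers. */
typedef enum
{
    BACKUP_OK,
    BACKUP_FAIL_AT_START,
    BACKUP_FAIL_AT_END
} BackupResult;

/*
 * fpw_at_start: full_page_writes as recorded in the checkpoint that the
 *               standby uses as the backup starting point.
 * fpw_during:   full_page_writes values replayed on the standby while the
 *               backup was running, in replay order (may be NULL if n == 0).
 */
BackupResult
check_backup_fpw(bool fpw_at_start, const bool *fpw_during, size_t n)
{
    size_t i;

    assert(n == 0 || fpw_during != NULL);

    /* Check made when the backup starts. */
    if (!fpw_at_start)
        return BACKUP_FAIL_AT_START;

    /* Check made when the backup ends: any "off" seen in between is fatal. */
    for (i = 0; i < n; i++)
    {
        if (!fpw_during[i])
            return BACKUP_FAIL_AT_END;
    }
    return BACKUP_OK;
}
```

In Steve's test the starting checkpoint still carried full_page_writes=on, and the "off" value was replayed only after the backup had started, which corresponds to the BACKUP_FAIL_AT_END branch in this model.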

We might want to add PQerrorMessage(conn) to
pg_basebackup to print the error details. Since this patch didn't actually
change pg_basebackup I don't think you're required to improve the error
messages in it. I am just mentioning this because it came up in testing.

Agreed.

When PQresultStatus() returns an unexpected status, the error
message from PQerrorMessage() should normally be reported. But in the one
case where pg_basebackup could not get the WAL end position, PQerrorMessage()
was not reported... This looks like an oversight in pg_basebackup. I think that
it's better to fix that as a separate patch (attached). Thoughts?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

pg_basebackup_errormsg_v1.patch (text/x-diff; charset=US-ASCII)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 4007680..873ef64 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -914,7 +914,7 @@ BaseBackup(void)
 	res = PQexec(conn, "IDENTIFY_SYSTEM");
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 	{
-		fprintf(stderr, _("%s: could not identify system: %s\n"),
+		fprintf(stderr, _("%s: could not identify system: %s"),
 				progname, PQerrorMessage(conn));
 		disconnect_and_exit(1);
 	}
@@ -1049,8 +1049,8 @@ BaseBackup(void)
 	res = PQgetResult(conn);
 	if (PQresultStatus(res) != PGRES_TUPLES_OK)
 	{
-		fprintf(stderr, _("%s: could not get WAL end position from server\n"),
-				progname);
+		fprintf(stderr, _("%s: could not get WAL end position from server: %s"),
+				progname, PQerrorMessage(conn));
 		disconnect_and_exit(1);
 	}
 	if (PQntuples(res) != 1)
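
The patch above also drops the trailing "\n" from the format strings. A minimal sketch of why, assuming the server error text already ends with a newline (libpq documents that a non-empty PQerrorMessage() result includes one). The helper format_wal_end_error is hypothetical and uses only the C standard library; server_error stands in for the string PQerrorMessage(conn) would return.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: builds the message the patched code would print.
 * Returns the value of snprintf(), i.e. the length of the full message.
 */
int
format_wal_end_error(char *buf, size_t buflen,
                     const char *progname, const char *server_error)
{
    assert(buf != NULL && progname != NULL && server_error != NULL);

    /* No "\n" in the format: server_error is assumed to end with one. */
    return snprintf(buf, buflen,
                    "%s: could not get WAL end position from server: %s",
                    progname, server_error);
}
```

Keeping the "\n" in the format string as well would print an empty line after every error, which is presumably why the first hunk removes it from the "could not identify system" message too.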

#75

Erik Rijkers

er@xs4all.nl

almost 14 years ago

In reply to: Steve Singer (#73)

Re: Online base backup from the hot-standby

On Fri, January 20, 2012 05:01, Steve Singer wrote:

On 12-01-17 05:38 AM, Fujii Masao wrote:

On Fri, Jan 13, 2012 at 5:02 PM, Fujii Masao<masao.fujii@gmail.com> wrote:

The amount of code changes to allow pg_basebackup to make a backup from
the standby seems to be small. So I ended up merging those changes and the
infrastructure patch. WIP patch attached. But I'd be happy to split the patch again
if you want.

Attached is the updated version of the patch. I wrote the limitations of
standby-only backup in the document and changed the error messages.

Here is my review of this version of the patch. I think this patch has
been in every CF for 9.2 and I feel it is getting close to being
committed. The only issue of significance is a crash I encountered
while testing, see below.

I am fine with including the pg_basebackup changes in the patch; it also
makes testing some of the other changes possible.

The documentation updates you have are good.

I don't see any issues looking at the code.

Testing Review
--------------------------------

I encountered this on my first replica (the one based on the master). I
am not sure if it is related to this patch, it happened after the
pg_basebackup against the replica finished.

TRAP: FailedAssertion("!(((xid) != ((TransactionId) 0)))", File:
"twophase.c", Line: 1238)
LOG: startup process (PID 12222) was terminated by signal 6: Aborted
LOG: terminating any other active server processes

A little earlier this postmaster had printed.

LOG: restored log file "00000001000000000000001F" from archive
LOG: restored log file "000000010000000000000020" from archive
cp: cannot stat
`/usr/local/pgsql92git/archive/000000010000000000000021': No such file
or directory
LOG: unexpected pageaddr 0/19000000 in log file 0, segment 33, offset 0
cp: cannot stat
`/usr/local/pgsql92git/archive/000000010000000000000021': No such file
or directory

I have NOT been able to replicate this error and I am not sure exactly
what I had done in my testing prior to that point.

I'm not sure, but it does look like this is the "mystery" bug that I encountered repeatedly
already in 9.0devel; but I was never able to reproduce it reliably. But I don't think it was ever
solved.

http://archives.postgresql.org/pgsql-hackers/2010-03/msg00223.php

Erik Rijkers

In another test run I had

- set full page writes=off and did a checkpoint
- Started the pg_basebackup
- set full_page_writes=on and did a HUP + some database activity that
might have forced a checkpoint.

I got this message from pg_basebackup.
./pg_basebackup -D ../data3 -l foo -h localhost -p 5438
pg_basebackup: could not get WAL end position from server

I point this out because the message is different from the normal "could
not initiate base backup: FATAL: WAL generated with
full_page_writes=off" that I normally see. We might want to add
PQerrorMessage(conn) to pg_basebackup to print the error details.
Since this patch didn't actually change pg_basebackup I don't think you're
required to improve the error messages in it. I am just mentioning this
because it came up in testing.

The rest of the tests I did involving changing full_page_writes
with/without checkpoints and sighups and promoting the replica seemed to
work as expected.

#76

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Erik Rijkers (#75)

Re: Online base backup from the hot-standby

On Fri, Jan 20, 2012 at 7:37 PM, Erik Rijkers <er@xs4all.nl> wrote:

I'm not sure, but it does look like this is the "mystery" bug that I encountered repeatedly
already in 9.0devel; but I was never able to reproduce it reliably. But I don't think it was ever
solved.

http://archives.postgresql.org/pgsql-hackers/2010-03/msg00223.php

I also encountered the same issue about a year ago:
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01579.php

At that time, I identified its cause:
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01700.php

Eventually it was fixed:
http://archives.postgresql.org/pgsql-hackers/2010-11/msg01910.php

But Steve encountered it again, which means that the above fix is not
sufficient. Unless the issue is derived from my patch, we should do
another cycle of diagnosis of it.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#77

Simon Riggs

simon@2ndquadrant.com

almost 14 years ago

In reply to: Fujii Masao (#72)

Re: Online base backup from the hot-standby

On Tue, Jan 17, 2012 at 10:38 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jan 13, 2012 at 5:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

The amount of code changes to allow pg_basebackup to make a backup from
the standby seems to be small. So I ended up merging those changes and the
infrastructure patch. WIP patch attached. But I'd be happy to split the patch again
if you want.

Attached is the updated version of the patch. I wrote the limitations of
standby-only backup in the document and changed the error messages.

I'm looking at this patch and wondering why we're doing so many
press-ups to ensure full_page_writes parameter is on. This will still
fail if you use a utility that removes the full page writes, but fail
silently.

I think it would be beneficial to explicitly check that all WAL
records have full page writes actually attached to them until we
achieve consistency.

Surprised to see XLOG_FPW_CHANGE is there again after I objected to it
and it was removed. Not sure why? We already track other parameters
when they change, so I don't want to introduce a whole new WAL record
for each new parameter whose change needs tracking.

Please make a note for committer that wal version needs bumping.

I think it's probably time to start a README.recovery to explain why
this works the way it does. Other changes can then start to do that as
well, so we can keep this to sane levels of complexity.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#78

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Fujii Masao (#76)

Re: Online base backup from the hot-standby

On Fri, Jan 20, 2012 at 11:04 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

But Steve encountered it again, which means that the above fix is not
sufficient. Unless the issue is derived from my patch, we should do
another cycle of diagnosis of it.

It's my bug, and I've posted a fix but not yet applied it, just added
to open items list. The only reason for that was time pressure, which
is now gone, so I'll look to apply it sooner.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#79

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Simon Riggs (#77)

Re: Online base backup from the hot-standby

Thanks for the review!

On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm looking at this patch and wondering why we're doing so many
press-ups to ensure full_page_writes parameter is on. This will still
fail if you use a utility that removes the full page writes, but fail
silently.

I think it would be beneficial to explicitly check that all WAL
records have full page writes actually attached to them until we
achieve consistency.

I agree that it's worth adding such a safeguard. That can be a self-contained
feature, so I'll submit a separate patch for that, to keep each patch small.

Surprised to see XLOG_FPW_CHANGE is there again after I objected to it
and it was removed. Not sure why? We already track other parameters
when they change, so I don't want to introduce a whole new WAL record
for each new parameter whose change needs tracking.

I revived that because whenever full_page_writes must be WAL-logged
or replayed, there is no need to WAL-log or replay the HS parameters.
The opposite is also true. Logging or replaying all of them every time
seems a bit useless, and makes the code unreadable. ISTM that a separate
XLOG_FPW_CHANGE record makes the code simpler and avoids the useless
WAL activity that merging them into one WAL record would add.

Please make a note for committer that wal version needs bumping.

Okay, will add the note about bumping XLOG_PAGE_MAGIC.

I think it's probably time to start a README.recovery to explain why
this works the way it does. Other changes can then start to do that as
well, so we can keep this to sane levels of complexity.

In this CF, there are other patches which change the recovery code. So
I think that it's better to do that after all of them have been committed.
No need to hurry to do that now.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#80

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Fujii Masao (#79)

Re: Online base backup from the hot-standby

On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Thanks for the review!

On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm looking at this patch and wondering why we're doing so many
press-ups to ensure full_page_writes parameter is on. This will still
fail if you use a utility that removes the full page writes, but fail
silently.

I think it would be beneficial to explicitly check that all WAL
records have full page writes actually attached to them until we
achieve consistency.

I agree that it's worth adding such a safeguard. That can be a self-contained
feature, so I'll submit a separate patch for that, to keep each patch small.

Maybe, but you mean do this now as well? Not sure I like silent errors.

Surprised to see XLOG_FPW_CHANGE is there again after I objected to it
and it was removed. Not sure why? We already track other parameters
when they change, so I don't want to introduce a whole new WAL record
for each new parameter whose change needs tracking.

I revived that because whenever full_page_writes must be WAL-logged
or replayed, there is no need to WAL-log or replay the HS parameters.
The opposite is also true. Logging or replaying all of them every time
seems to be a bit useless, and to make the code unreadable. ISTM that
XLOG_FPW_CHANGE can make the code simpler and avoid adding useless
WAL activity by merging them into one WAL record.

I don't agree, but for the sake of getting on with things I say this
is minor so is no reason to block this.

Please make a note for committer that wal version needs bumping.

Okay, will add the note about bumping XLOG_PAGE_MAGIC.

I think it's probably time to start a README.recovery to explain why
this works the way it does. Other changes can then start to do that as
well, so we can keep this to sane levels of complexity.

In this CF, there are other patches which change the recovery code. So
I think that it's better to do that after all of them have been committed.
No need to hurry to do that now.

Agreed.

Will proceed to final review and if all OK, commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#81

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Simon Riggs (#80)

Re: Online base backup from the hot-standby

On Fri, Jan 20, 2012 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Thanks for the review!

On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm looking at this patch and wondering why we're doing so many
press-ups to ensure full_page_writes parameter is on. This will still
fail if you use a utility that removes the full page writes, but fail
silently.

I think it would be beneficial to explicitly check that all WAL
records have full page writes actually attached to them until we
achieve consistency.

I agree that it's worth adding such a safeguard. That can be a self-contained
feature, so I'll submit a separate patch for that, to keep each patch small.

Maybe, but you mean do this now as well? Not sure I like silent errors.

If many people think the patch is not acceptable without such a safeguard,
I will do that right now. Otherwise, I'd like to take more time to do
that, i.e., add it to the 9.2dev Open Items.

I've not come up with a good idea. An ugly idea is to keep track of all replays of
full-page writes for every buffer page (i.e., prepare a table with 1 bit per
buffer page and set the corresponding bit to 1 when a full-page write is applied),
and then check whether a full-page write has already been applied when
replaying a normal WAL record... Do you have any better idea?
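
The one-bit-per-page table could look roughly like the sketch below. It is only an illustration of the bookkeeping, not code from any patch: FpwTracker and its functions are hypothetical names, pages are identified by a plain index, and nothing here addresses how such a table would be sized or kept in shared memory.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical tracker: one bit per buffer page. */
typedef struct
{
    unsigned char *bits;
    size_t      npages;
} FpwTracker;

/* Allocate a zeroed bitmap; returns false if allocation fails. */
bool
fpw_tracker_init(FpwTracker *t, size_t npages)
{
    assert(t != NULL && npages > 0);
    t->npages = npages;
    t->bits = calloc((npages + CHAR_BIT - 1) / CHAR_BIT, 1);
    return t->bits != NULL;
}

/* Call when a full-page image has been replayed for this page. */
void
fpw_tracker_mark(FpwTracker *t, size_t page)
{
    assert(page < t->npages);
    t->bits[page / CHAR_BIT] |= (unsigned char) (1u << (page % CHAR_BIT));
}

/* Call before replaying a normal record: was a full-page image seen first? */
bool
fpw_tracker_seen(const FpwTracker *t, size_t page)
{
    assert(page < t->npages);
    return ((t->bits[page / CHAR_BIT] >> (page % CHAR_BIT)) & 1u) != 0;
}

void
fpw_tracker_free(FpwTracker *t)
{
    free(t->bits);
    t->bits = NULL;
    t->npages = 0;
}
```

A replay loop would call fpw_tracker_mark() for every restored full-page image and treat a false result from fpw_tracker_seen() on an ordinary record as the error condition, up to the point where consistency is reached.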

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#82

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Fujii Masao (#81)

Re: Online base backup from the hot-standby

On Mon, Jan 23, 2012 at 10:29 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jan 20, 2012 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Thanks for the review!

On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm looking at this patch and wondering why we're doing so many
press-ups to ensure full_page_writes parameter is on. This will still
fail if you use a utility that removes the full page writes, but fail
silently.

I think it would be beneficial to explicitly check that all WAL
records have full page writes actually attached to them until we
achieve consistency.

I agree that it's worth adding such a safeguard. That can be a self-contained
feature, so I'll submit a separate patch for that, to keep each patch small.

Maybe, but you mean do this now as well? Not sure I like silent errors.

If many people think the patch is not acceptable without such a safeguard,
I will do that right now. Otherwise, I'd like to take more time to do
that, i.e., add it to the 9.2dev Open Items.

I've not come up with a good idea. An ugly idea is to keep track of all replays of
full-page writes for every buffer page (i.e., prepare a table with 1 bit per
buffer page and set the corresponding bit to 1 when a full-page write is applied),
and then check whether a full-page write has already been applied when
replaying a normal WAL record... Do you have any better idea?

Not sure.

I think the only possible bug here is one introduced by an outside utility.

In that case, I don't think it should be the job of the backend to go
too far to protect against such an atypical error. So if we can't solve
it fairly easily and with no overhead then I'd say let's skip it. We
could easily introduce a bug here just by having faulty checking code.

So let's add it to 9.2 open items as a non-priority item. I'll proceed
to commit for this now.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#83

Robert Haas

robertmhaas@gmail.com

almost 14 years ago

In reply to: Fujii Masao (#81)

Re: Online base backup from the hot-standby

On Mon, Jan 23, 2012 at 5:29 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

If many people think the patch is not acceptable without such a safeguard,
I will do that right now.

That's my view. I think we ought to resolve this issue before commit,
especially since it seems unclear that we know how to fix it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#84

Robert Haas

robertmhaas@gmail.com

almost 14 years ago

In reply to: Robert Haas (#83)

Re: Online base backup from the hot-standby

On Mon, Jan 23, 2012 at 8:11 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Mon, Jan 23, 2012 at 5:29 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

If many people think the patch is not acceptable without such a safeguard,
I will do that right now.

That's my view. I think we ought to resolve this issue before commit,
especially since it seems unclear that we know how to fix it.

Actually, never mind. On reading this more carefully, I'm not too
concerned about the possibility of people breaking it with pg_lesslog
or similar. But it should be solid if you use only the functionality
built into core.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#85

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Simon Riggs (#82)

Re: Online base backup from the hot-standby

On Mon, Jan 23, 2012 at 10:11 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Mon, Jan 23, 2012 at 10:29 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Fri, Jan 20, 2012 at 11:34 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Fri, Jan 20, 2012 at 12:54 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

Thanks for the review!

On Fri, Jan 20, 2012 at 8:15 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

I'm looking at this patch and wondering why we're doing so many
press-ups to ensure full_page_writes parameter is on. This will still
fail if you use a utility that removes the full page writes, but fail
silently.

I think it would be beneficial to explicitly check that all WAL
records have full page writes actually attached to them until we
achieve consistency.

I agree that it's worth adding such a safeguard. That can be a self-contained
feature, so I'll submit a separate patch for that, to keep each patch small.

Maybe, but you mean do this now as well? Not sure I like silent errors.

If many people think the patch is not acceptable without such a safeguard,
I will do that right now. Otherwise, I'd like to take more time to do
that, i.e., add it to the 9.2dev Open Items.

I've not come up with a good idea. An ugly idea is to keep track of all replays of
full-page writes for every buffer page (i.e., prepare a table with 1 bit per
buffer page and set the corresponding bit to 1 when a full-page write is applied),
and then check whether a full-page write has already been applied when
replaying a normal WAL record... Do you have any better idea?

Not sure.

I think the only possible bug here is one introduced by an outside utility.

In that case, I don't think it should be the job of the backend to go
too far to protect against such an atypical error. So if we can't solve
it fairly easily and with no overhead then I'd say let's skip it. We
could easily introduce a bug here just by having faulty checking code.

So let's add it to 9.2 open items as a non-priority item.

Agreed.

I'll proceed to commit for this now.

Thanks a lot!

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#86

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Fujii Masao (#85)

Re: Online base backup from the hot-standby

On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I'll proceed to commit for this now.

Thanks a lot!

Can I just check a few things?

You say
/*
+        * Update full_page_writes in shared memory and write an
+        * XLOG_FPW_CHANGE record before resource manager writes cleanup
+        * WAL records or checkpoint record is written.
+        */

why does it need to be before the cleanup and checkpoint?

You say
/*
+        * Currently only non-exclusive backup can be taken during recovery.
+        */

why?

You mention in the docs
"The backup history file is not created in the database cluster backed up."
but we need to explain the bad effect, if any.

You say
"If the standby is promoted to the master during online backup, the
backup fails."
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

There are some good explanations in comments of other things, just not
everywhere needed.

What happens if we shutdown the WALwriter and then issue SIGHUP?

Are we sure we want to make the change of file format mandatory? That
means earlier versions of clients such as pg_basebackup will fail
against this version. Should we allow that if BACKUP FROM is missing
we assume it was master?
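
The fallback suggested here (no BACKUP FROM line means the backup came from the master) can be sketched as follows. This is a hedged illustration, not the patch's parser: backup_taken_from_standby is a hypothetical function, and the exact line format "BACKUP FROM: standby" is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Returns true if the label text says the backup was taken from a standby.
 * A label with no "BACKUP FROM: " line is treated as a backup taken from
 * the master, so labels written in the older format still work.
 */
bool
backup_taken_from_standby(const char *label)
{
    static const char key[] = "BACKUP FROM: ";  /* assumed line prefix */
    const size_t keylen = sizeof(key) - 1;
    const char *line = label;

    assert(label != NULL);

    while (line != NULL && *line != '\0')
    {
        if (strncmp(line, key, keylen) == 0)
        {
            const char *value = line + keylen;

            /* Match "standby" exactly, up to end of line or end of text. */
            return strncmp(value, "standby", 7) == 0 &&
                (value[7] == '\n' || value[7] == '\0');
        }

        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return false;               /* field missing: assume master */
}
```

With this shape, a reader of the label never fails merely because the field is absent; whether that is the right trade-off is the open question in the message above.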

There is nothing in the main docs to explain that the new feature is
available, or to explain the restrictions.
I expect you will add that later after commit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#87

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Simon Riggs (#86)

Re: Online base backup from the hot-standby

On Tue, Jan 24, 2012 at 10:54 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I'll proceed to commit for this now.

Thanks a lot!

Can I just check a few things?

Just to clarify, not expecting another patch version, just reply here
and I can edit.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#88

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Simon Riggs (#86)

Re: Online base backup from the hot-standby

On Tue, Jan 24, 2012 at 7:54 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Tue, Jan 24, 2012 at 9:51 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

I'll proceed to commit for this now.

Thanks a lot!

Can I just check a few things?

Sure!

You say
/*
+        * Update full_page_writes in shared memory and write an
+        * XLOG_FPW_CHANGE record before resource manager writes cleanup
+        * WAL records or checkpoint record is written.
+        */

why does it need to be before the cleanup and checkpoint?

Because the cleanup and checkpoint need to see FPW in shared memory.
If FPW in shared memory is not updated there, the cleanup and (end-of-recovery)
checkpoint always use an initial value (= false) of FPW in shared memory.
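
A minimal sketch of the ordering problem described above, under the assumption that a checkpoint record simply copies whatever full_page_writes value is in shared memory at the time it is built. SharedXLogState, CheckPointRecord and both functions are hypothetical stand-ins, not the patch's data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared-memory copy; starts out false. */
typedef struct
{
    bool        fullPageWrites;
} SharedXLogState;

/* Hypothetical stand-in for the checkpoint record body. */
typedef struct
{
    bool        fullPageWrites;
} CheckPointRecord;

/* Copy the configured setting into shared memory. */
void
update_full_page_writes(SharedXLogState *shared, bool configured_value)
{
    assert(shared != NULL);
    shared->fullPageWrites = configured_value;
}

/* A checkpoint records whatever value shared memory holds right now. */
CheckPointRecord
make_checkpoint(const SharedXLogState *shared)
{
    CheckPointRecord cp;

    assert(shared != NULL);
    cp.fullPageWrites = shared->fullPageWrites;
    return cp;
}
```

If make_checkpoint() runs before update_full_page_writes(), the record carries the initial false even when the configured setting is on, which is the situation the comment in the patch is meant to rule out.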

You say
/*
+        * Currently only non-exclusive backup can be taken during recovery.
+        */

why?

At first I proposed to allow exclusive backup to be taken during recovery. But
Heikki disagreed with the proposal because he thought that the exclusive backup
procedure which I proposed was too fragile. No one could come up with any good
user-friendly easy-to-implement procedure. So we decided to allow only
non-exclusive backup to be taken during recovery. In non-exclusive backup,
the complicated procedure is performed by pg_basebackup, so a user doesn't
need to care about that.

You mention in the docs
"The backup history file is not created in the database cluster backed up."
but we need to explain the bad effect, if any.

Users cannot look up information (e.g., which WAL files are required for
a backup, backup start/end times, etc.) about backups which have been taken
so far. If they need such information, they need to record it manually.

Users cannot pass the backup history file to pg_archivecleanup, which might make
pg_archivecleanup more difficult to use.

On second thought, pg_basebackup would be able to create the backup history
file in the backup, though it cannot be archived. Should we implement that
feature to alleviate the bad effect?

You say
"If the standby is promoted to the master during online backup, the
backup fails."
but no explanation of why?

I could work those things out, but I don't want to have to, plus we
may disagree if I did.

If the backup succeeded in that case, an archive recovery started from that
backup would need to cross between two timelines, which means that we would
need to set recovery_target_timeline before starting recovery. Whether
recovery_target_timeline needs to be set would depend on whether the standby
was promoted while the backup was being taken. Leaving such a decision to a
user seems fragile.

pg_basebackup -x ensures that all required files are included in the backup, so
we can start recovery without restoring any file from the archive. But if the
standby is promoted during the backup, the timeline history file becomes an
essential file for recovery, and it is not included in the backup.
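
The dependency can be sketched as follows. This is only an illustration under
assumptions: required_history_file() is a hypothetical helper, not a function in
PostgreSQL or in the patch. It assumes the usual naming of timeline history
files (the timeline ID as eight hex digits followed by ".history") and that a
promotion shows up as a different timeline ID at the end of the backup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: given the timeline on which a backup started and the
 * timeline on which it ended, report whether recovery from that backup would
 * need a timeline history file. If so, write its name into buf.
 */
static bool
required_history_file(unsigned int start_tli, unsigned int end_tli,
                      char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    buf[0] = '\0';
    if (start_tli == end_tli)
        return false;           /* no promotion: backup is self-contained */

    /* promotion happened: the new timeline's history file is needed */
    snprintf(buf, buflen, "%08X.history", end_tli);
    return true;
}
```

For a backup that starts on timeline 1 and ends on timeline 2, the helper
reports that 00000002.history is needed, which is exactly the file that
pg_basebackup -x would not have included.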

There are some good explanations in comments of other things, just not
everywhere needed.

What happens if we shutdown the WALwriter and then issue SIGHUP?

SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about
the case where smart shutdown kills walwriter but some backends are still running?
Currently SIGHUP affects full_page_writes and running backends use the changed
new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

To address the problem, we should either postpone the shutdown of walwriter
until all backends have gone away, or leave the update of full_page_writes to
the checkpointer process instead of walwriter. Thoughts?
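
A toy model of the two options, for illustration only. ToyCluster, FpwOwner,
toy_start_smart_shutdown() and toy_sighup() are hypothetical names, and the
model assumes (as a simplification) that smart shutdown stops walwriter at once
while the checkpointer keeps running as long as backends do.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Which process is responsible for updating the shared FPW value. */
typedef enum { FPW_BY_WALWRITER, FPW_BY_CHECKPOINTER } FpwOwner;

/* Hypothetical, heavily simplified view of the running processes. */
typedef struct
{
    bool walwriter_alive;
    bool checkpointer_alive;
    bool backends_alive;
    bool shared_fpw;        /* value the running backends actually use */
} ToyCluster;

/* Assumption: smart shutdown stops walwriter while backends still run. */
static void
toy_start_smart_shutdown(ToyCluster *c)
{
    assert(c != NULL);
    c->walwriter_alive = false;
}

/*
 * Deliver SIGHUP with a new full_page_writes setting. The shared value only
 * changes if the process that owns the update is still alive. Returns true
 * if the new value reached the backends.
 */
static bool
toy_sighup(ToyCluster *c, FpwOwner owner, bool new_fpw)
{
    bool owner_alive;

    assert(c != NULL);
    owner_alive = (owner == FPW_BY_WALWRITER) ? c->walwriter_alive
                                              : c->checkpointer_alive;
    if (owner_alive)
        c->shared_fpw = new_fpw;
    return owner_alive;
}
```

In this model, after toy_start_smart_shutdown() a SIGHUP is lost when the owner
is FPW_BY_WALWRITER but still takes effect when the owner is
FPW_BY_CHECKPOINTER.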

Are we sure we want to make the change of file format mandatory? That
means earlier versions of clients such as pg_basebackup will fail
against this version.

Really? Unless I'm missing something, pg_basebackup doesn't care about the
file format of backup_label. So I don't think that an earlier version of
pg_basebackup would fail.

There are no docs to explain the new feature is available in the main
docs, or to explain the restrictions.
I expect you will add that later after commit.

Okay. Will do.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#89

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Fujii Masao (#88)

Re: Online base backup from the hot-standby

On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

What happens if we shutdown the WALwriter and then issue SIGHUP?

SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about
the case where smart shutdown kills walwriter but some backends are still running?
Currently SIGHUP affects full_page_writes and running backends use the changed
new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

To address the problem, we should either postpone the shutdown of walwriter
until all backends have gone away, or leave the update of full_page_writes to
the checkpointer process instead of walwriter. Thoughts?

checkpointer seems the correct place to me

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#90

Simon Riggs

simon@2ndQuadrant.com

almost 14 years ago

In reply to: Simon Riggs (#89)

Re: Online base backup from the hot-standby

On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

What happens if we shutdown the WALwriter and then issue SIGHUP?

SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about
the case where smart shutdown kills walwriter but some backends are still running?
Currently SIGHUP affects full_page_writes and running backends use the changed
new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

To address the problem, we should either postpone the shutdown of walwriter
until all backends have gone away, or leave the update of full_page_writes to
the checkpointer process instead of walwriter. Thoughts?

checkpointer seems the correct place to me

Done.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

#91

Fujii Masao

masao.fujii@gmail.com

almost 14 years ago

In reply to: Simon Riggs (#90)

Re: Online base backup from the hot-standby

On Thu, Jan 26, 2012 at 3:07 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Jan 25, 2012 at 8:49 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

On Wed, Jan 25, 2012 at 8:16 AM, Fujii Masao <masao.fujii@gmail.com> wrote:

What happens if we shutdown the WALwriter and then issue SIGHUP?

SIGHUP doesn't affect full_page_writes in that case. Oh, you are concerned about
the case where smart shutdown kills walwriter but some backends are still running?
Currently SIGHUP affects full_page_writes and running backends use the changed
new value of full_page_writes. But in the patch, SIGHUP doesn't affect...

To address the problem, we should either postpone the shutdown of walwriter
until all backends have gone away, or leave the update of full_page_writes to
the checkpointer process instead of walwriter. Thoughts?

checkpointer seems the correct place to me

Done.

Thanks a lot!!

Upthread, I proposed another small patch which fixes an issue with an error
message of pg_basebackup. If it's reasonable, could you commit it?
http://archives.postgresql.org/message-id/CAHGQGwENjSDN=f_VDPwVQ53QRU0cu9+wZKBvwNaEXMawj-y-GQ@mail.gmail.com

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center