Restartable Recovery

Started by Simon Riggs over 19 years ago, 12 messages

simon@2ndquadrant.com

over 19 years ago

1 attachment(s)

On Marko Kreen's detailed suggestion, I've implemented a restartable
recovery mode for archive recovery (aka PITR). Restart points are known
as recovery checkpoints and are normally taken every 100 checkpoints in
the log to ensure good recovery performance.

An additional mode
standby_mode = 'true'
can also be specified, which ensures that a recovery checkpoint occurs
for each checkpoint in the logs.
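
The cadence described above can be modelled in a few lines of C. This is only a sketch of the counting rule as it appears in the attached patch: the struct and function names are hypothetical, and the real code keeps this state in file-level statics inside xlog.c rather than in a struct.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Same value as the patch's constant; everything else here is a model. */
#define RECOVERY_CHECKPOINT_INTERVAL 100

/* Hypothetical container for state the patch keeps in static variables. */
typedef struct RecoveryState
{
    bool in_archive_recovery;   /* models InArchiveRecovery */
    bool in_standby;            /* models InStandby (standby_mode = 'true') */
    int  n_checkpoints;         /* checkpoint records since last recovery checkpoint */
} RecoveryState;

/*
 * Called once per checkpoint record met during replay.  Returns true when a
 * recovery checkpoint should be taken at this record.
 */
bool
recovery_checkpoint_due(RecoveryState *st)
{
    assert(st != NULL);

    if (st->in_archive_recovery &&
        (st->in_standby || st->n_checkpoints >= RECOVERY_CHECKPOINT_INTERVAL))
    {
        st->n_checkpoints = 0;
        return true;
    }
    st->n_checkpoints++;
    return false;
}

/* Counts recovery checkpoints taken while replaying n_records checkpoint records. */
int
count_recovery_checkpoints(RecoveryState st, int n_records)
{
    int taken = 0;

    for (int i = 0; i < n_records; i++)
        if (recovery_checkpoint_due(&st))
            taken++;
    return taken;
}
```

Read literally, the counter is compared before it is incremented, so in this model a recovery checkpoint lands on the 101st checkpoint record after the previous one rather than the 100th. In standby mode every checkpoint record triggers one.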

There are some other code refactorings, though all changes are isolated to
xlog.c and pg_control.h; comments on the code are welcome.

Applies cleanly to cvstip, passes make check.

Further detailed testing is very desirable. I've tested restarting a
recovery twice, and it worked both times.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachments:

restartableRecovery.patch (text/x-patch; charset=UTF-8)

Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.242
diff -c -r1.242 xlog.c
*** src/backend/access/transam/xlog.c	27 Jun 2006 18:59:17 -0000	1.242
--- src/backend/access/transam/xlog.c	11 Jul 2006 16:46:21 -0000
***************
*** 124,129 ****
--- 124,130 ----
  
  /* File path names (all relative to $PGDATA) */
  #define BACKUP_LABEL_FILE		"backup_label"
+ #define BACKUP_LABEL_IN_USE	    "backup_label.in_use"
  #define RECOVERY_COMMAND_FILE	"recovery.conf"
  #define RECOVERY_COMMAND_DONE	"recovery.done"
  
***************
*** 183,188 ****
--- 184,192 ----
  static bool recoveryTargetInclusive = true;
  static TransactionId recoveryTargetXid;
  static time_t recoveryTargetTime;
+ static bool InStandby = false;
+ /* How many XLOG_CHECKPOINT* entries since last recovery checkpoint */
+ static int nCheckpoints = 0;    
  
  /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
  static TransactionId recoveryStopXid;
***************
*** 496,501 ****
--- 500,506 ----
  					 uint32 endLogId, uint32 endLogSeg);
  static void WriteControlFile(void);
  static void ReadControlFile(void);
+ static void ValidateControlFile(void);
  static char *str_time(time_t tnow);
  static void issue_xlog_fsync(void);
  
***************
*** 505,511 ****
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void remove_backup_label(void);
  static void rm_redo_error_callback(void *arg);
! 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
--- 510,516 ----
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void remove_backup_label(void);
  static void rm_redo_error_callback(void *arg);
! static void CheckPointShmem(XLogRecPtr checkPointRedo);
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 3626,3631 ****
--- 3631,3663 ----
  		ereport(FATAL,
  				(errmsg("incorrect checksum in control file")));
  
+     ValidateControlFile();
+ 
+ 	if (pg_perm_setlocale(LC_COLLATE, ControlFile->lc_collate) == NULL)
+ 		ereport(FATAL,
+ 			(errmsg("database files are incompatible with operating system"),
+ 			 errdetail("The database cluster was initialized with LC_COLLATE \"%s\","
+ 					   " which is not recognized by setlocale().",
+ 					   ControlFile->lc_collate),
+ 			 errhint("It looks like you need to initdb or install locale support.")));
+ 	if (pg_perm_setlocale(LC_CTYPE, ControlFile->lc_ctype) == NULL)
+ 		ereport(FATAL,
+ 			(errmsg("database files are incompatible with operating system"),
+ 		errdetail("The database cluster was initialized with LC_CTYPE \"%s\","
+ 				  " which is not recognized by setlocale().",
+ 				  ControlFile->lc_ctype),
+ 			 errhint("It looks like you need to initdb or install locale support.")));
+ 
+ 	/* Make the fixed locale settings visible as GUC variables, too */
+ 	SetConfigOption("lc_collate", ControlFile->lc_collate,
+ 					PGC_INTERNAL, PGC_S_OVERRIDE);
+ 	SetConfigOption("lc_ctype", ControlFile->lc_ctype,
+ 					PGC_INTERNAL, PGC_S_OVERRIDE);
+ }
+ 
+ static void
+ ValidateControlFile(void)
+ {
  	/*
  	 * Do compatibility checking immediately.  We do this here for 2 reasons:
  	 *
***************
*** 3722,3747 ****
  				  " but the server was compiled with LOCALE_NAME_BUFLEN %d.",
  						   ControlFile->localeBuflen, LOCALE_NAME_BUFLEN),
  				 errhint("It looks like you need to recompile or initdb.")));
- 	if (pg_perm_setlocale(LC_COLLATE, ControlFile->lc_collate) == NULL)
- 		ereport(FATAL,
- 			(errmsg("database files are incompatible with operating system"),
- 			 errdetail("The database cluster was initialized with LC_COLLATE \"%s\","
- 					   " which is not recognized by setlocale().",
- 					   ControlFile->lc_collate),
- 			 errhint("It looks like you need to initdb or install locale support.")));
- 	if (pg_perm_setlocale(LC_CTYPE, ControlFile->lc_ctype) == NULL)
- 		ereport(FATAL,
- 			(errmsg("database files are incompatible with operating system"),
- 		errdetail("The database cluster was initialized with LC_CTYPE \"%s\","
- 				  " which is not recognized by setlocale().",
- 				  ControlFile->lc_ctype),
- 			 errhint("It looks like you need to initdb or install locale support.")));
- 
- 	/* Make the fixed locale settings visible as GUC variables, too */
- 	SetConfigOption("lc_collate", ControlFile->lc_collate,
- 					PGC_INTERNAL, PGC_S_OVERRIDE);
- 	SetConfigOption("lc_ctype", ControlFile->lc_ctype,
- 					PGC_INTERNAL, PGC_S_OVERRIDE);
  }
  
  void
--- 3754,3759 ----
***************
*** 3749,3754 ****
--- 3761,3768 ----
  {
  	int			fd;
  
+     ValidateControlFile();
+ 
  	INIT_CRC32(ControlFile->crc);
  	COMP_CRC32(ControlFile->crc,
  			   (char *) ControlFile,
***************
*** 4095,4100 ****
--- 4109,4123 ----
  					(errmsg("restore_command = \"%s\"",
  							recoveryRestoreCommand)));
  		}
+ 		else if (strcmp(tok1, "standby_mode") == 0)
+ 		{
+ 			if (strcmp(tok2, "true") == 0)
+             {
+                 InStandby = true;
+ 				ereport(LOG,
+ 						(errmsg("standby_mode = true")));
+             }
+         }
  		else if (strcmp(tok1, "recovery_target_timeline") == 0)
  		{
  			rtliGiven = true;
***************
*** 4230,4235 ****
--- 4253,4259 ----
  	 * We are no longer in archive recovery state.
  	 */
  	InArchiveRecovery = false;
+ 	InStandby = false;
  
  	/*
  	 * We should have the ending log segment currently open.  Verify, and then
***************
*** 4465,4476 ****
  		ereport(LOG,
  				(errmsg("database system shutdown was interrupted at %s",
  						str_time(ControlFile->time))));
! 	else if (ControlFile->state == DB_IN_RECOVERY)
  		ereport(LOG,
  		   (errmsg("database system was interrupted while in recovery at %s",
  				   str_time(ControlFile->time)),
  			errhint("This probably means that some data is corrupted and"
  					" you will have to use the last backup for recovery.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  				(errmsg("database system was interrupted at %s",
--- 4489,4506 ----
  		ereport(LOG,
  				(errmsg("database system shutdown was interrupted at %s",
  						str_time(ControlFile->time))));
! 	else if (ControlFile->state == DB_IN_CRASH_RECOVERY)
  		ereport(LOG,
  		   (errmsg("database system was interrupted while in recovery at %s",
  				   str_time(ControlFile->time)),
  			errhint("This probably means that some data is corrupted and"
  					" you will have to use the last backup for recovery.")));
+ 	else if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
+ 		ereport(LOG,
+ 		   (errmsg("database system was interrupted while in recovery at log time %s",
+ 				   str_time(ControlFile->time)),
+ 			errhint("If this has occurred more than once some data may be corrupted"
+ 					" and you may need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  				(errmsg("database system was interrupted at %s",
***************
*** 4626,4641 ****
  	{
  		int			rmid;
  
  		if (InArchiveRecovery)
! 			ereport(LOG,
  					(errmsg("automatic recovery in progress")));
  		else
  			ereport(LOG,
  					(errmsg("database system was not properly shut down; "
  							"automatic recovery in progress")));
! 		ControlFile->state = DB_IN_RECOVERY;
  		ControlFile->time = time(NULL);
  		UpdateControlFile();
  
  		/* Start up the recovery environment */
  		XLogInitRelationCache();
--- 4656,4685 ----
  	{
  		int			rmid;
  
+         /*
+          * If we are in Archive Recovery then we create recovery checkpoints
+          * to avoid needing to start right from the beginning again. 
+          */
+     	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  		if (InArchiveRecovery)
!         {		
!         	ereport(LOG,
  					(errmsg("automatic recovery in progress")));
+     		ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+         }
  		else
+         {
  			ereport(LOG,
  					(errmsg("database system was not properly shut down; "
  							"automatic recovery in progress")));
!     		ControlFile->state = DB_IN_CRASH_RECOVERY;
!         }
  		ControlFile->time = time(NULL);
+     	ControlFile->prevCheckPoint = ControlFile->checkPoint;
+     	ControlFile->checkPoint = checkPointLoc;
+     	ControlFile->checkPointCopy = checkPoint;
  		UpdateControlFile();
+     	LWLockRelease(ControlFileLock);
  
  		/* Start up the recovery environment */
  		XLogInitRelationCache();
***************
*** 4668,4673 ****
--- 4712,4719 ----
  			ErrorContextCallback	errcontext;
  
  			InRedo = true;
+             nCheckpoints = 0;
+ 
  			ereport(LOG,
  					(errmsg("redo starts at %X/%X",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
***************
*** 5334,5345 ****
  		ereport(DEBUG2,
  				(errmsg("checkpoint starting")));
  
! 	CheckPointCLOG();
! 	CheckPointSUBTRANS();
! 	CheckPointMultiXact();
! 	FlushBufferPool();
! 	/* We deliberately delay 2PC checkpointing as long as possible */
! 	CheckPointTwoPhase(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5380,5389 ----
  		ereport(DEBUG2,
  				(errmsg("checkpoint starting")));
  
!     /*
!      * Ensure all of shared memory gets checkpointed
!      */
!     CheckPointShmem(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
***************
*** 5458,5463 ****
--- 5502,5508 ----
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  {
  	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+ 	CheckPoint	checkPoint;
  
  	if (info == XLOG_NEXTOID)
  	{
***************
*** 5469,5479 ****
  			ShmemVariableCache->nextOid = nextOid;
  			ShmemVariableCache->oidCount = 0;
  		}
  	}
  	else if (info == XLOG_CHECKPOINT_SHUTDOWN)
  	{
- 		CheckPoint	checkPoint;
- 
  		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  		/* In a SHUTDOWN checkpoint, believe the counters exactly */
  		ShmemVariableCache->nextXid = checkPoint.nextXid;
--- 5514,5523 ----
  			ShmemVariableCache->nextOid = nextOid;
  			ShmemVariableCache->oidCount = 0;
  		}
+         return;
  	}
  	else if (info == XLOG_CHECKPOINT_SHUTDOWN)
  	{
  		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  		/* In a SHUTDOWN checkpoint, believe the counters exactly */
  		ShmemVariableCache->nextXid = checkPoint.nextXid;
***************
*** 5499,5506 ****
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
- 		CheckPoint	checkPoint;
- 
  		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  		/* In an ONLINE checkpoint, treat the counters like NEXTOID */
  		if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
--- 5543,5548 ----
***************
*** 5519,5524 ****
--- 5561,5609 ----
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
  							checkPoint.ThisTimeLineID, ThisTimeLineID)));
  	}
+ 
+ #define RECOVERY_CHECKPOINT_INTERVAL 100
+ 
+     /*
+      * If we are in Standby mode, then do a recovery checkpoint 
+      * for each checkpoint found in WAL replay. Otherwise,
+      * don't do this very frequently since this slows down recovery.
+      * A recovery checkpoint is simply a recreation of the database
+      * state after the original checkpoint: all database changes
+      * are written to disk, allowing us to restart recovery from that
+      * point. 
+      *
+      * Note: Should recovery ever be parallelised in the future,
+      * all work *must* stop until the recovery checkpoint has
+      * completed.
+      */
+     if (InArchiveRecovery && (InStandby || nCheckpoints >= RECOVERY_CHECKPOINT_INTERVAL))
+     {
+         CheckPointShmem(lsn);
+ 
+     	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+    		ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+     	ControlFile->prevCheckPoint = ControlFile->checkPoint;
+         /* 
+          * The checkpoint record starts at ReadRecPtr; lsn is pointer to
+          * the next xlog record so must not be used here
+          */
+     	ControlFile->checkPoint = ReadRecPtr;
+     	ControlFile->checkPointCopy = checkPoint;
+         /* 
+          * Make it look like we started from this point, so this is *not*
+          * current time but original checkpoint time 
+          */
+     	ControlFile->time = checkPoint.time;
+     	UpdateControlFile();
+     	LWLockRelease(ControlFileLock);
+ 		ereport(LOG,
+ 				(errmsg("recovery checkpoint at %X/%X",
+ 						lsn.xlogid, lsn.xrecoff)));
+         nCheckpoints = 0;
+     }
+     else
+         nCheckpoints++;
  }
  
  void
***************
*** 6106,6111 ****
--- 6191,6207 ----
  							histfilepath)));
  	}
  
+ 	/*
+ 	 * Rename the backup label file out of the way, so that we don't accidentally
+ 	 * re-start recovery from the beginning.
+ 	 */
+ 	unlink(BACKUP_LABEL_IN_USE);
+ 	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_IN_USE) != 0)
+ 		ereport(FATAL,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 						BACKUP_LABEL_FILE, BACKUP_LABEL_IN_USE)));
+ 
  	return true;
  }
  
***************
*** 6119,6130 ****
  static void
  remove_backup_label(void)
  {
! 	if (unlink(BACKUP_LABEL_FILE) != 0)
! 		if (errno != ENOENT)
! 			ereport(FATAL,
! 					(errcode_for_file_access(),
! 					 errmsg("could not remove file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
  }
  
  /*
--- 6215,6226 ----
  static void
  remove_backup_label(void)
  {
!     if (unlink(BACKUP_LABEL_IN_USE) != 0)
!         if (errno != ENOENT)
!             ereport(FATAL,
!                     (errcode_for_file_access(),
!                     errmsg("could not remove file \"%s\": %m",
!                                     BACKUP_LABEL_IN_USE)));
  }
  
  /*
***************
*** 6147,6149 ****
--- 6243,6258 ----
  
  	pfree(buf.data);
  }
+ 
+ /* 
+  * Flush all shared memory data zones and ensure fsync
+  */
+ static void CheckPointShmem(XLogRecPtr checkPointRedo)
+ {
+ 	CheckPointCLOG();
+ 	CheckPointSUBTRANS();
+ 	CheckPointMultiXact();
+ 	FlushBufferPool();     /* performs all required fsyncs */
+ 	/* We deliberately delay 2PC checkpointing as long as possible */
+ 	CheckPointTwoPhase(checkPointRedo);
+ }
Index: src/include/catalog/pg_control.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_control.h,v
retrieving revision 1.29
diff -c -r1.29 pg_control.h
*** src/include/catalog/pg_control.h	4 Apr 2006 22:39:59 -0000	1.29
--- src/include/catalog/pg_control.h	11 Jul 2006 16:46:24 -0000
***************
*** 55,61 ****
  	DB_STARTUP = 0,
  	DB_SHUTDOWNED,
  	DB_SHUTDOWNING,
! 	DB_IN_RECOVERY,
  	DB_IN_PRODUCTION
  } DBState;
  
--- 55,62 ----
  	DB_STARTUP = 0,
  	DB_SHUTDOWNED,
  	DB_SHUTDOWNING,
! 	DB_IN_CRASH_RECOVERY,
! 	DB_IN_ARCHIVE_RECOVERY,
  	DB_IN_PRODUCTION
  } DBState;

Andreas Seltenreich

andreas+pg@gate450.dyndns.org

over 19 years ago

In reply to: Simon Riggs (#1)

Re: Restartable Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

> [2. text/x-patch; restartableRecovery.patch]

Hmm, wouldn't you have to reboot the resource managers at each
checkpoint? I'm afraid otherwise things like postponed page splits
could get lost on restart from a later checkpoint.
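
The concern can be illustrated with a toy replay loop. This is a sketch under simplifying assumptions: the record kinds, the `WalRecord` type and the function name are all hypothetical, and a single counter stands in for the resource manager's in-memory list of incomplete splits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified record kinds. */
typedef enum { REC_SPLIT_START, REC_SPLIT_DONE, REC_CHECKPOINT, REC_OTHER } RecKind;

typedef struct
{
    int     lsn;    /* position in the log, simplified to an int */
    RecKind kind;
} WalRecord;

/*
 * Replays the records with lsn >= start_lsn and returns how many splits are
 * still pending at end of WAL, i.e. how many the end-of-recovery cleanup
 * would know it has to finish.
 */
int
pending_splits_after_replay(const WalRecord *wal, size_t n, int start_lsn)
{
    int pending = 0;

    assert(wal != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
    {
        if (wal[i].lsn < start_lsn)
            continue;           /* before the restart point: never seen */
        if (wal[i].kind == REC_SPLIT_START)
            pending++;
        else if (wal[i].kind == REC_SPLIT_DONE && pending > 0)
            pending--;
    }
    return pending;
}
```

With a split-start record at LSN 10, a checkpoint at LSN 20 and the log ending before the split completes, replay from LSN 0 leaves one pending split for cleanup to finish, while replay restarted from LSN 20 leaves none, so the split is silently forgotten.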

regards,
andreas

Tom Lane

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Andreas Seltenreich (#2)

Re: Restartable Recovery

Andreas Seltenreich <andreas+pg@gate450.dyndns.org> writes:

> Simon Riggs <simon@2ndquadrant.com> writes:

>> [2. text/x-patch; restartableRecovery.patch]

> Hmm, wouldn't you have to reboot the resource managers at each
> checkpoint? I'm afraid otherwise things like postponed page splits
> could get lost on restart from a later checkpoint.

Ouch. That's a bit nasty. You can't just apply a postponed split at
checkpoint time, because the WAL record could easily be somewhere after
the checkpoint, leading to duplicate insertions. Right offhand I don't
see how to make this work :-(

regards, tom lane

Simon Riggs

simon@2ndquadrant.com

over 19 years ago

In reply to: Tom Lane (#3)

Re: Restartable Recovery

On Sun, 2006-07-16 at 10:51 -0400, Tom Lane wrote:

> Andreas Seltenreich <andreas+pg@gate450.dyndns.org> writes:

>> Simon Riggs <simon@2ndquadrant.com> writes:

>>> [2. text/x-patch; restartableRecovery.patch]

>> Hmm, wouldn't you have to reboot the resource managers at each
>> checkpoint? I'm afraid otherwise things like postponed page splits
>> could get lost on restart from a later checkpoint.

> Ouch. That's a bit nasty. You can't just apply a postponed split at
> checkpoint time, because the WAL record could easily be somewhere after
> the checkpoint, leading to duplicate insertions. Right offhand I don't
> see how to make this work :-(

Yes, ouch. So much for gung-ho code sprints; thanks Andreas.

To do this we would need to have another rmgr specific routine that gets
called at a recovery checkpoint. This would then write to disk the
current state of the incomplete multi-WAL actions, in some manner.
During the startup routines we would check for any pre-existing state
files and use those to initialise the incomplete action cache. Cleanup
would then discard all state files.

That allows us to not-forget actions, but it doesn't help us if there
are problems repeating actions twice. We would at least know that we are
in a potential double-action zone and could give different kinds of
errors or handling.

Or we can simply mark any indexes incomplete-needs-rebuild if they had a
page split during the overlap time between the last known good recovery
checkpoint and the following one. But that does lead to randomly bounded
recovery time, which might be better to have started from scratch
anyway.

Given time available for 8.2, neither one is a quick fix.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Simon Riggs (#4)

Re: [PATCHES] Restartable Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

> On Sun, 2006-07-16 at 10:51 -0400, Tom Lane wrote:

>> Ouch. That's a bit nasty. You can't just apply a postponed split at
>> checkpoint time, because the WAL record could easily be somewhere after
>> the checkpoint, leading to duplicate insertions.

> To do this we would need to have another rmgr specific routine that gets
> called at a recovery checkpoint. This would then write to disk the
> current state of the incomplete multi-WAL actions, in some manner.
> During the startup routines we would check for any pre-existing state
> files and use those to initialise the incomplete action cache. Cleanup
> would then discard all state files.

I thought about that too, but it seems very messy, eg you'd have to
actually fsync the state files to be sure they were safely down to disk.
Another problem is that WAL records between the checkpoint's REDO point
and the physical checkpoint location could get replayed twice, leading
to duplicate entries in the rmgr's state. Consider a split start WAL
entry located in that range, with the split completion entry after the
checkpoint --- on restart, we'd load a pending-split entry from the
state file and then create another one on seeing the split-start record
again.
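
The double-replay scenario can be sketched as follows. Everything here is hypothetical (the `PendingSplits` type, the helper names, block number 7); it only models a pending-split list that is restored from a state file and then fed the same split-start record a second time.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PENDING 16

/* Hypothetical stand-in for an rmgr's list of incomplete splits. */
typedef struct
{
    int blocks[MAX_PENDING];    /* block numbers with a split in progress */
    int count;
} PendingSplits;

/* Naive append with no duplicate check, as on seeing a split-start record. */
void
pending_add(PendingSplits *p, int blkno)
{
    assert(p != NULL && p->count < MAX_PENDING);
    p->blocks[p->count++] = blkno;
}

/* Removes the first entry matching blkno, as on seeing a split-completion record. */
void
pending_remove(PendingSplits *p, int blkno)
{
    assert(p != NULL);
    for (int i = 0; i < p->count; i++)
    {
        if (p->blocks[i] == blkno)
        {
            p->blocks[i] = p->blocks[--p->count];
            return;
        }
    }
}

/*
 * Simulates a restart that loads saved state and then replays from the REDO
 * point, which lies before the checkpoint record.  Returns the number of
 * entries left after the completion record has been replayed.
 */
int
restart_with_state_file(void)
{
    PendingSplits state = { {0}, 0 };

    /* The state file written at the checkpoint already holds block 7. */
    pending_add(&state, 7);

    /* Replay restarts at the REDO point and meets the split-start again. */
    pending_add(&state, 7);

    /* The completion record after the checkpoint clears only one entry. */
    pending_remove(&state, 7);

    return state.count;
}
```

In this model one stale entry survives, so cleanup would try to finish a split that has already completed.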

A compromise that might be good enough is to add an rmgr routine defined
as "bool is_idle(void)" that tests whether the rmgr has any open state
to worry about. Then, recovery checkpoints are done only if all rmgrs
say they are idle. That is, we only checkpoint if there is not a need
for any state files. At least for btree's usage, this should be all
right since the "split pending" state is short-lived and so most of the
time we'd not need to skip checkpoints. I'm not totally sure about GIST
or GIN though (Teodor?).
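
A minimal sketch of that gating rule, assuming each resource manager exposes an optional callback with the `bool is_idle(void)` shape suggested above. The `RmgrSketch` struct, the table layout and the demo counter are hypothetical and are not the real `RmgrData`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down resource manager entry. */
typedef struct
{
    const char *name;
    bool      (*is_idle)(void);     /* NULL: this rmgr never holds open state */
} RmgrSketch;

/* A recovery checkpoint is allowed only if every rmgr reports idle. */
bool
all_rmgrs_idle(const RmgrSketch *table, size_t n)
{
    assert(table != NULL || n == 0);

    for (size_t i = 0; i < n; i++)
        if (table[i].is_idle != NULL && !table[i].is_idle())
            return false;
    return true;
}

/* Demo state standing in for btree's list of incomplete splits. */
int demo_btree_pending_splits = 0;

bool
btree_is_idle(void)
{
    return demo_btree_pending_splits == 0;
}
```

Resource managers with no multi-record actions simply leave the callback NULL, so only the index access methods need to supply one.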

regards, tom lane

Simon Riggs

simon@2ndquadrant.com

over 19 years ago

In reply to: Tom Lane (#5)

Re: [PATCHES] Restartable Recovery

On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:

> A compromise that might be good enough is to add an rmgr routine defined
> as "bool is_idle(void)" that tests whether the rmgr has any open state
> to worry about. Then, recovery checkpoints are done only if all rmgrs
> say they are idle.

Like it.

> That is, we only checkpoint if there is not a need
> for any state files. At least for btree's usage, this should be all
> right since the "split pending" state is short-lived and so most of the
> time we'd not need to skip checkpoints. I'm not totally sure about GIST
> or GIN though (Teodor?).

Considering how infrequently we wanted to do recovery checkpoints, this
is unlikely to cause any issue. But in any case, this is the best we can
give people, rather than a compromise.

Perhaps that should be extended to say whether there are any
non-idempotent changes made in the last checkpoint period. That might
cover a wider set of potential actions.

If index splits in GIST or GIN are *not* short lived, then I would
imagine we'd have some serious contention problems to clear up since an
inconsistent index is unusable and would require portions of it to be
locked throughout such operations to ensure their atomicity.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Tom Lane

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Simon Riggs (#6)

Re: [PATCHES] Restartable Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

> On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:

>> A compromise that might be good enough is to add an rmgr routine defined
>> as "bool is_idle(void)" that tests whether the rmgr has any open state
>> to worry about. Then, recovery checkpoints are done only if all rmgrs
>> say they are idle.

> Perhaps that should be extended to say whether there are any
> non-idempotent changes made in the last checkpoint period. That might
> cover a wider set of potential actions.

Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
reasons the rmgr might have for not wanting to restart here.

If we are only going to do a recovery checkpoint at every Nth checkpoint
record, then occasionally having to skip one seems no big problem ---
just do it at the first subsequent record that is safe.
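
The "skip and retry at the next safe record" behaviour might look like the sketch below. The `RestartPolicy` struct and function name are hypothetical, and the `safe` argument stands for the combined answer of every rmgr's safe_to_checkpoint() check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical policy state for deciding when to take a recovery checkpoint. */
typedef struct
{
    int interval;               /* take one at every Nth checkpoint record */
    int records_since_last;     /* checkpoint records seen since the last one taken */
} RestartPolicy;

/*
 * Called once per checkpoint record during replay.  Returns true if a
 * recovery checkpoint is taken at this record.  When one is due but the
 * rmgrs say it is unsafe, the counter is left alone so that it stays due
 * and is taken at the first later record that is safe.
 */
bool
maybe_take_restart_checkpoint(RestartPolicy *p, bool safe)
{
    assert(p != NULL && p->interval > 0);

    p->records_since_last++;
    if (p->records_since_last < p->interval)
        return false;           /* not due yet */
    if (!safe)
        return false;           /* due but unsafe: retry at the next record */

    p->records_since_last = 0;
    return true;
}
```

Because the counter is reset only when a checkpoint is actually taken, a skipped one delays the restart point by however many records it takes for the rmgrs to become safe again.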

regards, tom lane

Simon Riggs

simon@2ndquadrant.com

over 19 years ago

In reply to: Tom Lane (#7)

Re: [PATCHES] Restartable Recovery

On Sun, 2006-07-16 at 15:33 -0400, Tom Lane wrote:

> Simon Riggs <simon@2ndquadrant.com> writes:

>> On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:

>>> A compromise that might be good enough is to add an rmgr routine defined
>>> as "bool is_idle(void)" that tests whether the rmgr has any open state
>>> to worry about. Then, recovery checkpoints are done only if all rmgrs
>>> say they are idle.

>> Perhaps that should be extended to say whether there are any
>> non-idempotent changes made in the last checkpoint period. That might
>> cover a wider set of potential actions.

> Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
> reasons the rmgr might have for not wanting to restart here.

You read my mind.

> If we are only going to do a recovery checkpoint at every Nth checkpoint
> record, then occasionally having to skip one seems no big problem ---
> just do it at the first subsequent record that is safe.

Got it.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Simon Riggs

simon@2ndquadrant.com

over 19 years ago

In reply to: Simon Riggs (#8)

1 attachment(s)

Re: [HACKERS] Restartable Recovery

On Sun, 2006-07-16 at 20:56 +0100, Simon Riggs wrote:

> On Sun, 2006-07-16 at 15:33 -0400, Tom Lane wrote:

>> Simon Riggs <simon@2ndquadrant.com> writes:

>>> On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:

>>>> A compromise that might be good enough is to add an rmgr routine defined
>>>> as "bool is_idle(void)" that tests whether the rmgr has any open state
>>>> to worry about. Then, recovery checkpoints are done only if all rmgrs
>>>> say they are idle.

>>> Perhaps that should be extended to say whether there are any
>>> non-idempotent changes made in the last checkpoint period. That might
>>> cover a wider set of potential actions.

>> Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
>> reasons the rmgr might have for not wanting to restart here.

> You read my mind.

>> If we are only going to do a recovery checkpoint at every Nth checkpoint
>> record, then occasionally having to skip one seems no big problem ---
>> just do it at the first subsequent record that is safe.

> Got it.

I've implemented this for BTree, GIN and GIST using an additional rmgr
function, bool rm_safe_restartpoint(void).

The functions are actually trivial, assuming I've understood this and
how GIST and GIN work for their xlogging.

"Recovery checkpoints" are now renamed "restartpoints" to avoid
confusion with checkpoints. So checkpoints occur during normal
processing (only) and restartpoints occur during recovery (only).

Updated patch enclosed, which I believe has no conflicts with the other
patches on xlog.c just submitted.

Much additional testing required, but the underlying concepts are very
simple really. Andreas: any further gotchas? :-)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

Attachments:

restartableRecovery2.patch (text/x-patch; charset=UTF-8)

Index: src/backend/access/gin/ginxlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/gin/ginxlog.c,v
retrieving revision 1.3
diff -c -r1.3 ginxlog.c
*** src/backend/access/gin/ginxlog.c	14 Jul 2006 14:52:16 -0000	1.3
--- src/backend/access/gin/ginxlog.c	31 Jul 2006 23:51:56 -0000
***************
*** 538,540 ****
--- 538,548 ----
  	MemoryContextDelete(opCtx);
  }
  
+ bool
+ gin_safe_restartpoint(void)
+ {
+     if (list_length(incomplete_splits) > 0)
+         return false;
+ 
+     return true;
+ }
Index: src/backend/access/gist/gistxlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/gist/gistxlog.c,v
retrieving revision 1.22
diff -c -r1.22 gistxlog.c
*** src/backend/access/gist/gistxlog.c	14 Jul 2006 14:52:16 -0000	1.22
--- src/backend/access/gist/gistxlog.c	31 Jul 2006 23:51:57 -0000
***************
*** 818,823 ****
--- 818,831 ----
  	MemoryContextDelete(insertCtx);
  }
  
+ bool
+ gist_safe_restartpoint(void)
+ {
+     if (list_length(incomplete_inserts) > 0)
+         return false;
+ 
+     return true;
+ }
  
  XLogRecData *
  formSplitRdata(RelFileNode node, BlockNumber blkno, bool page_is_leaf,
Index: src/backend/access/nbtree/nbtxlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/nbtree/nbtxlog.c,v
retrieving revision 1.36
diff -c -r1.36 nbtxlog.c
*** src/backend/access/nbtree/nbtxlog.c	25 Jul 2006 19:13:00 -0000	1.36
--- src/backend/access/nbtree/nbtxlog.c	31 Jul 2006 23:51:58 -0000
***************
*** 794,796 ****
--- 794,805 ----
  	}
  	incomplete_splits = NIL;
  }
+ 
+ bool
+ btree_safe_restartpoint(void)
+ {
+     if (list_length(incomplete_splits) > 0)
+         return false;
+ 
+     return true;
+ }
Index: src/backend/access/transam/rmgr.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/rmgr.c,v
retrieving revision 1.23
diff -c -r1.23 rmgr.c
*** src/backend/access/transam/rmgr.c	11 Jul 2006 17:26:58 -0000	1.23
--- src/backend/access/transam/rmgr.c	31 Jul 2006 23:51:58 -0000
***************
*** 23,42 ****
  
  
  const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! 	{"XLOG", xlog_redo, xlog_desc, NULL, NULL},
! 	{"Transaction", xact_redo, xact_desc, NULL, NULL},
! 	{"Storage", smgr_redo, smgr_desc, NULL, NULL},
! 	{"CLOG", clog_redo, clog_desc, NULL, NULL},
! 	{"Database", dbase_redo, dbase_desc, NULL, NULL},
! 	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL},
! 	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL},
! 	{"Reserved 7", NULL, NULL, NULL, NULL},
! 	{"Reserved 8", NULL, NULL, NULL, NULL},
! 	{"Reserved 9", NULL, NULL, NULL, NULL},
! 	{"Heap", heap_redo, heap_desc, NULL, NULL},
! 	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup},
! 	{"Hash", hash_redo, hash_desc, NULL, NULL},
! 	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup},
! 	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup},
! 	{"Sequence", seq_redo, seq_desc, NULL, NULL}
  };
--- 23,42 ----
  
  
  const RmgrData RmgrTable[RM_MAX_ID + 1] = {
! 	{"XLOG", xlog_redo, xlog_desc, NULL, NULL, NULL},
! 	{"Transaction", xact_redo, xact_desc, NULL, NULL, NULL},
! 	{"Storage", smgr_redo, smgr_desc, NULL, NULL, NULL},
! 	{"CLOG", clog_redo, clog_desc, NULL, NULL, NULL},
! 	{"Database", dbase_redo, dbase_desc, NULL, NULL, NULL},
! 	{"Tablespace", tblspc_redo, tblspc_desc, NULL, NULL, NULL},
! 	{"MultiXact", multixact_redo, multixact_desc, NULL, NULL, NULL},
! 	{"Reserved 7", NULL, NULL, NULL, NULL, NULL},
! 	{"Reserved 8", NULL, NULL, NULL, NULL, NULL},
! 	{"Reserved 9", NULL, NULL, NULL, NULL, NULL},
! 	{"Heap", heap_redo, heap_desc, NULL, NULL, NULL},
! 	{"Btree", btree_redo, btree_desc, btree_xlog_startup, btree_xlog_cleanup, btree_safe_restartpoint},
! 	{"Hash", hash_redo, hash_desc, NULL, NULL, NULL},
! 	{"Gin", gin_redo, gin_desc, gin_xlog_startup, gin_xlog_cleanup, gin_safe_restartpoint},
! 	{"Gist", gist_redo, gist_desc, gist_xlog_startup, gist_xlog_cleanup, gist_safe_restartpoint},
! 	{"Sequence", seq_redo, seq_desc, NULL, NULL, NULL}
  };
Index: src/backend/access/transam/xlog.c
===================================================================
RCS file: /projects/cvsroot/pgsql/src/backend/access/transam/xlog.c,v
retrieving revision 1.245
diff -c -r1.245 xlog.c
*** src/backend/access/transam/xlog.c	30 Jul 2006 02:07:18 -0000	1.245
--- src/backend/access/transam/xlog.c	31 Jul 2006 23:52:05 -0000
***************
*** 120,125 ****
--- 120,126 ----
  
  /* File path names (all relative to $PGDATA) */
  #define BACKUP_LABEL_FILE		"backup_label"
+ #define BACKUP_LABEL_IN_USE	    "backup_label.in_use"
  #define RECOVERY_COMMAND_FILE	"recovery.conf"
  #define RECOVERY_COMMAND_DONE	"recovery.done"
  
***************
*** 179,184 ****
--- 180,188 ----
  static bool recoveryTargetInclusive = true;
  static TransactionId recoveryTargetXid;
  static time_t recoveryTargetTime;
+ static bool InStandby = false;
+ /* How many XLOG_CHECKPOINT* entries since last restartpoint */
+ static int nCheckpoints = 0;    
  
  /* if recoveryStopsHere returns true, it saves actual stop xid/time here */
  static TransactionId recoveryStopXid;
***************
*** 492,497 ****
--- 496,502 ----
  					 uint32 endLogId, uint32 endLogSeg);
  static void WriteControlFile(void);
  static void ReadControlFile(void);
+ static void ValidateControlFile(void);
  static char *str_time(time_t tnow);
  static void issue_xlog_fsync(void);
  
***************
*** 501,507 ****
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void remove_backup_label(void);
  static void rm_redo_error_callback(void *arg);
! 
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
--- 506,512 ----
  static bool read_backup_label(XLogRecPtr *checkPointLoc);
  static void remove_backup_label(void);
  static void rm_redo_error_callback(void *arg);
! static void CheckPointShmem(XLogRecPtr checkPointRedo);
  
  /*
   * Insert an XLOG record having the specified RMID and info bytes,
***************
*** 3622,3627 ****
--- 3627,3659 ----
  		ereport(FATAL,
  				(errmsg("incorrect checksum in control file")));
  
+     ValidateControlFile();
+ 
+ 	if (pg_perm_setlocale(LC_COLLATE, ControlFile->lc_collate) == NULL)
+ 		ereport(FATAL,
+ 			(errmsg("database files are incompatible with operating system"),
+ 			 errdetail("The database cluster was initialized with LC_COLLATE \"%s\","
+ 					   " which is not recognized by setlocale().",
+ 					   ControlFile->lc_collate),
+ 			 errhint("It looks like you need to initdb or install locale support.")));
+ 	if (pg_perm_setlocale(LC_CTYPE, ControlFile->lc_ctype) == NULL)
+ 		ereport(FATAL,
+ 			(errmsg("database files are incompatible with operating system"),
+ 		errdetail("The database cluster was initialized with LC_CTYPE \"%s\","
+ 				  " which is not recognized by setlocale().",
+ 				  ControlFile->lc_ctype),
+ 			 errhint("It looks like you need to initdb or install locale support.")));
+ 
+ 	/* Make the fixed locale settings visible as GUC variables, too */
+ 	SetConfigOption("lc_collate", ControlFile->lc_collate,
+ 					PGC_INTERNAL, PGC_S_OVERRIDE);
+ 	SetConfigOption("lc_ctype", ControlFile->lc_ctype,
+ 					PGC_INTERNAL, PGC_S_OVERRIDE);
+ }
+ 
+ static void
+ ValidateControlFile(void)
+ {
  	/*
  	 * Do compatibility checking immediately.  We do this here for 2 reasons:
  	 *
***************
*** 3718,3743 ****
  				  " but the server was compiled with LOCALE_NAME_BUFLEN %d.",
  						   ControlFile->localeBuflen, LOCALE_NAME_BUFLEN),
  				 errhint("It looks like you need to recompile or initdb.")));
- 	if (pg_perm_setlocale(LC_COLLATE, ControlFile->lc_collate) == NULL)
- 		ereport(FATAL,
- 			(errmsg("database files are incompatible with operating system"),
- 			 errdetail("The database cluster was initialized with LC_COLLATE \"%s\","
- 					   " which is not recognized by setlocale().",
- 					   ControlFile->lc_collate),
- 			 errhint("It looks like you need to initdb or install locale support.")));
- 	if (pg_perm_setlocale(LC_CTYPE, ControlFile->lc_ctype) == NULL)
- 		ereport(FATAL,
- 			(errmsg("database files are incompatible with operating system"),
- 		errdetail("The database cluster was initialized with LC_CTYPE \"%s\","
- 				  " which is not recognized by setlocale().",
- 				  ControlFile->lc_ctype),
- 			 errhint("It looks like you need to initdb or install locale support.")));
- 
- 	/* Make the fixed locale settings visible as GUC variables, too */
- 	SetConfigOption("lc_collate", ControlFile->lc_collate,
- 					PGC_INTERNAL, PGC_S_OVERRIDE);
- 	SetConfigOption("lc_ctype", ControlFile->lc_ctype,
- 					PGC_INTERNAL, PGC_S_OVERRIDE);
  }
  
  void
--- 3750,3755 ----
***************
*** 3745,3750 ****
--- 3757,3764 ----
  {
  	int			fd;
  
+     ValidateControlFile();
+ 
  	INIT_CRC32(ControlFile->crc);
  	COMP_CRC32(ControlFile->crc,
  			   (char *) ControlFile,
***************
*** 4091,4096 ****
--- 4105,4119 ----
  					(errmsg("restore_command = \"%s\"",
  							recoveryRestoreCommand)));
  		}
+ 		else if (strcmp(tok1, "standby_mode") == 0)
+ 		{
+ 			if (strcmp(tok2, "true") == 0)
+             {
+                 InStandby = true;
+ 				ereport(LOG,
+ 						(errmsg("standby_mode = true")));
+             }
+         }
  		else if (strcmp(tok1, "recovery_target_timeline") == 0)
  		{
  			rtliGiven = true;
***************
*** 4226,4231 ****
--- 4249,4255 ----
  	 * We are no longer in archive recovery state.
  	 */
  	InArchiveRecovery = false;
+ 	InStandby = false;
  
  	/*
  	 * We should have the ending log segment currently open.  Verify, and then
***************
*** 4461,4472 ****
  		ereport(LOG,
  				(errmsg("database system shutdown was interrupted at %s",
  						str_time(ControlFile->time))));
! 	else if (ControlFile->state == DB_IN_RECOVERY)
  		ereport(LOG,
  		   (errmsg("database system was interrupted while in recovery at %s",
  				   str_time(ControlFile->time)),
  			errhint("This probably means that some data is corrupted and"
  					" you will have to use the last backup for recovery.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  				(errmsg("database system was interrupted at %s",
--- 4485,4502 ----
  		ereport(LOG,
  				(errmsg("database system shutdown was interrupted at %s",
  						str_time(ControlFile->time))));
! 	else if (ControlFile->state == DB_IN_CRASH_RECOVERY)
  		ereport(LOG,
  		   (errmsg("database system was interrupted while in recovery at %s",
  				   str_time(ControlFile->time)),
  			errhint("This probably means that some data is corrupted and"
  					" you will have to use the last backup for recovery.")));
+ 	else if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY)
+ 		ereport(LOG,
+ 		   (errmsg("database system was interrupted while in recovery at log time %s",
+ 				   str_time(ControlFile->time)),
+ 			errhint("If this has occurred more than once some data may be corrupted"
+ 					" and you may need to choose an earlier recovery target.")));
  	else if (ControlFile->state == DB_IN_PRODUCTION)
  		ereport(LOG,
  				(errmsg("database system was interrupted at %s",
***************
*** 4622,4637 ****
  	{
  		int			rmid;
  
  		if (InArchiveRecovery)
! 			ereport(LOG,
  					(errmsg("automatic recovery in progress")));
  		else
  			ereport(LOG,
  					(errmsg("database system was not properly shut down; "
  							"automatic recovery in progress")));
! 		ControlFile->state = DB_IN_RECOVERY;
  		ControlFile->time = time(NULL);
  		UpdateControlFile();
  
  		/* Start up the recovery environment */
  		XLogInitRelationCache();
--- 4652,4681 ----
  	{
  		int			rmid;
  
+         /*
+          * If we are in Archive Recovery then we create restartpoints
+          * to avoid needing to start right from the beginning again. 
+          */
+     	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
  		if (InArchiveRecovery)
!         {		
!         	ereport(LOG,
  					(errmsg("automatic recovery in progress")));
+     		ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+         }
  		else
+         {
  			ereport(LOG,
  					(errmsg("database system was not properly shut down; "
  							"automatic recovery in progress")));
!     		ControlFile->state = DB_IN_CRASH_RECOVERY;
!         }
  		ControlFile->time = time(NULL);
+     	ControlFile->prevCheckPoint = ControlFile->checkPoint;
+     	ControlFile->checkPoint = checkPointLoc;
+     	ControlFile->checkPointCopy = checkPoint;
  		UpdateControlFile();
+     	LWLockRelease(ControlFileLock);
  
  		/* Start up the recovery environment */
  		XLogInitRelationCache();
***************
*** 4664,4669 ****
--- 4708,4715 ----
  			ErrorContextCallback	errcontext;
  
  			InRedo = true;
+             nCheckpoints = 0;
+ 
  			ereport(LOG,
  					(errmsg("redo starts at %X/%X",
  							ReadRecPtr.xlogid, ReadRecPtr.xrecoff)));
***************
*** 5330,5341 ****
  		ereport(DEBUG2,
  				(errmsg("checkpoint starting")));
  
! 	CheckPointCLOG();
! 	CheckPointSUBTRANS();
! 	CheckPointMultiXact();
! 	FlushBufferPool();
! 	/* We deliberately delay 2PC checkpointing as long as possible */
! 	CheckPointTwoPhase(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
--- 5376,5385 ----
  		ereport(DEBUG2,
  				(errmsg("checkpoint starting")));
  
!     /*
!      * Ensure all of shared memory gets checkpointed
!      */
!     CheckPointShmem(checkPoint.redo);
  
  	START_CRIT_SECTION();
  
***************
*** 5454,5459 ****
--- 5498,5504 ----
  xlog_redo(XLogRecPtr lsn, XLogRecord *record)
  {
  	uint8		info = record->xl_info & ~XLR_INFO_MASK;
+ 	CheckPoint	checkPoint;
  
  	if (info == XLOG_NEXTOID)
  	{
***************
*** 5465,5475 ****
  			ShmemVariableCache->nextOid = nextOid;
  			ShmemVariableCache->oidCount = 0;
  		}
  	}
  	else if (info == XLOG_CHECKPOINT_SHUTDOWN)
  	{
- 		CheckPoint	checkPoint;
- 
  		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  		/* In a SHUTDOWN checkpoint, believe the counters exactly */
  		ShmemVariableCache->nextXid = checkPoint.nextXid;
--- 5510,5519 ----
  			ShmemVariableCache->nextOid = nextOid;
  			ShmemVariableCache->oidCount = 0;
  		}
+         return;
  	}
  	else if (info == XLOG_CHECKPOINT_SHUTDOWN)
  	{
  		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  		/* In a SHUTDOWN checkpoint, believe the counters exactly */
  		ShmemVariableCache->nextXid = checkPoint.nextXid;
***************
*** 5495,5502 ****
  	}
  	else if (info == XLOG_CHECKPOINT_ONLINE)
  	{
- 		CheckPoint	checkPoint;
- 
  		memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
  		/* In an ONLINE checkpoint, treat the counters like NEXTOID */
  		if (TransactionIdPrecedes(ShmemVariableCache->nextXid,
--- 5539,5544 ----
***************
*** 5515,5520 ****
--- 5557,5625 ----
  					(errmsg("unexpected timeline ID %u (should be %u) in checkpoint record",
  							checkPoint.ThisTimeLineID, ThisTimeLineID)));
  	}
+ 
+ #define RESTARTPOINT_INTERVAL 100
+ 
+     /*
+      * If we are in Standby mode, then mark a restartpoint for each
+      * checkpoint found in WAL replay. Otherwise, don't do this very
+      * frequently since this slows down recovery. A restartpoint is 
+      * simply a recreation of the database state after the original
+      * checkpoint: all database changes are written to disk, allowing us
+      * to restart recovery from that point. 
+      *
+      * Note: Should recovery ever be parallelised in the future,
+      * all work *must* stop until the restartpoint has completed.
+      */
+     if (InArchiveRecovery && (InStandby || nCheckpoints >= RESTARTPOINT_INTERVAL))
+     {
+         int     rmid;
+         bool    safe_restartpoint = true;
+ 
+         /*
+          * Is it safe to checkpoint? We must be careful to ask each of
+          * the resource managers whether they have any partial state
+          * information that might prevent a valid restartpoint from being
+          * written. If so, we skip this opportunity, but return
+          * on the next checkpoint record for another try.
+          */
+ 		for (rmid = 0; rmid <= RM_MAX_ID; rmid++)
+ 		{
+ 			if (RmgrTable[rmid].rm_safe_restartpoint != NULL && !RmgrTable[rmid].rm_safe_restartpoint())
+             {
+                 safe_restartpoint = false;
+ 				break;
+             }
+ 		}
+         
+         if (!safe_restartpoint)
+             return;
+ 
+         CheckPointShmem(lsn);
+ 
+     	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+    		ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+     	ControlFile->prevCheckPoint = ControlFile->checkPoint;
+         /* 
+          * The checkpoint record starts at ReadRecPtr; lsn is pointer to
+          * the next xlog record so must not be used here
+          */
+     	ControlFile->checkPoint = ReadRecPtr;
+     	ControlFile->checkPointCopy = checkPoint;
+         /* 
+          * Make it look like we started from this point, so this is *not*
+          * current time but original checkpoint time 
+          */
+     	ControlFile->time = checkPoint.time;
+     	UpdateControlFile();
+     	LWLockRelease(ControlFileLock);
+ 		ereport(LOG,
+ 				(errmsg("restartpoint at %X/%X",
+ 						lsn.xlogid, lsn.xrecoff)));
+         nCheckpoints = 0;
+     }
+     else
+         nCheckpoints++;
  }
  
  void
***************
*** 6102,6107 ****
--- 6207,6223 ----
  							histfilepath)));
  	}
  
+ 	/*
+ 	 * Rename the backup label file out of the way, so that we don't accidentally
+ 	 * re-start recovery from the beginning.
+ 	 */
+ 	unlink(BACKUP_LABEL_IN_USE);
+ 	if (rename(BACKUP_LABEL_FILE, BACKUP_LABEL_IN_USE) != 0)
+ 		ereport(FATAL,
+ 				(errcode_for_file_access(),
+ 				 errmsg("could not rename file \"%s\" to \"%s\": %m",
+ 						BACKUP_LABEL_FILE, BACKUP_LABEL_IN_USE)));
+ 
  	return true;
  }
  
***************
*** 6115,6126 ****
  static void
  remove_backup_label(void)
  {
! 	if (unlink(BACKUP_LABEL_FILE) != 0)
! 		if (errno != ENOENT)
! 			ereport(FATAL,
! 					(errcode_for_file_access(),
! 					 errmsg("could not remove file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
  }
  
  /*
--- 6231,6242 ----
  static void
  remove_backup_label(void)
  {
!     if (unlink(BACKUP_LABEL_IN_USE) != 0)
!         if (errno != ENOENT)
!             ereport(FATAL,
!                     (errcode_for_file_access(),
!                     errmsg("could not remove file \"%s\": %m",
!                                     BACKUP_LABEL_IN_USE)));
  }
  
  /*
***************
*** 6143,6145 ****
--- 6259,6274 ----
  
  	pfree(buf.data);
  }
+ 
+ /* 
+  * Flush all shared memory data zones and ensure fsync
+  */
+ static void CheckPointShmem(XLogRecPtr checkPointRedo)
+ {
+ 	CheckPointCLOG();
+ 	CheckPointSUBTRANS();
+ 	CheckPointMultiXact();
+ 	FlushBufferPool();     /* performs all required fsyncs */
+ 	/* We deliberately delay 2PC checkpointing as long as possible */
+ 	CheckPointTwoPhase(checkPointRedo);
+ }
Index: src/include/access/gin.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/gin.h,v
retrieving revision 1.5
diff -c -r1.5 gin.h
*** src/include/access/gin.h	11 Jul 2006 16:55:34 -0000	1.5
--- src/include/access/gin.h	31 Jul 2006 23:52:12 -0000
***************
*** 234,239 ****
--- 234,240 ----
  extern void gin_desc(StringInfo buf, uint8 xl_info, char *rec);
  extern void gin_xlog_startup(void);
  extern void gin_xlog_cleanup(void);
+ extern bool gin_safe_restartpoint(void);
  
  /* ginbtree.c */
  
Index: src/include/access/gist_private.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/gist_private.h,v
retrieving revision 1.22
diff -c -r1.22 gist_private.h
*** src/include/access/gist_private.h	11 Jul 2006 21:05:57 -0000	1.22
--- src/include/access/gist_private.h	31 Jul 2006 23:52:12 -0000
***************
*** 251,256 ****
--- 251,257 ----
  extern void gist_desc(StringInfo buf, uint8 xl_info, char *rec);
  extern void gist_xlog_startup(void);
  extern void gist_xlog_cleanup(void);
+ extern bool gist_safe_restartpoint(void);
  extern IndexTuple gist_form_invalid_tuple(BlockNumber blkno);
  
  extern XLogRecData *formUpdateRdata(RelFileNode node, Buffer buffer,
Index: src/include/access/nbtree.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/nbtree.h,v
retrieving revision 1.102
diff -c -r1.102 nbtree.h
*** src/include/access/nbtree.h	25 Jul 2006 19:13:00 -0000	1.102
--- src/include/access/nbtree.h	31 Jul 2006 23:52:13 -0000
***************
*** 545,549 ****
--- 545,550 ----
  extern void btree_desc(StringInfo buf, uint8 xl_info, char *rec);
  extern void btree_xlog_startup(void);
  extern void btree_xlog_cleanup(void);
+ extern bool btree_safe_restartpoint(void);
  
  #endif   /* NBTREE_H */
Index: src/include/access/xlog_internal.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/access/xlog_internal.h,v
retrieving revision 1.13
diff -c -r1.13 xlog_internal.h
*** src/include/access/xlog_internal.h	5 Apr 2006 03:34:05 -0000	1.13
--- src/include/access/xlog_internal.h	31 Jul 2006 23:52:13 -0000
***************
*** 232,237 ****
--- 232,238 ----
  	void		(*rm_desc) (StringInfo buf, uint8 xl_info, char *rec);
  	void		(*rm_startup) (void);
  	void		(*rm_cleanup) (void);
+     bool        (*rm_safe_restartpoint) (void);
  } RmgrData;
  
  extern const RmgrData RmgrTable[];
Index: src/include/catalog/pg_control.h
===================================================================
RCS file: /projects/cvsroot/pgsql/src/include/catalog/pg_control.h,v
retrieving revision 1.29
diff -c -r1.29 pg_control.h
*** src/include/catalog/pg_control.h	4 Apr 2006 22:39:59 -0000	1.29
--- src/include/catalog/pg_control.h	31 Jul 2006 23:52:14 -0000
***************
*** 55,61 ****
  	DB_STARTUP = 0,
  	DB_SHUTDOWNED,
  	DB_SHUTDOWNING,
! 	DB_IN_RECOVERY,
  	DB_IN_PRODUCTION
  } DBState;
  
--- 55,62 ----
  	DB_STARTUP = 0,
  	DB_SHUTDOWNED,
  	DB_SHUTDOWNING,
! 	DB_IN_CRASH_RECOVERY,
! 	DB_IN_ARCHIVE_RECOVERY,
  	DB_IN_PRODUCTION
  } DBState;

#10

Bruce Momjian

bruce@momjian.us

over 19 years ago

In reply to: Simon Riggs (#9)

Re: [HACKERS] Restartable Recovery

Nice. I was going to ask if this could make it into 8.2.

---------------------------------------------------------------------------

Simon Riggs wrote:

On Sun, 2006-07-16 at 20:56 +0100, Simon Riggs wrote:

On Sun, 2006-07-16 at 15:33 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

On Sun, 2006-07-16 at 12:40 -0400, Tom Lane wrote:

A compromise that might be good enough is to add an rmgr routine defined
as "bool is_idle(void)" that tests whether the rmgr has any open state
to worry about. Then, recovery checkpoints are done only if all rmgrs
say they are idle.

Perhaps that should be extended to say whether there are any
non-idempotent changes made in the last checkpoint period. That might
cover a wider set of potential actions.

Perhaps best to call it safe_to_checkpoint(), and not pre-judge what
reasons the rmgr might have for not wanting to restart here.

You read my mind.

If we are only going to do a recovery checkpoint at every Nth checkpoint
record, then occasionally having to skip one seems no big problem ---
just do it at the first subsequent record that is safe.

Got it.

I've implemented this for BTree, GIN, GIST using an additional rmgr
function bool rm_safe_restartpoint(void)

The functions are actually trivial, assuming I've understood this and
how GIST and GIN work for their xlogging.
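To illustrate the "only if all rmgrs say it is safe" rule discussed above, here is a minimal standalone sketch. It is not the patch code: DemoRmgr, demo_rmgrs, demo_pending_splits and the demo_* functions are hypothetical names invented for this example, and the "pending splits" counter is only an assumed stand-in for whatever partial state a real rmgr tracks. An rmgr with a NULL callback is treated as having no state to worry about.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one resource manager entry. */
typedef struct
{
	const char *name;
	bool		(*safe_restartpoint) (void);	/* NULL: nothing to check */
} DemoRmgr;

/* Assumed toy state: multi-record actions that are still incomplete. */
static int	demo_pending_splits = 0;

/* A trivial callback: safe only when no action is half-finished. */
static bool
demo_btree_safe_restartpoint(void)
{
	return demo_pending_splits == 0;
}

static const DemoRmgr demo_rmgrs[] = {
	{"Heap", NULL},
	{"Btree", demo_btree_safe_restartpoint},
	{"Sequence", NULL}
};

/*
 * A restartpoint may be taken only if no resource manager objects.
 * A NULL callback means the rmgr never holds state across records,
 * so the callback is called only when it is present.
 */
static bool
demo_all_rmgrs_safe(void)
{
	size_t		i;

	for (i = 0; i < sizeof(demo_rmgrs) / sizeof(demo_rmgrs[0]); i++)
	{
		if (demo_rmgrs[i].safe_restartpoint != NULL &&
			!demo_rmgrs[i].safe_restartpoint())
			return false;
	}
	return true;
}
```

If demo_all_rmgrs_safe() returns false at a checkpoint record, the caller would simply skip this opportunity and ask again at the next checkpoint record.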

"Recovery checkpoints" are now renamed "restartpoints" to avoid
confusion with checkpoints. So checkpoints occur during normal
processing (only) and restartpoints occur during recovery (only).

Updated patch enclosed, which I believe has no conflicts with the other
patches on xlog.c just submitted.

Much additional testing required, but the underlying concepts are very
simple really. Andreas: any further gotchas? :-)

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com

[ Attachment, skipping... ]

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +

#11

Tom Lane

tgl@sss.pgh.pa.us

over 19 years ago

In reply to: Simon Riggs (#9)

Re: [HACKERS] Restartable Recovery

Simon Riggs <simon@2ndquadrant.com> writes:

I've implemented this for BTree, GIN, GIST using an additional rmgr
function bool rm_safe_restartpoint(void)
...
"Recovery checkpoints" are now renamed "restartpoints" to avoid
confusion with checkpoints. So checkpoints occur during normal
processing (only) and restartpoints occur during recovery (only).

Applied with revisions. As submitted the patch pushed backup_label out
of the way immediately upon reading it, which is no good: you need to be
sure that the starting checkpoint location is written to pg_control
first, else an immediate crash would allow the thing to try to start
from whatever checkpoint is listed in the backed-up pg_control. Also,
the minimum recovery stopping point that's obtained using the label file
still has to be enforced if there's a crash during the replay sequence.
I felt the best way to do that was to copy the minimum stopping point
into pg_control, so that's what the code does now.
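The ordering requirement described above can be sketched roughly as follows. This is a simplified illustration using plain POSIX calls, not the committed xlog.c code: DemoControl, demo_write_control, demo_start_archive_recovery and all file paths are hypothetical, and the two-field struct is only an assumed stand-in for the relevant pg_control contents.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical subset of the control data that must survive a crash. */
typedef struct
{
	uint64_t	start_checkpoint;	/* where replay must begin */
	uint64_t	min_recovery_stop;	/* replay may not stop before this */
} DemoControl;

/* Write the control data and force it to disk before returning. */
static int
demo_write_control(const char *control_path, const DemoControl *ctl)
{
	int			fd = open(control_path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, ctl, sizeof(*ctl)) != (ssize_t) sizeof(*ctl) ||
		fsync(fd) != 0)
	{
		close(fd);
		return -1;
	}
	return close(fd);
}

/*
 * Step 1: make the starting checkpoint and minimum stopping point durable.
 * Step 2: only then move the label file out of the way.
 * A crash between the two steps leaves the label in place, so a restart
 * still begins from the right location.
 */
static int
demo_start_archive_recovery(const char *control_path,
							const char *label_path,
							const char *label_old_path,
							const DemoControl *ctl)
{
	if (demo_write_control(control_path, ctl) != 0)
		return -1;
	return rename(label_path, label_old_path);
}
```

Reversing the two steps would reintroduce the window described above, in which the label is already gone but the control data still points at the backed-up checkpoint.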

Also, as I mentioned earlier, I think that doing restartpoints on the
basis of elapsed time is simpler and more useful than having an explicit
distinction between "normal" and "standby" modes. We can always invent
a standby_mode flag later if we need one, but we don't need it for this.
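A time-based trigger of the kind described above might look something like the sketch below. The names demo_last_restartpoint and demo_restartpoint_due are hypothetical, and the 300-second interval is purely an assumption for illustration; this thread does not say which interval the committed code uses.

```c
#include <stdbool.h>
#include <time.h>

/* Assumed interval for illustration only. */
#define DEMO_RESTARTPOINT_SECONDS 300

/* Time at which the last restartpoint was taken. */
static time_t demo_last_restartpoint = 0;

/*
 * Called each time replay reaches a checkpoint record. Returns true,
 * and records the time, when enough time has passed since the previous
 * restartpoint; otherwise the record is replayed without one.
 */
static bool
demo_restartpoint_due(time_t now)
{
	if (difftime(now, demo_last_restartpoint) < DEMO_RESTARTPOINT_SECONDS)
		return false;
	demo_last_restartpoint = now;
	return true;
}
```

With this shape there is no separate counter or mode flag: a busy replay and a slowly fed standby both end up with a bounded amount of work to redo after a restart.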

regards, tom lane

#12

Simon Riggs

simon@2ndquadrant.com

over 19 years ago

In reply to: Tom Lane (#11)

Re: [HACKERS] Restartable Recovery

On Mon, 2006-08-07 at 13:05 -0400, Tom Lane wrote:

Simon Riggs <simon@2ndquadrant.com> writes:

I've implemented this for BTree, GIN, GIST using an additional rmgr
function bool rm_safe_restartpoint(void)
...
"Recovery checkpoints" are now renamed "restartpoints" to avoid
confusion with checkpoints. So checkpoints occur during normal
processing (only) and restartpoints occur during recovery (only).

Applied with revisions.

err....CheckPointGuts() :-) I guess patch reviews need some spicing up.

As submitted the patch pushed backup_label out
of the way immediately upon reading it, which is no good: you need to be
sure that the starting checkpoint location is written to pg_control
first, else an immediate crash would allow the thing to try to start
from whatever checkpoint is listed in the backed-up pg_control. Also,
the minimum recovery stopping point that's obtained using the label file
still has to be enforced if there's a crash during the replay sequence.
I felt the best way to do that was to copy the minimum stopping point
into pg_control, so that's what the code does now.

Thanks for checking that.

Also, as I mentioned earlier, I think that doing restartpoints on the
basis of elapsed time is simpler and more useful than having an explicit
distinction between "normal" and "standby" modes. We can always invent
a standby_mode flag later if we need one, but we don't need it for this.

OK, agreed.

The original thinking was that writing a restartpoint was more crucial
when in standby mode; but this way we've better performance and have a
low ceiling on the restart time if that should ever occur at the worst
moment.

Thanks again to Marko for the concept.

I'll work on the docs for backup.sgml also.

--
Simon Riggs
EnterpriseDB http://www.enterprisedb.com