Allowing multiple concurrent base backups

Started by Heikki Linnakangasabout 15 years ago46 messages
#1Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
2 attachment(s)

Now that we have a basic over-the-wire base backup capability in
walsender, it would be nice to allow taking multiple base backups at the
same time. It might not seem very useful at first, but it makes it
easier to set up standbys for small databases. At the moment, if you
want to set up two standbys, you have to either take a single base
backup and distribute it to both standbys, or somehow coordinate that
they don't try to take the base backup at the same time. Also, you don't
want initializing a standby to conflict with a nightly backup cron script.

So, this patch modifies the internal do_pg_start/stop_backup functions
so that in addition to the traditional mode of operation, where a
backup_label file is created in the data directory where it's backed up
along with all other files, the backup label file is be returned to the
caller, and the caller is responsible for including it in the backup.
The code in replication/basebackup.c includes it in the tar file that's
streamed the client, as "backup_label".

To make that safe, I've changed forcePageWrites into an integer.
Whenever a backup is started, it's incremented, and when one ends, it's
decremented. When forcePageWrites == 0, there's no backup in progress.

The user-visible pg_start_backup() function is not changed. You can only
have one backup started that way in progress at a time. But you can do
streaming base backups at the same time with traditional pg_start_backup().

I implemented this in two ways, and can't decide which I like better:

1. The contents of the backup label file are returned to the caller of
do_pg_start_backup() as a palloc'd string.

2. do_pg_start_backup() creates a temporary file that the backup label
is written to (instead of "backup_label").

Implementation 1 changes more code, as pg_start/stop_backup() need to be
changed to write/read from memory instead of file, but the result isn't
any more complicated. Nevertheless, I somehow feel more comfortable with 2.

Patches for both approaches attached. They're also available in my
github repository at git@github.com:hlinnaka/postgres.git.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

multiple_inprogress_backups1.patchtext/x-diff; name=multiple_inprogress_backups1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..cc4d46e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -338,7 +338,8 @@ typedef struct XLogCtlInsert
 	XLogPageHeader currpage;	/* points to header of block in cache */
 	char	   *currpos;		/* current insertion point in cache */
 	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
-	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
+	int			forcePageWrites;	/* forcing full-page writes for PITR? */
+	bool		exclusiveBackup;	/* a backup was started with pg_start_backup() */
 } XLogCtlInsert;
 
 /*
@@ -8313,7 +8314,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
-	startpoint = do_pg_start_backup(backupidstr, fast);
+	startpoint = do_pg_start_backup(backupidstr, fast, true, NULL);
 
 	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
 			 startpoint.xlogid, startpoint.xrecoff);
@@ -8321,7 +8322,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
 }
 
 XLogRecPtr
-do_pg_start_backup(const char *backupidstr, bool fast)
+do_pg_start_backup(const char *backupidstr, bool fast, bool exclusive, char **labelfile)
 {
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
@@ -8332,6 +8333,7 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	uint32		_logSeg;
 	struct stat stat_buf;
 	FILE	   *fp;
+	StringInfoData labelfbuf;
 
 	if (!superuser() && !is_authenticated_user_replication_role())
 		ereport(ERROR,
@@ -8368,15 +8370,19 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	 * ensure adequate interlocking against XLogInsert().
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	if (XLogCtl->Insert.forcePageWrites)
+	if (exclusive)
 	{
-		LWLockRelease(WALInsertLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("a backup is already in progress"),
-				 errhint("Run pg_stop_backup() and try again.")));
+		if (XLogCtl->Insert.exclusiveBackup)
+		{
+			LWLockRelease(WALInsertLock);
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("a backup is already in progress"),
+					 errhint("Run pg_stop_backup() and try again.")));
+		}
+		XLogCtl->Insert.exclusiveBackup = true;
 	}
-	XLogCtl->Insert.forcePageWrites = true;
+	XLogCtl->Insert.forcePageWrites++;
 	LWLockRelease(WALInsertLock);
 
 	/*
@@ -8393,7 +8399,7 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	RequestXLogSwitch();
 
 	/* Ensure we release forcePageWrites if fail below */
-	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
+	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 	{
 		/*
 		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
@@ -8420,54 +8426,67 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 		XLByteToSeg(startpoint, _logId, _logSeg);
 		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
 
+		/*
+		 * Construct backup label file 
+		 */
+		initStringInfo(&labelfbuf);
+
 		/* Use the log timezone here, not the session timezone */
 		stamp_time = (pg_time_t) time(NULL);
 		pg_strftime(strfbuf, sizeof(strfbuf),
 					"%Y-%m-%d %H:%M:%S %Z",
 					pg_localtime(&stamp_time, log_timezone));
+		appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
+						 startpoint.xlogid, startpoint.xrecoff, xlogfilename);
+		appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
+						 checkpointloc.xlogid, checkpointloc.xrecoff);
+		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
+		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
 
 		/*
-		 * Check for existing backup label --- implies a backup is already
-		 * running.  (XXX given that we checked forcePageWrites above, maybe
-		 * it would be OK to just unlink any such label file?)
+		 * Okay, write the file, or return it to caller.
 		 */
-		if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
+		if (exclusive)
 		{
-			if (errno != ENOENT)
+			/*
+			 * Check for existing backup label --- implies a backup is already
+			 * running.  (XXX given that we checked exclusiveBackup above, maybe
+			 * it would be OK to just unlink any such label file?)
+			 */
+			if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
+			{
+				if (errno != ENOENT)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not stat file \"%s\": %m",
+									BACKUP_LABEL_FILE)));
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("a backup is already in progress"),
+						 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
+								 BACKUP_LABEL_FILE)));
+
+			fp = AllocateFile(BACKUP_LABEL_FILE, "w");
+
+			if (!fp)
 				ereport(ERROR,
 						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m",
+						 errmsg("could not create file \"%s\": %m",
 								BACKUP_LABEL_FILE)));
+			fwrite(labelfbuf.data, labelfbuf.len, 1, fp);
+			if (fflush(fp) || ferror(fp) || FreeFile(fp))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write file \"%s\": %m",
+								BACKUP_LABEL_FILE)));
+			pfree(labelfbuf.data);
 		}
 		else
-			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("a backup is already in progress"),
-					 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
-							 BACKUP_LABEL_FILE)));
-
-		/*
-		 * Okay, write the file
-		 */
-		fp = AllocateFile(BACKUP_LABEL_FILE, "w");
-		if (!fp)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not create file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
-		fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
-				startpoint.xlogid, startpoint.xrecoff, xlogfilename);
-		fprintf(fp, "CHECKPOINT LOCATION: %X/%X\n",
-				checkpointloc.xlogid, checkpointloc.xrecoff);
-		fprintf(fp, "START TIME: %s\n", strfbuf);
-		fprintf(fp, "LABEL: %s\n", backupidstr);
-		if (fflush(fp) || ferror(fp) || FreeFile(fp))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
+			*labelfile = labelfbuf.data;
 	}
-	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
+	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
@@ -8479,8 +8498,15 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 static void
 pg_start_backup_callback(int code, Datum arg)
 {
-	/* Turn off forcePageWrites on failure */
+	bool exclusive = DatumGetBool(arg);
+
+	/* Decrement forcePageWrites on failure */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	if (exclusive)
+	{
+		Assert(XLogCtl->Insert.exclusiveBackup);
+		XLogCtl->Insert.exclusiveBackup = false;
+	}
 	XLogCtl->Insert.forcePageWrites = false;
 	LWLockRelease(WALInsertLock);
 }
@@ -8504,16 +8530,20 @@ pg_stop_backup(PG_FUNCTION_ARGS)
 	XLogRecPtr	stoppoint;
 	char		stopxlogstr[MAXFNAMELEN];
 
-	stoppoint = do_pg_stop_backup();
+	stoppoint = do_pg_stop_backup(NULL);
 
 	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
 			 stoppoint.xlogid, stoppoint.xrecoff);
 	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
 }
 
+/*
+ * If labelfile is NULL, this stops an exclusive backup.
+ */
 XLogRecPtr
-do_pg_stop_backup(void)
+do_pg_stop_backup(char *labelfile)
 {
+	bool		exclusive = (labelfile == NULL);
 	XLogRecPtr	startpoint;
 	XLogRecPtr	stoppoint;
 	XLogRecData rdata;
@@ -8529,10 +8559,10 @@ do_pg_stop_backup(void)
 	FILE	   *lfp;
 	FILE	   *fp;
 	char		ch;
-	int			ich;
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	char	   *remaining;
 
 	if (!superuser() && !is_authenticated_user_replication_role())
 		ereport(ERROR,
@@ -8555,35 +8585,71 @@ do_pg_stop_backup(void)
 	 * OK to clear forcePageWrites
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
+	if (exclusive)
+	{
+		if (XLogCtl->Insert.exclusiveBackup)
+		{
+			XLogCtl->Insert.forcePageWrites--;
+			XLogCtl->Insert.exclusiveBackup = false;
+		}
+	}
+	XLogCtl->Insert.forcePageWrites--;
 	LWLockRelease(WALInsertLock);
 
-	/*
-	 * Open the existing label file
-	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
-	if (!lfp)
+	if (!labelfile)
 	{
-		if (errno != ENOENT)
+		int len;
+
+		/*
+		 * Read the existing label file
+		 */
+		lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
+		if (!lfp)
+		{
+			if (errno != ENOENT)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not read file \"%s\": %m",
+								labelfile)));
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("a backup is not in progress")));
+		}
+
+		/* The label file should be small */
+		labelfile = palloc(1024);
+		len = fread(labelfile, 1, 1023, lfp);
+		if (len >= 1023)
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("backup_label file \"%s\" is too long", BACKUP_LABEL_FILE)));
+
+		/*
+		 * Close and remove the backup label file
+		 */
+		if (ferror(lfp) || FreeFile(lfp))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m",
 							BACKUP_LABEL_FILE)));
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("a backup is not in progress")));
+		if (unlink(BACKUP_LABEL_FILE) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m",
+							BACKUP_LABEL_FILE)));
 	}
 
 	/*
 	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
 	 * but we are not expecting any variability in the file format).
 	 */
-	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %24s)%c",
+	if (sscanf(labelfile, "START WAL LOCATION: %X/%X (file %24s)%c",
 			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
 			   &ch) != 4 || ch != '\n')
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+	remaining = strchr(labelfile, '\n') + 1;
 
 	/*
 	 * Write the backup-end xlog record
@@ -8626,8 +8692,7 @@ do_pg_stop_backup(void)
 	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
 			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
 	/* transfer remaining lines from label to history file */
-	while ((ich = fgetc(lfp)) != EOF)
-		fputc(ich, fp);
+	fwrite(remaining, strlen(remaining), 1, fp);
 	fprintf(fp, "STOP TIME: %s\n", strfbuf);
 	if (fflush(fp) || ferror(fp) || FreeFile(fp))
 		ereport(ERROR,
@@ -8636,20 +8701,6 @@ do_pg_stop_backup(void)
 						histfilepath)));
 
 	/*
-	 * Close and remove the backup label file
-	 */
-	if (ferror(lfp) || FreeFile(lfp))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-	if (unlink(BACKUP_LABEL_FILE) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-
-	/*
 	 * Clean out any no-longer-needed history files.  As a side effect, this
 	 * will post a .ready file for the newly created history file, notifying
 	 * the archiver that history file may be archived immediately.
@@ -8730,9 +8781,13 @@ do_pg_stop_backup(void)
 /*
  * do_pg_abort_backup: abort a running backup
  *
- * This does just the most basic steps of pg_stop_backup(), by taking the
+ * This does just the most basic steps of do_pg_stop_backup(), by taking the
  * system out of backup mode, thus making it a lot more safe to call from
  * an error handler.
+ *
+ * NB: This is only for aborting a non-exclusive backup that doesn't write
+ * backup_label. A backup started with pg_stop_backup() needs to be finished
+ * with pg_stop_backup().
  */
 void
 do_pg_abort_backup(void)
@@ -8741,17 +8796,14 @@ do_pg_abort_backup(void)
 	 * OK to clear forcePageWrites
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
-	LWLockRelease(WALInsertLock);
-
 	/*
-	 * Remove backup label file
+	 * The caller should take care to not call us only if a backup, but
+	 * be extra paranoid to avoid wrapping forcePageWrites around if the
+	 * caller screws up.
 	 */
-	if (unlink(BACKUP_LABEL_FILE) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
+	if (XLogCtl->Insert.forcePageWrites > 0)
+		XLogCtl->Insert.forcePageWrites--;
+	LWLockRelease(WALInsertLock);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index de5fa22..c31b091 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -31,13 +31,17 @@
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 
+/* XXX: copied from xlog.c */
+#define BACKUP_LABEL_FILE		"backup_label"
+
 static int64 sendDir(char *path, int basepathlen, bool sizeonly);
-static void sendFile(char *path, int basepathlen, struct stat * statbuf);
+static void sendFile(char *readfilename, char *tarfilename,
+		 struct stat * statbuf);
 static void _tarWriteHeader(char *filename, char *linktarget,
 				struct stat * statbuf);
 static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
-static void SendBackupDirectory(char *location, char *spcoid);
+static void SendBackupDirectories(List *tablespaces, char *labelfilepath);
 static void base_backup_cleanup(int code, Datum arg);
 
 typedef struct
@@ -75,6 +79,7 @@ SendBaseBackup(const char *options)
 	tablespaceinfo *ti;
 	MemoryContext backup_context;
 	MemoryContext old_context;
+	char	   *labelfile;
 
 	backup_context = AllocSetContextCreate(CurrentMemoryContext,
 										   "Streaming base backup context",
@@ -145,26 +150,19 @@ SendBaseBackup(const char *options)
 	}
 	FreeDir(dir);
 
-	do_pg_start_backup(backup_label, true);
+	do_pg_start_backup(backup_label, true, false, &labelfile);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
-		ListCell   *lc;
-
 		/* Send tablespace header */
 		SendBackupHeader(tablespaces);
 
-		/* Send off our tablespaces one by one */
-		foreach(lc, tablespaces)
-		{
-			ti = (tablespaceinfo *) lfirst(lc);
-
-			SendBackupDirectory(ti->path, ti->oid);
-		}
+		/* Send the tars */
+		SendBackupDirectories(tablespaces, labelfile);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
-	do_pg_stop_backup();
+	do_pg_stop_backup(labelfile);
 
 	MemoryContextSwitchTo(old_context);
 	MemoryContextDelete(backup_context);
@@ -250,20 +248,65 @@ SendBackupHeader(List *tablespaces)
 }
 
 static void
-SendBackupDirectory(char *location, char *spcoid)
+SendBackupDirectories(List *tablespaces, char *labelfile)
 {
 	StringInfoData buf;
+	ListCell *lc;
 
-	/* Send CopyOutResponse message */
-	pq_beginmessage(&buf, 'H');
-	pq_sendbyte(&buf, 0);		/* overall format */
-	pq_sendint(&buf, 0, 2);		/* natts */
-	pq_endmessage(&buf);
+	/* Send off our tablespaces one by one */
+	foreach(lc, tablespaces)
+	{
+		tablespaceinfo *ti = (tablespaceinfo *) lfirst(lc);
+		char *location = ti->path;
+
+		/* Send CopyOutResponse message */
+		pq_beginmessage(&buf, 'H');
+		pq_sendbyte(&buf, 0);		/* overall format */
+		pq_sendint(&buf, 0, 2);		/* natts */
+		pq_endmessage(&buf);
+
+		/*
+		 * tar up the data directory if NULL, otherwise the tablespace
+		 * In the main tar, also include the backup_label.
+		 */
+		if (location == NULL)
+		{
+			struct stat statbuf;
+			int pad, len;
 
-	/* tar up the data directory if NULL, otherwise the tablespace */
-	sendDir(location == NULL ? "." : location,
-			location == NULL ? 1 : strlen(location),
-			false);
+			len = strlen(labelfile);
+
+			/*
+			 * We use PG_VERSION as a template for the uid and gid and mode
+			 * to use for the backup_label file. XXX: would "." be better?
+			 * Or something else?
+			 */
+			if (lstat("PG_VERSION", &statbuf) != 0)
+				ereport(ERROR,
+						(errcode(errcode_for_file_access()),
+						 errmsg("could not stat file \"%s\": %m",
+								"PG_VERSION")));
+			statbuf.st_mtime = time(NULL);
+			statbuf.st_size = len;
+
+			_tarWriteHeader(BACKUP_LABEL_FILE, NULL, &statbuf);
+			/* Send the contents as a CopyData message */
+			pq_putmessage('d', labelfile, len);
+
+			/* Pad to 512 byte boundary, per tar format requirements */
+			pad = ((len + 511) & ~511) - len;
+			if (pad > 0)
+			{
+				char buf[512];
+				MemSet(buf, 0, pad);
+				pq_putmessage('d', buf, pad);
+			}
+
+			location = ".";
+		}
+
+		sendDir(location, strlen(location),	false);
+	}
 
 	/* Send CopyDone message */
 	pq_putemptymessage('c');
@@ -360,7 +403,7 @@ sendDir(char *path, int basepathlen, bool sizeonly)
 			/* Add size, rounded up to 512byte block */
 			size += ((statbuf.st_size + 511) & ~511);
 			if (!sizeonly)
-				sendFile(pathbuf, basepathlen, &statbuf);
+				sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf);
 			size += 512;		/* Size of the header of the file */
 		}
 		else
@@ -418,7 +461,7 @@ _tarChecksum(char *header)
 
 /* Given the member, write the TAR header & send the file */
 static void
-sendFile(char *filename, int basepathlen, struct stat * statbuf)
+sendFile(char *readfilename, char *tarfilename, struct stat * statbuf)
 {
 	FILE	   *fp;
 	char		buf[32768];
@@ -426,11 +469,11 @@ sendFile(char *filename, int basepathlen, struct stat * statbuf)
 	pgoff_t		len = 0;
 	size_t		pad;
 
-	fp = AllocateFile(filename, "rb");
+	fp = AllocateFile(readfilename, "rb");
 	if (fp == NULL)
 		ereport(ERROR,
 				(errcode(errcode_for_file_access()),
-				 errmsg("could not open file \"%s\": %m", filename)));
+				 errmsg("could not open file \"%s\": %m", readfilename)));
 
 	/*
 	 * Some compilers will throw a warning knowing this test can never be true
@@ -439,9 +482,9 @@ sendFile(char *filename, int basepathlen, struct stat * statbuf)
 	if (statbuf->st_size > MAX_TAR_MEMBER_FILELEN)
 		ereport(ERROR,
 				(errmsg("archive member \"%s\" too large for tar format",
-						filename)));
+						tarfilename)));
 
-	_tarWriteHeader(filename + basepathlen + 1, NULL, statbuf);
+	_tarWriteHeader(tarfilename, NULL, statbuf);
 
 	while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
 	{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 74d3427..babaaa6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,8 +312,8 @@ extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(void);
 extern void WakeupRecovery(void);
 
-extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast);
-extern XLogRecPtr do_pg_stop_backup(void);
+extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, bool exclusive, char **labelfile);
+extern XLogRecPtr do_pg_stop_backup(char *labelfile);
 extern void do_pg_abort_backup(void);
 
 #endif   /* XLOG_H */
multiple_inprogress_backups2.patchtext/x-diff; name=multiple_inprogress_backups2.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..3d0f813 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -338,7 +338,8 @@ typedef struct XLogCtlInsert
 	XLogPageHeader currpage;	/* points to header of block in cache */
 	char	   *currpos;		/* current insertion point in cache */
 	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
-	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
+	int			forcePageWrites;	/* forcing full-page writes for PITR? */
+	bool		exclusiveBackup;	/* a backup was started with pg_start_backup() */
 } XLogCtlInsert;
 
 /*
@@ -8313,7 +8314,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
-	startpoint = do_pg_start_backup(backupidstr, fast);
+	startpoint = do_pg_start_backup(backupidstr, fast, true, NULL);
 
 	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
 			 startpoint.xlogid, startpoint.xrecoff);
@@ -8321,7 +8322,7 @@ pg_start_backup(PG_FUNCTION_ARGS)
 }
 
 XLogRecPtr
-do_pg_start_backup(const char *backupidstr, bool fast)
+do_pg_start_backup(const char *backupidstr, bool fast, bool exclusive, char **labelfilename)
 {
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
@@ -8368,15 +8369,19 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	 * ensure adequate interlocking against XLogInsert().
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	if (XLogCtl->Insert.forcePageWrites)
+	if (exclusive)
 	{
-		LWLockRelease(WALInsertLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("a backup is already in progress"),
-				 errhint("Run pg_stop_backup() and try again.")));
+		if (XLogCtl->Insert.exclusiveBackup)
+		{
+			LWLockRelease(WALInsertLock);
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("a backup is already in progress"),
+					 errhint("Run pg_stop_backup() and try again.")));
+		}
+		XLogCtl->Insert.exclusiveBackup = true;
 	}
-	XLogCtl->Insert.forcePageWrites = true;
+	XLogCtl->Insert.forcePageWrites++;
 	LWLockRelease(WALInsertLock);
 
 	/*
@@ -8393,7 +8398,7 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	RequestXLogSwitch();
 
 	/* Ensure we release forcePageWrites if fail below */
-	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
+	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 	{
 		/*
 		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
@@ -8427,29 +8432,55 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 					pg_localtime(&stamp_time, log_timezone));
 
 		/*
-		 * Check for existing backup label --- implies a backup is already
-		 * running.  (XXX given that we checked forcePageWrites above, maybe
-		 * it would be OK to just unlink any such label file?)
+		 * Okay, write the file
 		 */
-		if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
+		if (exclusive)
 		{
-			if (errno != ENOENT)
+			/*
+			 * Check for existing backup label --- implies a backup is already
+			 * running.  (XXX given that we checked exclusiveBackup above, maybe
+			 * it would be OK to just unlink any such label file?)
+			 */
+			if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
+			{
+				if (errno != ENOENT)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not stat file \"%s\": %m",
+									BACKUP_LABEL_FILE)));
+			}
+			else
 				ereport(ERROR,
-						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m",
-								BACKUP_LABEL_FILE)));
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("a backup is already in progress"),
+						 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
+								 BACKUP_LABEL_FILE)));
+
+			fp = AllocateFile(BACKUP_LABEL_FILE, "w");
 		}
 		else
-			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("a backup is already in progress"),
-					 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
-							 BACKUP_LABEL_FILE)));
+		{
+			char historyfilename[MAXFNAMELEN];
+			/* Calculate name for temp file */
+			XLByteToSeg(startpoint, _logId, _logSeg);
+			BackupHistoryFileName(historyfilename, ThisTimeLineID,
+								  _logId, _logSeg, startpoint.xrecoff % XLogSegSize);
+
+			*labelfilename = palloc(MAXPGPATH);
+			snprintf(*labelfilename, MAXPGPATH, "%s/%s.%s",
+					 PG_TEMP_FILES_DIR, PG_TEMP_FILE_PREFIX,
+					 historyfilename);
+
+			/* Open file */
+			fp = AllocateFile(*labelfilename, PG_BINARY_W);
+			if (!fp)
+			{
+				/* As in OpenTemporaryFile, try to make the temp-file directory */
+				mkdir(PG_TEMP_FILES_DIR, S_IRWXU);
 
-		/*
-		 * Okay, write the file
-		 */
-		fp = AllocateFile(BACKUP_LABEL_FILE, "w");
+				fp = AllocateFile(*labelfilename, PG_BINARY_W);
+			}
+		}
 		if (!fp)
 			ereport(ERROR,
 					(errcode_for_file_access(),
@@ -8467,7 +8498,7 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 					 errmsg("could not write file \"%s\": %m",
 							BACKUP_LABEL_FILE)));
 	}
-	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
+	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
@@ -8479,8 +8510,15 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 static void
 pg_start_backup_callback(int code, Datum arg)
 {
-	/* Turn off forcePageWrites on failure */
+	bool exclusive = DatumGetBool(arg);
+
+	/* Decrement forcePageWrites on failure */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
+	if (exclusive)
+	{
+		Assert(XLogCtl->Insert.exclusiveBackup);
+		XLogCtl->Insert.exclusiveBackup = false;
+	}
 	XLogCtl->Insert.forcePageWrites = false;
 	LWLockRelease(WALInsertLock);
 }
@@ -8504,16 +8542,20 @@ pg_stop_backup(PG_FUNCTION_ARGS)
 	XLogRecPtr	stoppoint;
 	char		stopxlogstr[MAXFNAMELEN];
 
-	stoppoint = do_pg_stop_backup();
+	stoppoint = do_pg_stop_backup(NULL);
 
 	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
 			 stoppoint.xlogid, stoppoint.xrecoff);
 	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
 }
 
+/*
+ * If labelfile is NULL, this stops an exclusive backup.
+ */
 XLogRecPtr
-do_pg_stop_backup(void)
+do_pg_stop_backup(const char *labelfile)
 {
+	bool		exclusive;
 	XLogRecPtr	startpoint;
 	XLogRecPtr	stoppoint;
 	XLogRecData rdata;
@@ -8534,6 +8576,14 @@ do_pg_stop_backup(void)
 	int			waits = 0;
 	bool		reported_waiting = false;
 
+	if (labelfile == NULL)
+	{
+		exclusive = true;
+		labelfile = BACKUP_LABEL_FILE;
+	}
+	else
+		exclusive = false;
+
 	if (!superuser() && !is_authenticated_user_replication_role())
 		ereport(ERROR,
 				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
@@ -8555,20 +8605,28 @@ do_pg_stop_backup(void)
 	 * OK to clear forcePageWrites
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
+	if (exclusive)
+	{
+		if (XLogCtl->Insert.exclusiveBackup)
+		{
+			XLogCtl->Insert.forcePageWrites--;
+			XLogCtl->Insert.exclusiveBackup = false;
+		}
+	}
+	XLogCtl->Insert.forcePageWrites--;
 	LWLockRelease(WALInsertLock);
 
 	/*
 	 * Open the existing label file
 	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
+	lfp = AllocateFile(labelfile, "r");
 	if (!lfp)
 	{
 		if (errno != ENOENT)
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
+							labelfile)));
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("a backup is not in progress")));
@@ -8643,7 +8701,7 @@ do_pg_stop_backup(void)
 				(errcode_for_file_access(),
 				 errmsg("could not read file \"%s\": %m",
 						BACKUP_LABEL_FILE)));
-	if (unlink(BACKUP_LABEL_FILE) != 0)
+	if (unlink(labelfile) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not remove file \"%s\": %m",
@@ -8730,9 +8788,13 @@ do_pg_stop_backup(void)
 /*
  * do_pg_abort_backup: abort a running backup
  *
- * This does just the most basic steps of pg_stop_backup(), by taking the
+ * This does just the most basic steps of do_pg_stop_backup(), by taking the
  * system out of backup mode, thus making it a lot more safe to call from
  * an error handler.
+ *
+ * NB: This is only for aborting a non-exclusive backup that doesn't write
+ * backup_label. A backup started with pg_stop_backup() needs to be finished
+ * with pg_stop_backup().
  */
 void
 do_pg_abort_backup(void)
@@ -8741,17 +8803,14 @@ do_pg_abort_backup(void)
 	 * OK to clear forcePageWrites
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
-	LWLockRelease(WALInsertLock);
-
 	/*
-	 * Remove backup label file
+	 * The caller should take care to not call us only if a backup, but
+	 * be extra paranoid to avoid wrapping forcePageWrites around if the
+	 * caller screws up.
 	 */
-	if (unlink(BACKUP_LABEL_FILE) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
+	if (XLogCtl->Insert.forcePageWrites > 0)
+		XLogCtl->Insert.forcePageWrites--;
+	LWLockRelease(WALInsertLock);
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index de5fa22..a3d4e5c 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -31,13 +31,17 @@
 #include "utils/memutils.h"
 #include "utils/ps_status.h"
 
+/* XXX: copied from xlog.c */
+#define BACKUP_LABEL_FILE		"backup_label"
+
 static int64 sendDir(char *path, int basepathlen, bool sizeonly);
-static void sendFile(char *path, int basepathlen, struct stat * statbuf);
+static void sendFile(char *readfilename, char *tarfilename,
+		 struct stat * statbuf);
 static void _tarWriteHeader(char *filename, char *linktarget,
 				struct stat * statbuf);
 static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
-static void SendBackupDirectory(char *location, char *spcoid);
+static void SendBackupDirectories(List *tablespaces, char *labelfilepath);
 static void base_backup_cleanup(int code, Datum arg);
 
 typedef struct
@@ -75,6 +79,7 @@ SendBaseBackup(const char *options)
 	tablespaceinfo *ti;
 	MemoryContext backup_context;
 	MemoryContext old_context;
+	char	   *labelfilepath;
 
 	backup_context = AllocSetContextCreate(CurrentMemoryContext,
 										   "Streaming base backup context",
@@ -145,26 +150,19 @@ SendBaseBackup(const char *options)
 	}
 	FreeDir(dir);
 
-	do_pg_start_backup(backup_label, true);
+	do_pg_start_backup(backup_label, true, false, &labelfilepath);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
-		ListCell   *lc;
-
 		/* Send tablespace header */
 		SendBackupHeader(tablespaces);
 
-		/* Send off our tablespaces one by one */
-		foreach(lc, tablespaces)
-		{
-			ti = (tablespaceinfo *) lfirst(lc);
-
-			SendBackupDirectory(ti->path, ti->oid);
-		}
+		/* Send the tars */
+		SendBackupDirectories(tablespaces, labelfilepath);
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
-	do_pg_stop_backup();
+	do_pg_stop_backup(labelfilepath);
 
 	MemoryContextSwitchTo(old_context);
 	MemoryContextDelete(backup_context);
@@ -250,20 +248,44 @@ SendBackupHeader(List *tablespaces)
 }
 
 static void
-SendBackupDirectory(char *location, char *spcoid)
+SendBackupDirectories(List *tablespaces, char *labelfilepath)
 {
 	StringInfoData buf;
+	ListCell *lc;
 
-	/* Send CopyOutResponse message */
-	pq_beginmessage(&buf, 'H');
-	pq_sendbyte(&buf, 0);		/* overall format */
-	pq_sendint(&buf, 0, 2);		/* natts */
-	pq_endmessage(&buf);
+	/* Send off our tablespaces one by one */
+	foreach(lc, tablespaces)
+	{
+		tablespaceinfo *ti = (tablespaceinfo *) lfirst(lc);
+		char *location = ti->path;
+
+		/* Send CopyOutResponse message */
+		pq_beginmessage(&buf, 'H');
+		pq_sendbyte(&buf, 0);		/* overall format */
+		pq_sendint(&buf, 0, 2);		/* natts */
+		pq_endmessage(&buf);
+
+		/*
+		 * tar up the data directory if NULL, otherwise the tablespace
+		 * In the main tar, also include the backup_label.
+		 */
+		if (location == NULL)
+		{
+			struct stat statbuf;
+
+			if (lstat(labelfilepath, &statbuf) != 0)
+				ereport(ERROR,
+						(errcode(errcode_for_file_access()),
+						 errmsg("could not read temporary backup label file \"%s\": %m",
+								labelfilepath)));
 
-	/* tar up the data directory if NULL, otherwise the tablespace */
-	sendDir(location == NULL ? "." : location,
-			location == NULL ? 1 : strlen(location),
-			false);
+			sendFile(labelfilepath, BACKUP_LABEL_FILE, &statbuf);
+
+			location = ".";
+		}
+
+		sendDir(location, strlen(location),	false);
+	}
 
 	/* Send CopyDone message */
 	pq_putemptymessage('c');
@@ -360,7 +382,7 @@ sendDir(char *path, int basepathlen, bool sizeonly)
 			/* Add size, rounded up to 512byte block */
 			size += ((statbuf.st_size + 511) & ~511);
 			if (!sizeonly)
-				sendFile(pathbuf, basepathlen, &statbuf);
+				sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf);
 			size += 512;		/* Size of the header of the file */
 		}
 		else
@@ -418,7 +440,7 @@ _tarChecksum(char *header)
 
 /* Given the member, write the TAR header & send the file */
 static void
-sendFile(char *filename, int basepathlen, struct stat * statbuf)
+sendFile(char *readfilename, char *tarfilename, struct stat * statbuf)
 {
 	FILE	   *fp;
 	char		buf[32768];
@@ -426,11 +448,13 @@ sendFile(char *filename, int basepathlen, struct stat * statbuf)
 	pgoff_t		len = 0;
 	size_t		pad;
 
-	fp = AllocateFile(filename, "rb");
+	pg_usleep(100000);
+
+	fp = AllocateFile(readfilename, "rb");
 	if (fp == NULL)
 		ereport(ERROR,
 				(errcode(errcode_for_file_access()),
-				 errmsg("could not open file \"%s\": %m", filename)));
+				 errmsg("could not open file \"%s\": %m", readfilename)));
 
 	/*
 	 * Some compilers will throw a warning knowing this test can never be true
@@ -439,9 +463,9 @@ sendFile(char *filename, int basepathlen, struct stat * statbuf)
 	if (statbuf->st_size > MAX_TAR_MEMBER_FILELEN)
 		ereport(ERROR,
 				(errmsg("archive member \"%s\" too large for tar format",
-						filename)));
+						tarfilename)));
 
-	_tarWriteHeader(filename + basepathlen + 1, NULL, statbuf);
+	_tarWriteHeader(tarfilename, NULL, statbuf);
 
 	while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
 	{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 74d3427..fb8d715 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,8 +312,8 @@ extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(void);
 extern void WakeupRecovery(void);
 
-extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast);
-extern XLogRecPtr do_pg_stop_backup(void);
+extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, bool exclusive, char **labelfilepath);
+extern XLogRecPtr do_pg_stop_backup(const char *labelfile);
 extern void do_pg_abort_backup(void);
 
 #endif   /* XLOG_H */
#2Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#1)
Re: Allowing multiple concurrent base backups

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I implemented this in two ways, and can't decide which I like better:

1. The contents of the backup label file are returned to the caller of
do_pg_start_backup() as a palloc'd string.

2. do_pg_start_backup() creates a temporary file that the backup label
is written to (instead of "backup_label").

Implementation 1 changes more code, as pg_start/stop_backup() need to be
changed to write/read from memory instead of file, but the result isn't
any more complicated. Nevertheless, I somehow feel more comfortable with 2.

Seems like either one of these is fairly problematic in that you have to
have some monstrous kluge to get the backup_label file to appear with
the right name in the tarfile. How badly do we actually need this?
I don't think the use-case for concurrent base backups is all that large
in practice given the I/O hit it's going to involve.

regards, tom lane

#3Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#2)
Re: Allowing multiple concurrent base backups

On Tue, Jan 11, 2011 at 19:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

I implemented this in two ways, and can't decide which I like better:

1. The contents of the backup label file are returned to the caller of
do_pg_start_backup() as a palloc'd string.

2. do_pg_start_backup() creates a temporary file that the backup label
is written to (instead of "backup_label").

Implementation 1 changes more code, as pg_start/stop_backup() need to be
changed to write/read from memory instead of file, but the result isn't
any more complicated. Nevertheless, I somehow feel more comfortable with 2.

Seems like either one of these is fairly problematic in that you have to
have some monstrous kluge to get the backup_label file to appear with
the right name in the tarfile.  How badly do we actually need this?
I don't think the use-case for concurrent base backups is all that large
in practice given the I/O hit it's going to involve.

I think it can be done cleaner in the tar file injection - I've been
chatting with Heikki offlist about that. Not sure, but I have a
feeling it does.

As for the use-case, it doesn't necessarily involve a huge I/O hit if
either the entire DB is in RAM (not all that uncommon - though then
the backup finishes quickly as well), or if the *client* is slow, thus
making the server backlogged on sending the base backup.

Having it possible to do it concurrently also makes for *much* easier
use of the feature. It might be just about overlapping with a few
seconds, for example - which would probably have no major effect, but
takes a lot more code on the other end to work around.

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#4Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Tom Lane (#2)
Re: Allowing multiple concurrent base backups

On 11.01.2011 20:51, Tom Lane wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

I implemented this in two ways, and can't decide which I like better:

1. The contents of the backup label file are returned to the caller of
do_pg_start_backup() as a palloc'd string.

2. do_pg_start_backup() creates a temporary file that the backup label
is written to (instead of "backup_label").

Implementation 1 changes more code, as pg_start/stop_backup() need to be
changed to write/read from memory instead of file, but the result isn't
any more complicated. Nevertheless, I somehow feel more comfortable with 2.

Seems like either one of these is fairly problematic in that you have to
have some monstrous kluge to get the backup_label file to appear with
the right name in the tarfile.

Oh. I'm surprised you feel that way - that part didn't feel ugly or
kludgey at all to me.

How badly do we actually need this?
I don't think the use-case for concurrent base backups is all that large
in practice given the I/O hit it's going to involve.

It makes it very convenient to set up standbys, without having to worry
that you'll conflict e.g with a nightly backup. I don't imagine people
will use streaming base backups for very large databases anyway.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#5Josh Berkus
josh@agliodbs.com
In reply to: Heikki Linnakangas (#4)
Re: Allowing multiple concurrent base backups

It makes it very convenient to set up standbys, without having to worry
that you'll conflict e.g with a nightly backup. I don't imagine people
will use streaming base backups for very large databases anyway.

Also, imagine that you're provisioning a 10-node replication cluster on
EC2. This would make that worlds easier.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#3)
Re: Allowing multiple concurrent base backups

Magnus Hagander <magnus@hagander.net> writes:

On Tue, Jan 11, 2011 at 19:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Seems like either one of these is fairly problematic in that you have to
have some monstrous kluge to get the backup_label file to appear with
the right name in the tarfile. �How badly do we actually need this?
I don't think the use-case for concurrent base backups is all that large
in practice given the I/O hit it's going to involve.

I think it can be done cleaner in the tar file injection - I've been
chatting with Heikki offlist about that. Not sure, but I have a
feeling it does.

One point that I'm particularly interested to see how you'll kluge it
is ensuring that the tarball contains only the desired temp data and not
also the "real" $PGDATA/backup_label, should there be a normal base
backup being done concurrently with the streamed one.

The whole thing just seems too fragile and dangerous to be worth dealing
with given that actual usage will be a corner case. *I* sure wouldn't
trust it to work when the chips were down.

regards, tom lane

#7Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#1)
Re: Allowing multiple concurrent base backups

On 11.01.2011 20:17, Heikki Linnakangas wrote:

Patches for both approaches attached. They're also available in my
github repository at git@github.com:hlinnaka/postgres.git.

Just so people won't report the same issues again, a couple of bugs have
already cropped up (thanks Magnus):

* a backup_label file in the data directory should now not be tarred up
- we're injecting a different backup_label file there.

* pg_start_backup_callback needs to be updated to just decrement
forcePageWrites, not reset it to 'false'

* pg_stop_backup() decrements forcePageWrites twice. oops.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#8Dimitri Fontaine
dimitri@2ndQuadrant.fr
In reply to: Heikki Linnakangas (#1)
Re: Allowing multiple concurrent base backups

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

Now that we have a basic over-the-wire base backup capability in walsender,
it would be nice to allow taking multiple base backups at the same time.

I would prefer to be able to take a base backup from a standby, or is
that already possible? What about cascading the wal shipping? Maybe
those ideas are much bigger projects, but if I don't ask, I don't get to
know :)

Regards,
--
Dimitri Fontaine
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support

#9Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Dimitri Fontaine (#8)
Re: Allowing multiple concurrent base backups

On 11.01.2011 21:50, Dimitri Fontaine wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com> writes:

Now that we have a basic over-the-wire base backup capability in walsender,
it would be nice to allow taking multiple base backups at the same time.

I would prefer to be able to take a base backup from a standby, or is
that already possible? What about cascading the wal shipping? Maybe
those ideas are much bigger projects, but if I don't ask, I don't get to
know :)

Yeah, those are bigger projects..

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#10Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#1)
Re: Allowing multiple concurrent base backups

On Tue, 2011-01-11 at 20:17 +0200, Heikki Linnakangas wrote:

So, this patch modifies the internal do_pg_start/stop_backup functions
so that in addition to the traditional mode of operation, where a
backup_label file is created in the data directory where it's backed up
along with all other files, the backup label file is be returned to the
caller, and the caller is responsible for including it in the backup.
The code in replication/basebackup.c includes it in the tar file that's
streamed the client, as "backup_label".

Perhaps we can use this more intelligent form of base backup to
differentiate between:
a. a primary that has crashed while a backup was in progress; and
b. an online backup that is being restored.

Allowing the user to do an unrestricted file copy as a base backup
doesn't allow us to make that differentiation. That lead to the two bugs
that we fixed in StartupXLOG(). And right now there are still two
problems remaining (albeit less severe):

1. If it's a primary recovering from a crash, and there is a
backup_label file, and the WAL referenced in the backup_label exists,
then it does a bunch of extra work during recovery; and
2. In the same situation, if the WAL referenced in the backup_label
does not exist, then it PANICs with a HINT to tell you to remove the
backup_label.

Is this an opportunity to solve these problems and simplify the code?

Regards,
Jeff Davis

#11Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Davis (#10)
Re: Allowing multiple concurrent base backups

On 11.01.2011 22:16, Jeff Davis wrote:

On Tue, 2011-01-11 at 20:17 +0200, Heikki Linnakangas wrote:

So, this patch modifies the internal do_pg_start/stop_backup functions
so that in addition to the traditional mode of operation, where a
backup_label file is created in the data directory where it's backed up
along with all other files, the backup label file is be returned to the
caller, and the caller is responsible for including it in the backup.
The code in replication/basebackup.c includes it in the tar file that's
streamed the client, as "backup_label".

Perhaps we can use this more intelligent form of base backup to
differentiate between:
a. a primary that has crashed while a backup was in progress; and
b. an online backup that is being restored.

Allowing the user to do an unrestricted file copy as a base backup
doesn't allow us to make that differentiation. That lead to the two bugs
that we fixed in StartupXLOG(). And right now there are still two
problems remaining (albeit less severe):

1. If it's a primary recovering from a crash, and there is a
backup_label file, and the WAL referenced in the backup_label exists,
then it does a bunch of extra work during recovery; and
2. In the same situation, if the WAL referenced in the backup_label
does not exist, then it PANICs with a HINT to tell you to remove the
backup_label.

Is this an opportunity to solve these problems and simplify the code?

It won't change the situation for pg_start_backup(), but with the patch
the base backups done via streaming won't have those issues, because
backup_label is not created (with that name) in the master.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#12Jeff Davis
pgsql@j-davis.com
In reply to: Heikki Linnakangas (#11)
Re: Allowing multiple concurrent base backups

On Tue, 2011-01-11 at 22:56 +0200, Heikki Linnakangas wrote:

1. If it's a primary recovering from a crash, and there is a
backup_label file, and the WAL referenced in the backup_label exists,
then it does a bunch of extra work during recovery; and
2. In the same situation, if the WAL referenced in the backup_label
does not exist, then it PANICs with a HINT to tell you to remove the
backup_label.

Is this an opportunity to solve these problems and simplify the code?

It won't change the situation for pg_start_backup(), but with the patch
the base backups done via streaming won't have those issues, because
backup_label is not created (with that name) in the master.

Do you think we should change the backup protocol for normal base
backups to try to fix this? Or do you think that the simplicity of
unrestricted file copy is worth these problems?

We could probably make some fairly minor changes, like making a file on
the primary and telling users to exclude it from any base backup. The
danger, of course, is that they do copy it, and their backup is
compromised.

Regards,
Jeff Davis

#13Robert Haas
robertmhaas@gmail.com
In reply to: Tom Lane (#6)
Re: Allowing multiple concurrent base backups

On Jan 11, 2011, at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The whole thing just seems too fragile and dangerous to be worth dealing
with given that actual usage will be a corner case. *I* sure wouldn't
trust it to work when the chips were down.

I hope this assessment proves to be incorrect, because like Magnus and Heikki, I think this will be a major usability improvement. It takes us from "there's a way to do that" to "it just works".

...Robert

#14Magnus Hagander
magnus@hagander.net
In reply to: Jeff Davis (#12)
Re: Allowing multiple concurrent base backups

On Tue, Jan 11, 2011 at 22:51, Jeff Davis <pgsql@j-davis.com> wrote:

On Tue, 2011-01-11 at 22:56 +0200, Heikki Linnakangas wrote:

  1. If it's a primary recovering from a crash, and there is a
backup_label file, and the WAL referenced in the backup_label exists,
then it does a bunch of extra work during recovery; and
  2. In the same situation, if the WAL referenced in the backup_label
does not exist, then it PANICs with a HINT to tell you to remove the
backup_label.

Is this an opportunity to solve these problems and simplify the code?

It won't change the situation for pg_start_backup(), but with the patch
the base backups done via streaming won't have those issues, because
backup_label is not created (with that name) in the master.

Do you think we should change the backup protocol for normal base
backups to try to fix this? Or do you think that the simplicity of
unrestricted file copy is worth these problems?

We could probably make some fairly minor changes, like making a file on
the primary and telling users to exclude it from any base backup. The
danger, of course, is that they do copy it, and their backup is
compromised.

I think keeping the flexibility is important. If it does add an extra
step I think that's ok once we have pg_basebackup, but it must be
reasonably *safe*. Corrupt backups from forgetting to exclude a file
seems not so.

But if the problem is you forgot to exclude it, can't you just remove
it at a later time?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#15David Fetter
david@fetter.org
In reply to: Robert Haas (#13)
Re: Allowing multiple concurrent base backups

On Tue, Jan 11, 2011 at 05:06:34PM -0500, Robert Haas wrote:

On Jan 11, 2011, at 2:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

The whole thing just seems too fragile and dangerous to be worth dealing
with given that actual usage will be a corner case. *I* sure wouldn't
trust it to work when the chips were down.

I hope this assessment proves to be incorrect, because like Magnus and Heikki, I think this will be a major usability improvement. It takes us from "there's a way to do that" to "it just works".

(It just works)++ :)

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#16Jeff Davis
pgsql@j-davis.com
In reply to: Magnus Hagander (#14)
Re: Allowing multiple concurrent base backups

On Tue, 2011-01-11 at 23:07 +0100, Magnus Hagander wrote:

I think keeping the flexibility is important. If it does add an extra
step I think that's ok once we have pg_basebackup, but it must be
reasonably *safe*. Corrupt backups from forgetting to exclude a file
seems not so.

Agreed.

But if the problem is you forgot to exclude it, can't you just remove
it at a later time?

If you think you are recovering the primary, and it's really the backup,
then you get corruption. It's too late to remove a file after that
(unless you have a backup of your backup ;) ).

If you think you are restoring a backup, and it's really a primary that
crashed, then you run into one of the two problems that I mentioned
(which are less severe than corruption, but very annoying).

Regards,
Jeff Davis

#17Cédric Villemain
cedric.villemain.debian@gmail.com
In reply to: Heikki Linnakangas (#9)
Re: Allowing multiple concurrent base backups

2011/1/11 Heikki Linnakangas <heikki.linnakangas@enterprisedb.com>:

On 11.01.2011 21:50, Dimitri Fontaine wrote:

Heikki Linnakangas<heikki.linnakangas@enterprisedb.com>  writes:

Now that we have a basic over-the-wire base backup capability in
walsender,
it would be nice to allow taking multiple base backups at the same time.

I would prefer to be able to take a base backup from a standby, or is
that already possible?  What about cascading the wal shipping?  Maybe
those ideas are much bigger projects, but if I don't ask, I don't get to
know :)

Yeah, those are bigger projects..

Way more interesting, IMHO.
Feature to allow multiple backup at the same time looks useless to me.
If DB is small, then it is not that hard to do that before or after a
possible nightly backup.
If DB is large necessary to comment ?

The only case I find interesting is if you can use the bandwith of
more than one ethernet device, else I would use rsync between nodes
after the inital basebackup, pretty sure.

--
 Heikki Linnakangas
 EnterpriseDB   http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

--
Cédric Villemain               2ndQuadrant
http://2ndQuadrant.fr/     PostgreSQL : Expertise, Formation et Support

#18Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Jeff Davis (#12)
Re: Allowing multiple concurrent base backups

On 11.01.2011 23:51, Jeff Davis wrote:

On Tue, 2011-01-11 at 22:56 +0200, Heikki Linnakangas wrote:

1. If it's a primary recovering from a crash, and there is a
backup_label file, and the WAL referenced in the backup_label exists,
then it does a bunch of extra work during recovery; and
2. In the same situation, if the WAL referenced in the backup_label
does not exist, then it PANICs with a HINT to tell you to remove the
backup_label.

Is this an opportunity to solve these problems and simplify the code?

It won't change the situation for pg_start_backup(), but with the patch
the base backups done via streaming won't have those issues, because
backup_label is not created (with that name) in the master.

Do you think we should change the backup protocol for normal base
backups to try to fix this? Or do you think that the simplicity of
unrestricted file copy is worth these problems?

We could probably make some fairly minor changes, like making a file on
the primary and telling users to exclude it from any base backup. The
danger, of course, is that they do copy it, and their backup is
compromised.

Yeah, I don't think it's a good idea to provide such a foot-gun.

Last time we discussed this there was the idea of creating a file in
$PGDATA, and removing read-permissions from it, so that it couldn't be
accidentally included in the backup. Even with that safeguard, it
doesn't feel very safe - a backup running as root or some other special
privileges might still be able to read it.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#19marcin mank
marcin.mank@gmail.com
In reply to: Tom Lane (#6)
Re: Allowing multiple concurrent base backups

On Tue, Jan 11, 2011 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Tue, Jan 11, 2011 at 19:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Seems like either one of these is fairly problematic in that you have to
have some monstrous kluge to get the backup_label file to appear with
the right name in the tarfile.  How badly do we actually need this?
I don't think the use-case for concurrent base backups is all that large
in practice given the I/O hit it's going to involve.

I think it can be done cleaner in the tar file injection - I've been
chatting with Heikki offlist about that. Not sure, but I have a
feeling it does.

One point that I'm particularly interested to see how you'll kluge it
is ensuring that the tarball contains only the desired temp data and not
also the "real" $PGDATA/backup_label, should there be a normal base
backup being done concurrently with the streamed one.

The whole thing just seems too fragile and dangerous to be worth dealing
with given that actual usage will be a corner case.  *I* sure wouldn't
trust it to work when the chips were down.

Maybe if pg_start_backup() notices that there is another backup
running should block waiting for another session to run
pg_stop_backup() ? Or have a new function like pg_start_backup_wait()
?

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in parallell
.

Greetings
Marcin

#20David Fetter
david@fetter.org
In reply to: marcin mank (#19)
Re: Allowing multiple concurrent base backups

On Wed, Jan 12, 2011 at 10:26:05AM +0100, marcin mank wrote:

On Tue, Jan 11, 2011 at 8:07 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Magnus Hagander <magnus@hagander.net> writes:

On Tue, Jan 11, 2011 at 19:51, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Seems like either one of these is fairly problematic in that you have to
have some monstrous kluge to get the backup_label file to appear with
the right name in the tarfile. �How badly do we actually need this?
I don't think the use-case for concurrent base backups is all that large
in practice given the I/O hit it's going to involve.

I think it can be done cleaner in the tar file injection - I've been
chatting with Heikki offlist about that. Not sure, but I have a
feeling it does.

One point that I'm particularly interested to see how you'll kluge it
is ensuring that the tarball contains only the desired temp data and not
also the "real" $PGDATA/backup_label, should there be a normal base
backup being done concurrently with the streamed one.

The whole thing just seems too fragile and dangerous to be worth dealing
with given that actual usage will be a corner case. �*I* sure wouldn't
trust it to work when the chips were down.

Maybe if pg_start_backup() notices that there is another backup
running should block waiting for another session to run
pg_stop_backup() ? Or have a new function like pg_start_backup_wait()
?

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in parallell

That's not actually true. Backups at the moment are CPU-bound, and
running them in parallel is one way to make them closer to I/O-bound,
which is what they *should* be.

There are other proposals out there, and some work being done, to make
backups less dependent on CPU, among them:

- Making the on-disk representation smaller
- Making COPY more efficient

As far as I know, none of this work is public yet.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#21Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: David Fetter (#20)
Re: Allowing multiple concurrent base backups

On 12.01.2011 17:15, David Fetter wrote:

On Wed, Jan 12, 2011 at 10:26:05AM +0100, marcin mank wrote:

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in parallell

That's not actually true. Backups at the moment are CPU-bound, and
running them in parallel is one way to make them closer to I/O-bound,
which is what they *should* be.

That's a different kind of "parallel". We're talking about taking
multiple backups in parallel, each using one process, and you're talking
about taking one backup using multiple parallel processes or threads.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#22David Fetter
david@fetter.org
In reply to: Heikki Linnakangas (#21)
Re: Allowing multiple concurrent base backups

On Wed, Jan 12, 2011 at 05:17:38PM +0200, Heikki Linnakangas wrote:

On 12.01.2011 17:15, David Fetter wrote:

On Wed, Jan 12, 2011 at 10:26:05AM +0100, marcin mank wrote:

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in parallell

That's not actually true. Backups at the moment are CPU-bound, and
running them in parallel is one way to make them closer to I/O-bound,
which is what they *should* be.

That's a different kind of "parallel". We're talking about taking
multiple backups in parallel, each using one process, and you're
talking about taking one backup using multiple parallel processes or
threads.

Good point. The idea that IO/network bandwidth is always saturated by
one backup just isn't true, though. Take the case of multiple
independent NICs, e.g.

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#23Aidan Van Dyk
aidan@highrise.ca
In reply to: David Fetter (#20)
Re: Allowing multiple concurrent base backups

On Wed, Jan 12, 2011 at 10:15 AM, David Fetter <david@fetter.org> wrote:

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in parallell

That's not actually true.  Backups at the moment are CPU-bound, and
running them in parallel is one way to make them closer to I/O-bound,
which is what they *should* be.

Remember, we're talking about filesystem base backups here. If you're
CPU can't handle a stream from disk -> network, byte for byte (maybe
encrypting it), then you've spend *WAAAAY* to much on your storage
sub-system, and way to little on CPU.

I can see trying to "parallize" the base backup such that each
table-space could be run concurrently, but that's about it.

There are other proposals out there, and some work being done, to make
backups less dependent on CPU, among them:

- Making the on-disk representation smaller
- Making COPY more efficient

As far as I know, none of this work is public yet.

pg_dump is another story. But it's not related to base backups for
PIT Recovery/Replication.

a.

--
Aidan Van Dyk                                             Create like a god,
aidan@highrise.ca                                       command like a king,
http://www.highrise.ca/                                   work like a slave.

#24Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#21)
Re: Allowing multiple concurrent base backups

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 12.01.2011 17:15, David Fetter wrote:

On Wed, Jan 12, 2011 at 10:26:05AM +0100, marcin mank wrote:

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in parallell

That's not actually true. Backups at the moment are CPU-bound, and
running them in parallel is one way to make them closer to I/O-bound,
which is what they *should* be.

That's a different kind of "parallel". We're talking about taking
multiple backups in parallel, each using one process, and you're talking
about taking one backup using multiple parallel processes or threads.

Even more to the point, you're confusing pg_dump with a base backup.
The reason pg_dump eats a lot of CPU is primarily COPY's data conversion
and formatting requirements, none of which will happen in a base backup
(streaming or otherwise).

regards, tom lane

#25David Fetter
david@fetter.org
In reply to: Tom Lane (#24)
Re: Allowing multiple concurrent base backups

On Wed, Jan 12, 2011 at 10:24:31AM -0500, Tom Lane wrote:

Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:

On 12.01.2011 17:15, David Fetter wrote:

On Wed, Jan 12, 2011 at 10:26:05AM +0100, marcin mank wrote:

Considering that parallell base backups would be io-bound (or
network-bound), there is little need to actually run them in
parallell

That's not actually true. Backups at the moment are CPU-bound,
and running them in parallel is one way to make them closer to
I/O-bound, which is what they *should* be.

That's a different kind of "parallel". We're talking about taking
multiple backups in parallel, each using one process, and you're
talking about taking one backup using multiple parallel processes
or threads.

Even more to the point, you're confusing pg_dump with a base backup.
The reason pg_dump eats a lot of CPU is primarily COPY's data
conversion and formatting requirements, none of which will happen in
a base backup (streaming or otherwise).

Oops. That'll teach me to post before coffee. :P

Cheers,
David.
--
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter
Skype: davidfetter XMPP: david.fetter@gmail.com
iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate

#26Ross J. Reedstrom
reedstrm@rice.edu
In reply to: Josh Berkus (#5)
Re: Allowing multiple concurrent base backups

On Tue, Jan 11, 2011 at 11:06:18AM -0800, Josh Berkus wrote:

It makes it very convenient to set up standbys, without having to worry
that you'll conflict e.g with a nightly backup. I don't imagine people
will use streaming base backups for very large databases anyway.

Also, imagine that you're provisioning a 10-node replication cluster on
EC2. This would make that worlds easier.

Hmm, perhaps. My concern is that a naive attempt to do that is going to
have 10 base-backups happening at the same time, completely slamming the
master, and none of them completing is a reasonable time. Is this
possible, or is it that simultaneity will buy you hot caches and backup
#2 -> #10 all run faster?

Ross
--
Ross Reedstrom, Ph.D. reedstrm@rice.edu
Systems Engineer & Admin, Research Scientist phone: 713-348-6166
Connexions http://cnx.org fax: 713-348-3665
Rice University MS-375, Houston, TX 77005
GPG Key fingerprint = F023 82C8 9B0E 2CC6 0D8E F888 D3AE 810E 88F0 BEDE

#27Robert Haas
robertmhaas@gmail.com
In reply to: Ross J. Reedstrom (#26)
Re: Allowing multiple concurrent base backups

On Thu, Jan 13, 2011 at 2:19 PM, Ross J. Reedstrom <reedstrm@rice.edu> wrote:

On Tue, Jan 11, 2011 at 11:06:18AM -0800, Josh Berkus wrote:

It makes it very convenient to set up standbys, without having to worry
that you'll conflict e.g with a nightly backup. I don't imagine people
will use streaming base backups for very large databases anyway.

Also, imagine that you're provisioning a 10-node replication cluster on
EC2.  This would make that worlds easier.

Hmm, perhaps. My concern is that a naive attempt to do that is going to
have 10 base-backups happening at the same time, completely slamming the
master, and none of them completing is a reasonable time. Is this
possible, or is it that simultaneity will buy you hot caches and backup
#2 -> #10 all run faster?

That's going to depend on the situation. If the database fits in
memory, then it's just going to work. If it fits on disk, it's less
obvious whether it'll be good or bad, but an arbitrary limitation here
doesn't serve us well.

P.S. Your reply-to header is busted.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#28Josh Berkus
josh@agliodbs.com
In reply to: Robert Haas (#27)
Re: Allowing multiple concurrent base backups

On 1/13/11 12:11 PM, Robert Haas wrote:

That's going to depend on the situation. If the database fits in
memory, then it's just going to work. If it fits on disk, it's less
obvious whether it'll be good or bad, but an arbitrary limitation here
doesn't serve us well.

FWIW, if we had this feature right now in 9.0 we (PGX) would be using
it. We run into the case of DB in memory, multiple slaves fairly often
these days.

--
-- Josh Berkus
PostgreSQL Experts Inc.
http://www.pgexperts.com

#29Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Josh Berkus (#28)
1 attachment(s)
Re: Allowing multiple concurrent base backups

On 13.01.2011 22:57, Josh Berkus wrote:

On 1/13/11 12:11 PM, Robert Haas wrote:

That's going to depend on the situation. If the database fits in
memory, then it's just going to work. If it fits on disk, it's less
obvious whether it'll be good or bad, but an arbitrary limitation here
doesn't serve us well.

FWIW, if we had this feature right now in 9.0 we (PGX) would be using
it. We run into the case of DB in memory, multiple slaves fairly often
these days.

Anyway, here's an updated patch with all the known issues fixed.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

multiple_inprogress_backups1-2.patchtext/x-diff; name=multiple_inprogress_backups1-2.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5b6a230..400e12e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -60,8 +60,6 @@
 
 
 /* File path names (all relative to $PGDATA) */
-#define BACKUP_LABEL_FILE		"backup_label"
-#define BACKUP_LABEL_OLD		"backup_label.old"
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
 #define RECOVERY_COMMAND_DONE	"recovery.done"
 
@@ -338,7 +336,8 @@ typedef struct XLogCtlInsert
 	XLogPageHeader currpage;	/* points to header of block in cache */
 	char	   *currpos;		/* current insertion point in cache */
 	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
-	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
+	int			forcePageWrites;	/* forcing full-page writes for PITR? */
+	bool		exclusiveBackup;	/* a backup was started with pg_start_backup() */
 } XLogCtlInsert;
 
 /*
@@ -8313,16 +8312,38 @@ pg_start_backup(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
-	startpoint = do_pg_start_backup(backupidstr, fast);
+	startpoint = do_pg_start_backup(backupidstr, fast, NULL);
 
 	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
 			 startpoint.xlogid, startpoint.xrecoff);
 	PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
 }
 
+/*
+ * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
+ * function. It creates the necessary starting checkpoint and constructs the
+ * backup label file.
+ * 
+ * There are two kind of backups: exclusive and non-exclusive. An exclusive
+ * backup is started with pg_start_backup(), and there can be only one active
+ * at a time. The backup label file of an exclusive backup is written to
+ * $PGDATA/backup_label, and it is removed by pg_stop_backup().
+ *
+ * A non-exclusive backup is used for the streaming base backups (see
+ * src/backend/replication/basebackup.c). The difference to exclusive backups
+ * is that the backup label file is not written to disk. Instead, its would-be
+ * contents are returned in *labelfile, and the caller is responsible for
+ * including it in the backup archive as 'backup_label'. There can be many
+ * non-exclusive backups active at the same time, and they don't conflict
+ * with exclusive backups either.
+ *
+ * Every successfully started non-exclusive backup must be stopped by calling
+ * do_pg_stop_backup() or do_pg_abort_backup().
+ */
 XLogRecPtr
-do_pg_start_backup(const char *backupidstr, bool fast)
+do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 {
+	bool		exclusive = (labelfile == NULL);
 	XLogRecPtr	checkpointloc;
 	XLogRecPtr	startpoint;
 	pg_time_t	stamp_time;
@@ -8332,6 +8353,7 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	uint32		_logSeg;
 	struct stat stat_buf;
 	FILE	   *fp;
+	StringInfoData labelfbuf;
 
 	if (!superuser() && !is_authenticated_user_replication_role())
 		ereport(ERROR,
@@ -8350,6 +8372,12 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 			  errmsg("WAL level not sufficient for making an online backup"),
 				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
 
+	if (strlen(backupidstr) > MAXPGPATH)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("backup label too long (max %d bytes)",
+						MAXPGPATH)));
+
 	/*
 	 * Mark backup active in shared memory.  We must do full-page WAL writes
 	 * during an on-line backup even if not doing so at other times, because
@@ -8368,15 +8396,19 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	 * ensure adequate interlocking against XLogInsert().
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	if (XLogCtl->Insert.forcePageWrites)
+	if (exclusive)
 	{
-		LWLockRelease(WALInsertLock);
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("a backup is already in progress"),
-				 errhint("Run pg_stop_backup() and try again.")));
+		if (XLogCtl->Insert.exclusiveBackup)
+		{
+			LWLockRelease(WALInsertLock);
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("a backup is already in progress"),
+					 errhint("Run pg_stop_backup() and try again.")));
+		}
+		XLogCtl->Insert.exclusiveBackup = true;
 	}
-	XLogCtl->Insert.forcePageWrites = true;
+	XLogCtl->Insert.forcePageWrites++;
 	LWLockRelease(WALInsertLock);
 
 	/*
@@ -8393,7 +8425,7 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 	RequestXLogSwitch();
 
 	/* Ensure we release forcePageWrites if fail below */
-	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
+	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 	{
 		/*
 		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
@@ -8420,54 +8452,67 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 		XLByteToSeg(startpoint, _logId, _logSeg);
 		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
 
+		/*
+		 * Construct backup label file 
+		 */
+		initStringInfo(&labelfbuf);
+
 		/* Use the log timezone here, not the session timezone */
 		stamp_time = (pg_time_t) time(NULL);
 		pg_strftime(strfbuf, sizeof(strfbuf),
 					"%Y-%m-%d %H:%M:%S %Z",
 					pg_localtime(&stamp_time, log_timezone));
+		appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
+						 startpoint.xlogid, startpoint.xrecoff, xlogfilename);
+		appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
+						 checkpointloc.xlogid, checkpointloc.xrecoff);
+		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
+		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
 
 		/*
-		 * Check for existing backup label --- implies a backup is already
-		 * running.  (XXX given that we checked forcePageWrites above, maybe
-		 * it would be OK to just unlink any such label file?)
+		 * Okay, write the file, or return its contents to caller.
 		 */
-		if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
+		if (exclusive)
 		{
-			if (errno != ENOENT)
+			/*
+			 * Check for existing backup label --- implies a backup is already
+			 * running.  (XXX given that we checked exclusiveBackup above, maybe
+			 * it would be OK to just unlink any such label file?)
+			 */
+			if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
+			{
+				if (errno != ENOENT)
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not stat file \"%s\": %m",
+									BACKUP_LABEL_FILE)));
+			}
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+						 errmsg("a backup is already in progress"),
+						 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
+								 BACKUP_LABEL_FILE)));
+
+			fp = AllocateFile(BACKUP_LABEL_FILE, "w");
+
+			if (!fp)
 				ereport(ERROR,
 						(errcode_for_file_access(),
-						 errmsg("could not stat file \"%s\": %m",
+						 errmsg("could not create file \"%s\": %m",
+								BACKUP_LABEL_FILE)));
+			fwrite(labelfbuf.data, labelfbuf.len, 1, fp);
+			if (fflush(fp) || ferror(fp) || FreeFile(fp))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write file \"%s\": %m",
 								BACKUP_LABEL_FILE)));
+			pfree(labelfbuf.data);
 		}
 		else
-			ereport(ERROR,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("a backup is already in progress"),
-					 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
-							 BACKUP_LABEL_FILE)));
-
-		/*
-		 * Okay, write the file
-		 */
-		fp = AllocateFile(BACKUP_LABEL_FILE, "w");
-		if (!fp)
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not create file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
-		fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
-				startpoint.xlogid, startpoint.xrecoff, xlogfilename);
-		fprintf(fp, "CHECKPOINT LOCATION: %X/%X\n",
-				checkpointloc.xlogid, checkpointloc.xrecoff);
-		fprintf(fp, "START TIME: %s\n", strfbuf);
-		fprintf(fp, "LABEL: %s\n", backupidstr);
-		if (fflush(fp) || ferror(fp) || FreeFile(fp))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
+			*labelfile = labelfbuf.data;
 	}
-	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
+	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
@@ -8479,9 +8524,16 @@ do_pg_start_backup(const char *backupidstr, bool fast)
 static void
 pg_start_backup_callback(int code, Datum arg)
 {
-	/* Turn off forcePageWrites on failure */
+	bool exclusive = DatumGetBool(arg);
+
+	/* Decrement forcePageWrites on failure */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
+	if (exclusive)
+	{
+		Assert(XLogCtl->Insert.exclusiveBackup);
+		XLogCtl->Insert.exclusiveBackup = false;
+	}
+	XLogCtl->Insert.forcePageWrites--;
 	LWLockRelease(WALInsertLock);
 }
 
@@ -8504,16 +8556,24 @@ pg_stop_backup(PG_FUNCTION_ARGS)
 	XLogRecPtr	stoppoint;
 	char		stopxlogstr[MAXFNAMELEN];
 
-	stoppoint = do_pg_stop_backup();
+	stoppoint = do_pg_stop_backup(NULL);
 
 	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
 			 stoppoint.xlogid, stoppoint.xrecoff);
 	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
 }
 
+/*
+ * do_pg_start_backup is the workhorse of the user-visible pg_stop_backup()
+ * function.
+
+ * If labelfile is NULL, this stops an exclusive backup. Otherwise this stops
+ * the non-exclusive backup specified by 'labelfile'.
+ */
 XLogRecPtr
-do_pg_stop_backup(void)
+do_pg_stop_backup(char *labelfile)
 {
+	bool		exclusive = (labelfile == NULL);
 	XLogRecPtr	startpoint;
 	XLogRecPtr	stoppoint;
 	XLogRecData rdata;
@@ -8529,10 +8589,10 @@ do_pg_stop_backup(void)
 	FILE	   *lfp;
 	FILE	   *fp;
 	char		ch;
-	int			ich;
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	char	   *remaining;
 
 	if (!superuser() && !is_authenticated_user_replication_role())
 		ereport(ERROR,
@@ -8552,38 +8612,88 @@ do_pg_stop_backup(void)
 				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
 
 	/*
-	 * OK to clear forcePageWrites
+	 * OK to decrement forcePageWrites
 	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
+	if (exclusive)
+	{
+		if (XLogCtl->Insert.exclusiveBackup)
+		{
+			XLogCtl->Insert.forcePageWrites--;
+			XLogCtl->Insert.exclusiveBackup = false;
+		}
+	}
+	else
+	{
+		/*
+		 * The user-visible pg_start/stop_backup() functions that operate on
+		 * exclusive backups can be called at any time, but for non-exclusive
+		 * backups, it is expected that each do_pg_start_backup() call is
+		 * matched by exactly one do_pg_stop_backup() call.
+		 */
+		Assert(XLogCtl->Insert.forcePageWrites > (XLogCtl->Insert.exclusiveBackup ? 1 : 0));
+		XLogCtl->Insert.forcePageWrites--;
+	}
 	LWLockRelease(WALInsertLock);
 
-	/*
-	 * Open the existing label file
-	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
-	if (!lfp)
+	if (exclusive)
 	{
-		if (errno != ENOENT)
+		/*
+		 * Read the existing label file into memory.
+		 */
+		struct	stat statbuf;
+		int		r;
+
+		if (stat(BACKUP_LABEL_FILE, &statbuf))
+		{
+			if (errno != ENOENT)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not stat file \"%s\": %m",
+								BACKUP_LABEL_FILE)));
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("a backup is not in progress")));
+		}
+
+		lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
+		if (!lfp)
+		{
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not read file \"%s\": %m",
 							BACKUP_LABEL_FILE)));
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("a backup is not in progress")));
+		}
+		labelfile = palloc(statbuf.st_size + 1);
+		r = fread(labelfile, statbuf.st_size, 1, lfp);
+		labelfile[statbuf.st_size] = '\0';
+
+		/*
+		 * Close and remove the backup label file
+		 */
+		if (r != 1 || ferror(lfp) || FreeFile(lfp))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read file \"%s\": %m",
+							BACKUP_LABEL_FILE)));
+		if (unlink(BACKUP_LABEL_FILE) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not remove file \"%s\": %m",
+							BACKUP_LABEL_FILE)));
 	}
 
 	/*
 	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
 	 * but we are not expecting any variability in the file format).
 	 */
-	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %24s)%c",
+	if (sscanf(labelfile, "START WAL LOCATION: %X/%X (file %24s)%c",
 			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
 			   &ch) != 4 || ch != '\n')
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+	remaining = strchr(labelfile, '\n') + 1; /* %n is not portable enough */
 
 	/*
 	 * Write the backup-end xlog record
@@ -8626,8 +8736,7 @@ do_pg_stop_backup(void)
 	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
 			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
 	/* transfer remaining lines from label to history file */
-	while ((ich = fgetc(lfp)) != EOF)
-		fputc(ich, fp);
+	fprintf(fp, "%s", remaining);
 	fprintf(fp, "STOP TIME: %s\n", strfbuf);
 	if (fflush(fp) || ferror(fp) || FreeFile(fp))
 		ereport(ERROR,
@@ -8636,20 +8745,6 @@ do_pg_stop_backup(void)
 						histfilepath)));
 
 	/*
-	 * Close and remove the backup label file
-	 */
-	if (ferror(lfp) || FreeFile(lfp))
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-	if (unlink(BACKUP_LABEL_FILE) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-
-	/*
 	 * Clean out any no-longer-needed history files.  As a side effect, this
 	 * will post a .ready file for the newly created history file, notifying
 	 * the archiver that history file may be archived immediately.
@@ -8730,28 +8825,21 @@ do_pg_stop_backup(void)
 /*
  * do_pg_abort_backup: abort a running backup
  *
- * This does just the most basic steps of pg_stop_backup(), by taking the
+ * This does just the most basic steps of do_pg_stop_backup(), by taking the
  * system out of backup mode, thus making it a lot more safe to call from
  * an error handler.
+ *
+ * NB: This is only for aborting a non-exclusive backup that doesn't write
+ * backup_label. A backup started with pg_stop_backup() needs to be finished
+ * with pg_stop_backup().
  */
 void
 do_pg_abort_backup(void)
 {
-	/*
-	 * OK to clear forcePageWrites
-	 */
 	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
-	XLogCtl->Insert.forcePageWrites = false;
+	Assert(XLogCtl->Insert.forcePageWrites > (XLogCtl->Insert.exclusiveBackup ? 1 : 0));
+	XLogCtl->Insert.forcePageWrites--;
 	LWLockRelease(WALInsertLock);
-
-	/*
-	 * Remove backup label file
-	 */
-	if (unlink(BACKUP_LABEL_FILE) != 0)
-		ereport(ERROR,
-				(errcode_for_file_access(),
-				 errmsg("could not remove file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
 }
 
 /*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 2a74c5f..1a5090a 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -32,12 +32,13 @@
 #include "utils/ps_status.h"
 
 static int64 sendDir(char *path, int basepathlen, bool sizeonly);
-static void sendFile(char *path, int basepathlen, struct stat * statbuf);
+static void sendFile(char *readfilename, char *tarfilename,
+		 struct stat * statbuf);
 static void _tarWriteHeader(char *filename, char *linktarget,
 				struct stat * statbuf);
 static void send_int8_string(StringInfoData *buf, int64 intval);
 static void SendBackupHeader(List *tablespaces);
-static void SendBackupDirectory(char *location, char *spcoid);
+static void SendBackupDirectory(char *location, char *labelfile);
 static void base_backup_cleanup(int code, Datum arg);
 
 typedef struct
@@ -67,7 +68,9 @@ base_backup_cleanup(int code, Datum arg)
 static void
 perform_base_backup(const char *backup_label, List *tablespaces)
 {
-	do_pg_start_backup(backup_label, true);
+	char *labelfile;
+
+	do_pg_start_backup(backup_label, true, &labelfile);
 
 	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 	{
@@ -81,12 +84,12 @@ perform_base_backup(const char *backup_label, List *tablespaces)
 		{
 			tablespaceinfo *ti = (tablespaceinfo *) lfirst(lc);
 
-			SendBackupDirectory(ti->path, ti->oid);
+			SendBackupDirectory(ti->path, labelfile);
 		}
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
-	do_pg_stop_backup();
+	do_pg_stop_backup(labelfile);
 }
 
 /*
@@ -261,8 +264,13 @@ SendBackupHeader(List *tablespaces)
 	pq_puttextmessage('C', "SELECT");
 }
 
+/*
+ * Send the contents of 'location' to the client as a tar file, via CopyOut.
+ * If 'labelfile' is not NULL, an extra 'backup_label' file is included in
+ * the tar, and labelfile string is written as its contents.
+ */
 static void
-SendBackupDirectory(char *location, char *spcoid)
+SendBackupDirectory(char *location, char *labelfile)
 {
 	StringInfoData buf;
 
@@ -272,6 +280,48 @@ SendBackupDirectory(char *location, char *spcoid)
 	pq_sendint(&buf, 0, 2);		/* natts */
 	pq_endmessage(&buf);
 
+	/*
+	 * In the main tar, include the backup_label first.
+	 */
+	if (labelfile != NULL)
+	{
+		struct stat statbuf;
+		int pad, len;
+
+		len = strlen(labelfile);
+
+		/*
+		 * Construct a stat struct for the backup_label file we're injecting
+		 * in the tar.
+		 */
+		/* Windows doesn't have the concept of uid and gid */
+#ifdef WIN32
+		statbuf.st_uid = 0;
+		statbuf.st_gid = 0;
+#else
+		statbuf.st_uid = geteuid();
+		statbuf.st_gid = getegid();
+#endif
+		statbuf.st_mtime = time(NULL);
+		statbuf.st_mode = S_IRUSR | S_IWUSR;
+		statbuf.st_size = len;
+
+		_tarWriteHeader(BACKUP_LABEL_FILE, NULL, &statbuf);
+		/* Send the contents as a CopyData message */
+		pq_putmessage('d', labelfile, len);
+
+		/* Pad to 512 byte boundary, per tar format requirements */
+		pad = ((len + 511) & ~511) - len;
+		if (pad > 0)
+		{
+			char buf[512];
+			MemSet(buf, 0, pad);
+			pq_putmessage('d', buf, pad);
+		}
+
+		location = ".";
+	}
+
 	/* tar up the data directory if NULL, otherwise the tablespace */
 	sendDir(location == NULL ? "." : location,
 			location == NULL ? 1 : strlen(location),
@@ -304,6 +354,14 @@ sendDir(char *path, int basepathlen, bool sizeonly)
 					strlen(PG_TEMP_FILE_PREFIX)) == 0)
 			continue;
 
+		/*
+		 * If there's a backup_label file, it belongs to a backup started by
+		 * the user with pg_start_backup(). It is *not* correct for this
+		 * backup, our backup_label is injected into the tar separately.
+		 */
+		if (strcmp(de->d_name, BACKUP_LABEL_FILE) == 0)
+			continue;
+
 		snprintf(pathbuf, MAXPGPATH, "%s/%s", path, de->d_name);
 
 		/* Skip postmaster.pid in the data directory */
@@ -372,7 +430,7 @@ sendDir(char *path, int basepathlen, bool sizeonly)
 			/* Add size, rounded up to 512byte block */
 			size += ((statbuf.st_size + 511) & ~511);
 			if (!sizeonly)
-				sendFile(pathbuf, basepathlen, &statbuf);
+				sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf);
 			size += 512;		/* Size of the header of the file */
 		}
 		else
@@ -430,7 +488,7 @@ _tarChecksum(char *header)
 
 /* Given the member, write the TAR header & send the file */
 static void
-sendFile(char *filename, int basepathlen, struct stat * statbuf)
+sendFile(char *readfilename, char *tarfilename, struct stat * statbuf)
 {
 	FILE	   *fp;
 	char		buf[32768];
@@ -438,11 +496,11 @@ sendFile(char *filename, int basepathlen, struct stat * statbuf)
 	pgoff_t		len = 0;
 	size_t		pad;
 
-	fp = AllocateFile(filename, "rb");
+	fp = AllocateFile(readfilename, "rb");
 	if (fp == NULL)
 		ereport(ERROR,
 				(errcode(errcode_for_file_access()),
-				 errmsg("could not open file \"%s\": %m", filename)));
+				 errmsg("could not open file \"%s\": %m", readfilename)));
 
 	/*
 	 * Some compilers will throw a warning knowing this test can never be true
@@ -451,9 +509,9 @@ sendFile(char *filename, int basepathlen, struct stat * statbuf)
 	if (statbuf->st_size > MAX_TAR_MEMBER_FILELEN)
 		ereport(ERROR,
 				(errmsg("archive member \"%s\" too large for tar format",
-						filename)));
+						tarfilename)));
 
-	_tarWriteHeader(filename + basepathlen + 1, NULL, statbuf);
+	_tarWriteHeader(tarfilename, NULL, statbuf);
 
 	while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
 	{
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 74d3427..122e96b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -312,8 +312,15 @@ extern void HandleStartupProcInterrupts(void);
 extern void StartupProcessMain(void);
 extern void WakeupRecovery(void);
 
-extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast);
-extern XLogRecPtr do_pg_stop_backup(void);
+/*
+ * Starting/stopping a base backup
+ */
+extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile);
+extern XLogRecPtr do_pg_stop_backup(char *labelfile);
 extern void do_pg_abort_backup(void);
 
+/* File path names (all relative to $PGDATA) */
+#define BACKUP_LABEL_FILE		"backup_label"
+#define BACKUP_LABEL_OLD		"backup_label.old"
+
 #endif   /* XLOG_H */
#30Magnus Hagander
magnus@hagander.net
In reply to: Heikki Linnakangas (#29)
Re: Allowing multiple concurrent base backups

On Thu, Jan 13, 2011 at 22:32, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.01.2011 22:57, Josh Berkus wrote:

On 1/13/11 12:11 PM, Robert Haas wrote:

That's going to depend on the situation.  If the database fits in
memory, then it's just going to work.  If it fits on disk, it's less
obvious whether it'll be good or bad, but an arbitrary limitation here
doesn't serve us well.

FWIW, if we had this feature right now in 9.0 we (PGX) would be using
it.  We run into the case of DB in memory, multiple slaves fairly often
these days.

Anyway, here's an updated patch with all the known issues fixed.

It's kind of crude that do_pg_stop_backup() has to parse the
backuplabel with sscanf() when it's coming from a non-exclusive backup
- we could pass the location as a parameter instead, to make it
cleaner. There might be a point to it being the same in both
codepaths, but I'm not sure it's that important...

Doesn't this code put the backup label in *every* tarfile, and not
just the main one? From what I can tell the code in
SendBackupDirectory() relies on labelfile being NULL in tar files for
"other tablespaces", but the caller never sets this.

Finally, I'd move the addition of the labelfile to it's own function,
but that's just me ;)

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#31Simon Riggs
simon@2ndQuadrant.com
In reply to: Heikki Linnakangas (#29)
Re: Allowing multiple concurrent base backups

On Thu, 2011-01-13 at 23:32 +0200, Heikki Linnakangas wrote:

On 13.01.2011 22:57, Josh Berkus wrote:

On 1/13/11 12:11 PM, Robert Haas wrote:

That's going to depend on the situation. If the database fits in
memory, then it's just going to work. If it fits on disk, it's less
obvious whether it'll be good or bad, but an arbitrary limitation here
doesn't serve us well.

FWIW, if we had this feature right now in 9.0 we (PGX) would be using
it. We run into the case of DB in memory, multiple slaves fairly often
these days.

Anyway, here's an updated patch with all the known issues fixed.

It's good we have this as an option and I like the way you've done this.

--
Simon Riggs http://www.2ndQuadrant.com/books/
PostgreSQL Development, 24x7 Support, Training and Services

#32Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#29)
1 attachment(s)
Re: Allowing multiple concurrent base backups

On 13.01.2011 23:32, Heikki Linnakangas wrote:

Anyway, here's an updated patch with all the known issues fixed.

Another updated patch. Fixed bitrot, and addressed the privilege issue
Fujii-san raised here:
http://archives.postgresql.org/message-id/4D380560.3040400@enterprisedb.com.
I changed the privilege checks so that pg_start/stop_backup() functions
require superuser privileges again, but not for a base backup via the
replication protocol (replication privilege is needed to establish a
replication connection to begin with).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

multiple_inprogress_backups1-3.patchtext/x-diff; name=multiple_inprogress_backups1-3.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 60,67 ****
  
  
  /* File path names (all relative to $PGDATA) */
- #define BACKUP_LABEL_FILE		"backup_label"
- #define BACKUP_LABEL_OLD		"backup_label.old"
  #define RECOVERY_COMMAND_FILE	"recovery.conf"
  #define RECOVERY_COMMAND_DONE	"recovery.done"
  
--- 60,65 ----
***************
*** 338,344 **** typedef struct XLogCtlInsert
  	XLogPageHeader currpage;	/* points to header of block in cache */
  	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
! 	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
  } XLogCtlInsert;
  
  /*
--- 336,343 ----
  	XLogPageHeader currpage;	/* points to header of block in cache */
  	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
! 	int			forcePageWrites;	/* forcing full-page writes for PITR? */
! 	bool		exclusiveBackup;	/* a backup was started with pg_start_backup() */
  } XLogCtlInsert;
  
  /*
***************
*** 8352,8367 **** pg_start_backup(PG_FUNCTION_ARGS)
  
  	backupidstr = text_to_cstring(backupid);
  
! 	startpoint = do_pg_start_backup(backupidstr, fast);
  
  	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
  }
  
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast)
  {
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
--- 8351,8388 ----
  
  	backupidstr = text_to_cstring(backupid);
  
! 	startpoint = do_pg_start_backup(backupidstr, fast, NULL);
  
  	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
  }
  
+ /*
+  * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
+  * function. It creates the necessary starting checkpoint and constructs the
+  * backup label file.
+  * 
+  * There are two kind of backups: exclusive and non-exclusive. An exclusive
+  * backup is started with pg_start_backup(), and there can be only one active
+  * at a time. The backup label file of an exclusive backup is written to
+  * $PGDATA/backup_label, and it is removed by pg_stop_backup().
+  *
+  * A non-exclusive backup is used for the streaming base backups (see
+  * src/backend/replication/basebackup.c). The difference to exclusive backups
+  * is that the backup label file is not written to disk. Instead, its would-be
+  * contents are returned in *labelfile, and the caller is responsible for
+  * including it in the backup archive as 'backup_label'. There can be many
+  * non-exclusive backups active at the same time, and they don't conflict
+  * with an exclusive backup either.
+  *
+  * Every successfully started non-exclusive backup must be stopped by calling
+  * do_pg_stop_backup() or do_pg_abort_backup().
+  */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8371,8381 **** do_pg_start_backup(const char *backupidstr, bool fast)
  	uint32		_logSeg;
  	struct stat stat_buf;
  	FILE	   *fp;
  
! 	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 errmsg("must be superuser or replication role to run a backup")));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
--- 8392,8403 ----
  	uint32		_logSeg;
  	struct stat stat_buf;
  	FILE	   *fp;
+ 	StringInfoData labelfbuf;
  
! 	if (exclusive && !superuser())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 errmsg("must be superuser to run a backup")));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
***************
*** 8389,8394 **** do_pg_start_backup(const char *backupidstr, bool fast)
--- 8411,8422 ----
  			  errmsg("WAL level not sufficient for making an online backup"),
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
+ 	if (strlen(backupidstr) > MAXPGPATH)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("backup label too long (max %d bytes)",
+ 						MAXPGPATH)));
+ 
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
  	 * during an on-line backup even if not doing so at other times, because
***************
*** 8407,8421 **** do_pg_start_backup(const char *backupidstr, bool fast)
  	 * ensure adequate interlocking against XLogInsert().
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (XLogCtl->Insert.forcePageWrites)
  	{
! 		LWLockRelease(WALInsertLock);
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("a backup is already in progress"),
! 				 errhint("Run pg_stop_backup() and try again.")));
  	}
! 	XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
  	/*
--- 8435,8453 ----
  	 * ensure adequate interlocking against XLogInsert().
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
! 		{
! 			LWLockRelease(WALInsertLock);
! 			ereport(ERROR,
! 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 					 errmsg("a backup is already in progress"),
! 					 errhint("Run pg_stop_backup() and try again.")));
! 		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
! 	XLogCtl->Insert.forcePageWrites++;
  	LWLockRelease(WALInsertLock);
  
  	/*
***************
*** 8432,8438 **** do_pg_start_backup(const char *backupidstr, bool fast)
  	RequestXLogSwitch();
  
  	/* Ensure we release forcePageWrites if fail below */
! 	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
  	{
  		/*
  		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
--- 8464,8470 ----
  	RequestXLogSwitch();
  
  	/* Ensure we release forcePageWrites if fail below */
! 	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
  	{
  		/*
  		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
***************
*** 8459,8512 **** do_pg_start_backup(const char *backupidstr, bool fast)
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
  		/* Use the log timezone here, not the session timezone */
  		stamp_time = (pg_time_t) time(NULL);
  		pg_strftime(strfbuf, sizeof(strfbuf),
  					"%Y-%m-%d %H:%M:%S %Z",
  					pg_localtime(&stamp_time, log_timezone));
  
  		/*
! 		 * Check for existing backup label --- implies a backup is already
! 		 * running.  (XXX given that we checked forcePageWrites above, maybe
! 		 * it would be OK to just unlink any such label file?)
  		 */
! 		if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
  		{
! 			if (errno != ENOENT)
  				ereport(ERROR,
  						(errcode_for_file_access(),
! 						 errmsg("could not stat file \"%s\": %m",
  								BACKUP_LABEL_FILE)));
  		}
  		else
! 			ereport(ERROR,
! 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 					 errmsg("a backup is already in progress"),
! 					 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
! 							 BACKUP_LABEL_FILE)));
! 
! 		/*
! 		 * Okay, write the file
! 		 */
! 		fp = AllocateFile(BACKUP_LABEL_FILE, "w");
! 		if (!fp)
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not create file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
! 		fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! 				startpoint.xlogid, startpoint.xrecoff, xlogfilename);
! 		fprintf(fp, "CHECKPOINT LOCATION: %X/%X\n",
! 				checkpointloc.xlogid, checkpointloc.xrecoff);
! 		fprintf(fp, "START TIME: %s\n", strfbuf);
! 		fprintf(fp, "LABEL: %s\n", backupidstr);
! 		if (fflush(fp) || ferror(fp) || FreeFile(fp))
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not write file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
  	}
! 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
  
  	/*
  	 * We're done.  As a convenience, return the starting WAL location.
--- 8491,8557 ----
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
+ 		/*
+ 		 * Construct backup label file 
+ 		 */
+ 		initStringInfo(&labelfbuf);
+ 
  		/* Use the log timezone here, not the session timezone */
  		stamp_time = (pg_time_t) time(NULL);
  		pg_strftime(strfbuf, sizeof(strfbuf),
  					"%Y-%m-%d %H:%M:%S %Z",
  					pg_localtime(&stamp_time, log_timezone));
+ 		appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
+ 						 startpoint.xlogid, startpoint.xrecoff, xlogfilename);
+ 		appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
+ 						 checkpointloc.xlogid, checkpointloc.xrecoff);
+ 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
+ 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
  		/*
! 		 * Okay, write the file, or return its contents to caller.
  		 */
! 		if (exclusive)
  		{
! 			/*
! 			 * Check for existing backup label --- implies a backup is already
! 			 * running.  (XXX given that we checked exclusiveBackup above, maybe
! 			 * it would be OK to just unlink any such label file?)
! 			 */
! 			if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
! 			{
! 				if (errno != ENOENT)
! 					ereport(ERROR,
! 							(errcode_for_file_access(),
! 							 errmsg("could not stat file \"%s\": %m",
! 									BACKUP_LABEL_FILE)));
! 			}
! 			else
! 				ereport(ERROR,
! 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 						 errmsg("a backup is already in progress"),
! 						 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
! 								 BACKUP_LABEL_FILE)));
! 
! 			fp = AllocateFile(BACKUP_LABEL_FILE, "w");
! 
! 			if (!fp)
  				ereport(ERROR,
  						(errcode_for_file_access(),
! 						 errmsg("could not create file \"%s\": %m",
! 								BACKUP_LABEL_FILE)));
! 			fwrite(labelfbuf.data, labelfbuf.len, 1, fp);
! 			if (fflush(fp) || ferror(fp) || FreeFile(fp))
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not write file \"%s\": %m",
  								BACKUP_LABEL_FILE)));
+ 			pfree(labelfbuf.data);
  		}
  		else
! 			*labelfile = labelfbuf.data;
  	}
! 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
  
  	/*
  	 * We're done.  As a convenience, return the starting WAL location.
***************
*** 8518,8526 **** do_pg_start_backup(const char *backupidstr, bool fast)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
! 	/* Turn off forcePageWrites on failure */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
  }
  
--- 8563,8578 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
! 	bool exclusive = DatumGetBool(arg);
! 
! 	/* Decrement forcePageWrites on failure */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (exclusive)
! 	{
! 		Assert(XLogCtl->Insert.exclusiveBackup);
! 		XLogCtl->Insert.exclusiveBackup = false;
! 	}
! 	XLogCtl->Insert.forcePageWrites--;
  	LWLockRelease(WALInsertLock);
  }
  
***************
*** 8543,8558 **** pg_stop_backup(PG_FUNCTION_ARGS)
  	XLogRecPtr	stoppoint;
  	char		stopxlogstr[MAXFNAMELEN];
  
! 	stoppoint = do_pg_stop_backup();
  
  	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
  			 stoppoint.xlogid, stoppoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
  }
  
  XLogRecPtr
! do_pg_stop_backup(void)
  {
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
--- 8595,8618 ----
  	XLogRecPtr	stoppoint;
  	char		stopxlogstr[MAXFNAMELEN];
  
! 	stoppoint = do_pg_stop_backup(NULL);
  
  	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
  			 stoppoint.xlogid, stoppoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
  }
  
+ /*
+  * do_pg_start_backup is the workhorse of the user-visible pg_stop_backup()
+  * function.
+ 
+  * If labelfile is NULL, this stops an exclusive backup. Otherwise this stops
+  * the non-exclusive backup specified by 'labelfile'.
+  */
  XLogRecPtr
! do_pg_stop_backup(char *labelfile)
  {
+ 	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 8568,8582 **** do_pg_stop_backup(void)
  	FILE	   *lfp;
  	FILE	   *fp;
  	char		ch;
- 	int			ich;
  	int			seconds_before_warning;
  	int			waits = 0;
  	bool		reported_waiting = false;
  
! 	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 (errmsg("must be superuser or replication role to run a backup"))));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
--- 8628,8642 ----
  	FILE	   *lfp;
  	FILE	   *fp;
  	char		ch;
  	int			seconds_before_warning;
  	int			waits = 0;
  	bool		reported_waiting = false;
+ 	char	   *remaining;
  
! 	if (exclusive && !superuser())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 errmsg("must be superuser to run a backup")));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
***************
*** 8591,8628 **** do_pg_stop_backup(void)
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
! 	 * OK to clear forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
  
! 	/*
! 	 * Open the existing label file
! 	 */
! 	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
! 	if (!lfp)
  	{
! 		if (errno != ENOENT)
  			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not read file \"%s\": %m",
  							BACKUP_LABEL_FILE)));
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("a backup is not in progress")));
  	}
  
  	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
! 	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %24s)%c",
  			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
  			   &ch) != 4 || ch != '\n')
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  
  	/*
  	 * Write the backup-end xlog record
--- 8651,8738 ----
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
! 	 * OK to decrement forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (exclusive)
! 	{
! 		if (XLogCtl->Insert.exclusiveBackup)
! 		{
! 			XLogCtl->Insert.forcePageWrites--;
! 			XLogCtl->Insert.exclusiveBackup = false;
! 		}
! 	}
! 	else
! 	{
! 		/*
! 		 * The user-visible pg_start/stop_backup() functions that operate on
! 		 * exclusive backups can be called at any time, but for non-exclusive
! 		 * backups, it is expected that each do_pg_start_backup() call is
! 		 * matched by exactly one do_pg_stop_backup() call.
! 		 */
! 		Assert(XLogCtl->Insert.forcePageWrites > (XLogCtl->Insert.exclusiveBackup ? 1 : 0));
! 		XLogCtl->Insert.forcePageWrites--;
! 	}
  	LWLockRelease(WALInsertLock);
  
! 	if (exclusive)
  	{
! 		/*
! 		 * Read the existing label file into memory.
! 		 */
! 		struct	stat statbuf;
! 		int		r;
! 
! 		if (stat(BACKUP_LABEL_FILE, &statbuf))
! 		{
! 			if (errno != ENOENT)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not stat file \"%s\": %m",
! 								BACKUP_LABEL_FILE)));
! 			ereport(ERROR,
! 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 					 errmsg("a backup is not in progress")));
! 		}
! 
! 		lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
! 		if (!lfp)
! 		{
  			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not read file \"%s\": %m",
  							BACKUP_LABEL_FILE)));
! 		}
! 		labelfile = palloc(statbuf.st_size + 1);
! 		r = fread(labelfile, statbuf.st_size, 1, lfp);
! 		labelfile[statbuf.st_size] = '\0';
! 
! 		/*
! 		 * Close and remove the backup label file
! 		 */
! 		if (r != 1 || ferror(lfp) || FreeFile(lfp))
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not read file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
! 		if (unlink(BACKUP_LABEL_FILE) != 0)
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not remove file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
  	}
  
  	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
! 	if (sscanf(labelfile, "START WAL LOCATION: %X/%X (file %24s)%c",
  			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
  			   &ch) != 4 || ch != '\n')
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	remaining = strchr(labelfile, '\n') + 1; /* %n is not portable enough */
  
  	/*
  	 * Write the backup-end xlog record
***************
*** 8665,8672 **** do_pg_stop_backup(void)
  	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
  			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
  	/* transfer remaining lines from label to history file */
! 	while ((ich = fgetc(lfp)) != EOF)
! 		fputc(ich, fp);
  	fprintf(fp, "STOP TIME: %s\n", strfbuf);
  	if (fflush(fp) || ferror(fp) || FreeFile(fp))
  		ereport(ERROR,
--- 8775,8781 ----
  	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
  			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
  	/* transfer remaining lines from label to history file */
! 	fprintf(fp, "%s", remaining);
  	fprintf(fp, "STOP TIME: %s\n", strfbuf);
  	if (fflush(fp) || ferror(fp) || FreeFile(fp))
  		ereport(ERROR,
***************
*** 8675,8694 **** do_pg_stop_backup(void)
  						histfilepath)));
  
  	/*
- 	 * Close and remove the backup label file
- 	 */
- 	if (ferror(lfp) || FreeFile(lfp))
- 		ereport(ERROR,
- 				(errcode_for_file_access(),
- 				 errmsg("could not read file \"%s\": %m",
- 						BACKUP_LABEL_FILE)));
- 	if (unlink(BACKUP_LABEL_FILE) != 0)
- 		ereport(ERROR,
- 				(errcode_for_file_access(),
- 				 errmsg("could not remove file \"%s\": %m",
- 						BACKUP_LABEL_FILE)));
- 
- 	/*
  	 * Clean out any no-longer-needed history files.  As a side effect, this
  	 * will post a .ready file for the newly created history file, notifying
  	 * the archiver that history file may be archived immediately.
--- 8784,8789 ----
***************
*** 8769,8796 **** do_pg_stop_backup(void)
  /*
   * do_pg_abort_backup: abort a running backup
   *
!  * This does just the most basic steps of pg_stop_backup(), by taking the
   * system out of backup mode, thus making it a lot more safe to call from
   * an error handler.
   */
  void
  do_pg_abort_backup(void)
  {
- 	/*
- 	 * OK to clear forcePageWrites
- 	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
- 
- 	/*
- 	 * Remove backup label file
- 	 */
- 	if (unlink(BACKUP_LABEL_FILE) != 0)
- 		ereport(ERROR,
- 				(errcode_for_file_access(),
- 				 errmsg("could not remove file \"%s\": %m",
- 						BACKUP_LABEL_FILE)));
  }
  
  /*
--- 8864,8884 ----
  /*
   * do_pg_abort_backup: abort a running backup
   *
!  * This does just the most basic steps of do_pg_stop_backup(), by taking the
   * system out of backup mode, thus making it a lot more safe to call from
   * an error handler.
+  *
+  * NB: This is only for aborting a non-exclusive backup that doesn't write
+  * backup_label. A backup started with pg_stop_backup() needs to be finished
+  * with pg_stop_backup().
   */
  void
  do_pg_abort_backup(void)
  {
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.forcePageWrites > (XLogCtl->Insert.exclusiveBackup ? 1 : 0));
! 	XLogCtl->Insert.forcePageWrites--;
  	LWLockRelease(WALInsertLock);
  }
  
  /*
*** a/src/backend/replication/basebackup.c
--- b/src/backend/replication/basebackup.c
***************
*** 40,52 **** typedef struct
  }	basebackup_options;
  
  
  static int64 sendDir(char *path, int basepathlen, bool sizeonly);
! static void sendFile(char *path, int basepathlen, struct stat * statbuf);
  static void _tarWriteHeader(char *filename, char *linktarget,
  				struct stat * statbuf);
  static void send_int8_string(StringInfoData *buf, int64 intval);
  static void SendBackupHeader(List *tablespaces);
- static void SendBackupDirectory(char *location, char *spcoid);
  static void base_backup_cleanup(int code, Datum arg);
  static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir);
  static void parse_basebackup_options(List *options, basebackup_options *opt);
--- 40,53 ----
  }	basebackup_options;
  
  
+ static void sendLabelFile(const char *content);
  static int64 sendDir(char *path, int basepathlen, bool sizeonly);
! static void sendFile(char *readfilename, char *tarfilename,
! 		 struct stat * statbuf);
  static void _tarWriteHeader(char *filename, char *linktarget,
  				struct stat * statbuf);
  static void send_int8_string(StringInfoData *buf, int64 intval);
  static void SendBackupHeader(List *tablespaces);
  static void base_backup_cleanup(int code, Datum arg);
  static void perform_base_backup(basebackup_options *opt, DIR *tblspcdir);
  static void parse_basebackup_options(List *options, basebackup_options *opt);
***************
*** 78,84 **** base_backup_cleanup(int code, Datum arg)
  static void
  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  {
! 	do_pg_start_backup(opt->label, opt->fastcheckpoint);
  
  	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
  	{
--- 79,87 ----
  static void
  perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  {
! 	char *labelfile;
! 
! 	do_pg_start_backup(opt->label, opt->fastcheckpoint, &labelfile);
  
  	PG_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
  	{
***************
*** 128,140 **** perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
  		foreach(lc, tablespaces)
  		{
  			tablespaceinfo *ti = (tablespaceinfo *) lfirst(lc);
  
! 			SendBackupDirectory(ti->path, ti->oid);
  		}
  	}
  	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
  
! 	do_pg_stop_backup();
  }
  
  /*
--- 131,162 ----
  		foreach(lc, tablespaces)
  		{
  			tablespaceinfo *ti = (tablespaceinfo *) lfirst(lc);
+ 			StringInfoData buf;
+ 
+ 			/* Send CopyOutResponse message */
+ 			pq_beginmessage(&buf, 'H');
+ 			pq_sendbyte(&buf, 0);		/* overall format */
+ 			pq_sendint(&buf, 0, 2);		/* natts */
+ 			pq_endmessage(&buf);
+ 
+ 			/*
+ 			 * In the main tar, include the backup_label first.
+ 			 */
+ 			if (ti->path == NULL)
+ 			{
+ 				sendLabelFile(labelfile);
+ 				sendDir(".", 1, false);
+ 			}
+ 			else
+ 				sendDir(ti->path, strlen(ti->path), false);
  
! 			/* Send CopyDone message */
! 			pq_putemptymessage('c');
  		}
  	}
  	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
  
! 	do_pg_stop_backup(labelfile);
  }
  
  /*
***************
*** 192,199 **** parse_basebackup_options(List *options, basebackup_options *opt)
  /*
   * SendBaseBackup() - send a complete base backup.
   *
!  * The function will take care of running pg_start_backup() and
!  * pg_stop_backup() for the user.
   */
  void
  SendBaseBackup(BaseBackupCmd *cmd)
--- 214,222 ----
  /*
   * SendBaseBackup() - send a complete base backup.
   *
!  * The function will put the system into backup mode like pg_start_backup()
!  * does, so that the backup is consistent even though we read directly from
!  * the filesystem, bypassing the buffer cache.
   */
  void
  SendBaseBackup(BaseBackupCmd *cmd)
***************
*** 316,342 **** SendBackupHeader(List *tablespaces)
  	pq_puttextmessage('C', "SELECT");
  }
  
  static void
! SendBackupDirectory(char *location, char *spcoid)
  {
! 	StringInfoData buf;
  
! 	/* Send CopyOutResponse message */
! 	pq_beginmessage(&buf, 'H');
! 	pq_sendbyte(&buf, 0);		/* overall format */
! 	pq_sendint(&buf, 0, 2);		/* natts */
! 	pq_endmessage(&buf);
  
! 	/* tar up the data directory if NULL, otherwise the tablespace */
! 	sendDir(location == NULL ? "." : location,
! 			location == NULL ? 1 : strlen(location),
! 			false);
  
! 	/* Send CopyDone message */
! 	pq_putemptymessage('c');
! }
  
  
  static int64
  sendDir(char *path, int basepathlen, bool sizeonly)
  {
--- 339,390 ----
  	pq_puttextmessage('C', "SELECT");
  }
  
+ /*
+  * Include a label file with the given content in the output tar stream.
+  */
  static void
! sendLabelFile(const char *content)
  {
! 	struct stat statbuf;
! 	int pad, len;
  
! 	len = strlen(content);
  
! 	/*
! 	 * Construct a stat struct for the backup_label file we're injecting
! 	 * in the tar.
! 	 */
! 	/* Windows doesn't have the concept of uid and gid */
! #ifdef WIN32
! 	statbuf.st_uid = 0;
! 	statbuf.st_gid = 0;
! #else
! 	statbuf.st_uid = geteuid();
! 	statbuf.st_gid = getegid();
! #endif
! 	statbuf.st_mtime = time(NULL);
! 	statbuf.st_mode = S_IRUSR | S_IWUSR;
! 	statbuf.st_size = len;
  
! 	_tarWriteHeader(BACKUP_LABEL_FILE, NULL, &statbuf);
! 	/* Send the contents as a CopyData message */
! 	pq_putmessage('d', content, len);
  
+ 	/* Pad to 512 byte boundary, per tar format requirements */
+ 	pad = ((len + 511) & ~511) - len;
+ 	if (pad > 0)
+ 	{
+ 		char buf[512];
+ 		MemSet(buf, 0, pad);
+ 		pq_putmessage('d', buf, pad);
+ 	}
+ }
  
+ /*
+  * Include all files from the given directory in the output tar stream. If
+  * 'sizeonly' is true, we just calculate a total length and return ig, without
+  * actually sending anything.
+  */
  static int64
  sendDir(char *path, int basepathlen, bool sizeonly)
  {
***************
*** 360,365 **** sendDir(char *path, int basepathlen, bool sizeonly)
--- 408,421 ----
  			continue;
  
  		/*
+ 		 * If there's a backup_label file, it belongs to a backup started by
+ 		 * the user with pg_start_backup(). It is *not* correct for this
+ 		 * backup, our backup_label is injected into the tar separately.
+ 		 */
+ 		if (strcmp(de->d_name, BACKUP_LABEL_FILE) == 0)
+ 			continue;
+ 
+ 		/*
  		 * Check if the postmaster has signaled us to exit, and abort
  		 * with an error in that case. The error handler further up
  		 * will call do_pg_abort_backup() for us.
***************
*** 445,451 **** sendDir(char *path, int basepathlen, bool sizeonly)
  			/* Add size, rounded up to 512byte block */
  			size += ((statbuf.st_size + 511) & ~511);
  			if (!sizeonly)
! 				sendFile(pathbuf, basepathlen, &statbuf);
  			size += 512;		/* Size of the header of the file */
  		}
  		else
--- 501,507 ----
  			/* Add size, rounded up to 512byte block */
  			size += ((statbuf.st_size + 511) & ~511);
  			if (!sizeonly)
! 				sendFile(pathbuf, pathbuf + basepathlen + 1, &statbuf);
  			size += 512;		/* Size of the header of the file */
  		}
  		else
***************
*** 503,509 **** _tarChecksum(char *header)
  
  /* Given the member, write the TAR header & send the file */
  static void
! sendFile(char *filename, int basepathlen, struct stat * statbuf)
  {
  	FILE	   *fp;
  	char		buf[32768];
--- 559,565 ----
  
  /* Given the member, write the TAR header & send the file */
  static void
! sendFile(char *readfilename, char *tarfilename, struct stat *statbuf)
  {
  	FILE	   *fp;
  	char		buf[32768];
***************
*** 511,521 **** sendFile(char *filename, int basepathlen, struct stat * statbuf)
  	pgoff_t		len = 0;
  	size_t		pad;
  
! 	fp = AllocateFile(filename, "rb");
  	if (fp == NULL)
  		ereport(ERROR,
  				(errcode(errcode_for_file_access()),
! 				 errmsg("could not open file \"%s\": %m", filename)));
  
  	/*
  	 * Some compilers will throw a warning knowing this test can never be true
--- 567,577 ----
  	pgoff_t		len = 0;
  	size_t		pad;
  
! 	fp = AllocateFile(readfilename, "rb");
  	if (fp == NULL)
  		ereport(ERROR,
  				(errcode(errcode_for_file_access()),
! 				 errmsg("could not open file \"%s\": %m", readfilename)));
  
  	/*
  	 * Some compilers will throw a warning knowing this test can never be true
***************
*** 524,532 **** sendFile(char *filename, int basepathlen, struct stat * statbuf)
  	if (statbuf->st_size > MAX_TAR_MEMBER_FILELEN)
  		ereport(ERROR,
  				(errmsg("archive member \"%s\" too large for tar format",
! 						filename)));
  
! 	_tarWriteHeader(filename + basepathlen + 1, NULL, statbuf);
  
  	while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
  	{
--- 580,588 ----
  	if (statbuf->st_size > MAX_TAR_MEMBER_FILELEN)
  		ereport(ERROR,
  				(errmsg("archive member \"%s\" too large for tar format",
! 						tarfilename)));
  
! 	_tarWriteHeader(tarfilename, NULL, statbuf);
  
  	while ((cnt = fread(buf, 1, Min(sizeof(buf), statbuf->st_size - len), fp)) > 0)
  	{
*** a/src/include/access/xlog.h
--- b/src/include/access/xlog.h
***************
*** 312,319 **** extern void HandleStartupProcInterrupts(void);
  extern void StartupProcessMain(void);
  extern void WakeupRecovery(void);
  
! extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast);
! extern XLogRecPtr do_pg_stop_backup(void);
  extern void do_pg_abort_backup(void);
  
  #endif   /* XLOG_H */
--- 312,326 ----
  extern void StartupProcessMain(void);
  extern void WakeupRecovery(void);
  
! /*
!  * Starting/stopping a base backup
!  */
! extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile);
! extern XLogRecPtr do_pg_stop_backup(char *labelfile);
  extern void do_pg_abort_backup(void);
  
+ /* File path names (all relative to $PGDATA) */
+ #define BACKUP_LABEL_FILE		"backup_label"
+ #define BACKUP_LABEL_OLD		"backup_label.old"
+ 
  #endif   /* XLOG_H */
#33Magnus Hagander
magnus@hagander.net
In reply to: Heikki Linnakangas (#32)
Re: Allowing multiple concurrent base backups

On Mon, Jan 24, 2011 at 17:52, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 13.01.2011 23:32, Heikki Linnakangas wrote:

Anyway, here's an updated patch with all the known issues fixed.

Another updated patch. Fixed bitrot, and addressed the privilege issue
Fujii-san raised here:
http://archives.postgresql.org/message-id/4D380560.3040400@enterprisedb.com.
I changed the privilege checks so that pg_start/stop_backup() functions
require superuser privileges again, but not for a base backup via the
replication protocol (replication privilege is needed to establish a
replication connection to begin with).

I'm not entirely sure the replication privilege removal is correct.
Doing that, it's no longer possible to deploy a slave *without* using
pg_basebackup, unless you are superuser. Do we really want to put that
restriction back in?

(And if we do, the docs proably need an update...)

I can't see an explicit check for the user ttempting to do
pg_stop_backup() when there is a nonexclusive backup running? Maybe
I'm reading it wrong? The case being when a user has started a backup
with pg_basebackup but then connects and manually does a
pg_stop_backup. ISTM it drops us ina codepath that just doesn't do the
decrement, but also doesn't throw an error?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#34Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Magnus Hagander (#33)
Re: Allowing multiple concurrent base backups

On 24.01.2011 20:22, Magnus Hagander wrote:

On Mon, Jan 24, 2011 at 17:52, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Another updated patch. Fixed bitrot, and addressed the privilege issue
Fujii-san raised here:
http://archives.postgresql.org/message-id/4D380560.3040400@enterprisedb.com.
I changed the privilege checks so that pg_start/stop_backup() functions
require superuser privileges again, but not for a base backup via the
replication protocol (replication privilege is needed to establish a
replication connection to begin with).

I'm not entirely sure the replication privilege removal is correct.
Doing that, it's no longer possible to deploy a slave *without* using
pg_basebackup, unless you are superuser. Do we really want to put that
restriction back in?

Hmm, I thought we do, I thought that was changed just to make
pg_basebackup work without superuser privileges. But I guess it makes
sense apart from that. So the question is, should 'replication'
privilege be enough to use pg_start/stop_backup(), or should we require
superuser access for that? In any case, replication privilege will be
enough for pg_basebackup.

Ok, I won't touch that. But then we'll need to decide what to do about
Fujii's observation
(http://archives.postgresql.org/pgsql-hackers/2011-01/msg01934.php):

Both the user with REPLICATION privilege and the superuser can
call pg_stop_backup. But only superuser can connect to the server
to cancel online backup during shutdown. The non-superuser with
REPLICATION privilege cannot. Is this behavior intentional? Or just
oversight?

I can't see an explicit check for the user ttempting to do
pg_stop_backup() when there is a nonexclusive backup running? Maybe
I'm reading it wrong? The case being when a user has started a backup
with pg_basebackup but then connects and manually does a
pg_stop_backup. ISTM it drops us ina codepath that just doesn't do the
decrement, but also doesn't throw an error?

It throws an error later when it won't find backup_label. For better or
worse, it's always been like that.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#35Magnus Hagander
magnus@hagander.net
In reply to: Heikki Linnakangas (#34)
Re: Allowing multiple concurrent base backups

On Mon, Jan 24, 2011 at 21:14, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 24.01.2011 20:22, Magnus Hagander wrote:

I can't see an explicit check for the user ttempting to do
pg_stop_backup() when there is a nonexclusive backup running? Maybe
I'm reading it wrong? The case being when a user has started a backup
with pg_basebackup but then connects and manually does a
pg_stop_backup. ISTM it drops us ina codepath that just doesn't do the
decrement, but also doesn't throw an error?

It throws an error later when it won't find backup_label. For better or
worse, it's always been like that.

Isn't that going to leave us in a broken state though? As in a
mistaken pg_stop_backup() will decrement the counter both breaking the
streaming base backup, and also possibly throwing an assert when that
one eventually wants to do it's do_pg_stop_backup()?

--
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

#36Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Magnus Hagander (#35)
1 attachment(s)
Re: Allowing multiple concurrent base backups

On 24.01.2011 22:31, Magnus Hagander wrote:

On Mon, Jan 24, 2011 at 21:14, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

On 24.01.2011 20:22, Magnus Hagander wrote:

I can't see an explicit check for the user ttempting to do
pg_stop_backup() when there is a nonexclusive backup running? Maybe
I'm reading it wrong? The case being when a user has started a backup
with pg_basebackup but then connects and manually does a
pg_stop_backup. ISTM it drops us ina codepath that just doesn't do the
decrement, but also doesn't throw an error?

It throws an error later when it won't find backup_label. For better or
worse, it's always been like that.

Isn't that going to leave us in a broken state though? As in a
mistaken pg_stop_backup() will decrement the counter both breaking the
streaming base backup, and also possibly throwing an assert when that
one eventually wants to do it's do_pg_stop_backup()?

Umm, no. pg_stop_backup() won't decrement the counter unless there's an
exclusive backup in operation according to the exclusiveBackup flag.

Hmm, perhaps the code would be more readable if instead of the
forcePageWrites counter that counts exclusive and non-exclusive backups,
and an exclusiveBackup boolean indicating if one of the in-progress
backups is an exclusive one, we had a counter that only counts
non-exclusive backups, plus a boolean indicating if an exclusive backup
is in progress in addition to them.

Attached is a patch for that (against master branch, including only xlog.c).

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

more-readable-forcePageWrites.patchtext/x-diff; name=more-readable-forcePageWrites.patchDownload
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 60,67 ****
  
  
  /* File path names (all relative to $PGDATA) */
- #define BACKUP_LABEL_FILE		"backup_label"
- #define BACKUP_LABEL_OLD		"backup_label.old"
  #define RECOVERY_COMMAND_FILE	"recovery.conf"
  #define RECOVERY_COMMAND_DONE	"recovery.done"
  
--- 60,65 ----
***************
*** 339,344 **** typedef struct XLogCtlInsert
--- 337,351 ----
  	char	   *currpos;		/* current insertion point in cache */
  	XLogRecPtr	RedoRecPtr;		/* current redo point for insertions */
  	bool		forcePageWrites;	/* forcing full-page writes for PITR? */
+ 
+ 	/*
+ 	 * exclusiveBackup is true if a backup started with pg_start_backup() is
+ 	 * in progress, and nonExclusiveBackups is a counter indicating the number
+ 	 * of streaming base backups currently in progress. forcePageWrites is
+ 	 * set to true, when either of these is non-zero.
+ 	 */
+ 	bool		exclusiveBackup;
+ 	int			nonExclusiveBackups;
  } XLogCtlInsert;
  
  /*
***************
*** 8352,8367 **** pg_start_backup(PG_FUNCTION_ARGS)
  
  	backupidstr = text_to_cstring(backupid);
  
! 	startpoint = do_pg_start_backup(backupidstr, fast);
  
  	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
  }
  
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast)
  {
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
--- 8359,8396 ----
  
  	backupidstr = text_to_cstring(backupid);
  
! 	startpoint = do_pg_start_backup(backupidstr, fast, NULL);
  
  	snprintf(startxlogstr, sizeof(startxlogstr), "%X/%X",
  			 startpoint.xlogid, startpoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(startxlogstr));
  }
  
+ /*
+  * do_pg_start_backup is the workhorse of the user-visible pg_start_backup()
+  * function. It creates the necessary starting checkpoint and constructs the
+  * backup label file.
+  * 
+  * There are two kind of backups: exclusive and non-exclusive. An exclusive
+  * backup is started with pg_start_backup(), and there can be only one active
+  * at a time. The backup label file of an exclusive backup is written to
+  * $PGDATA/backup_label, and it is removed by pg_stop_backup().
+  *
+  * A non-exclusive backup is used for the streaming base backups (see
+  * src/backend/replication/basebackup.c). The difference to exclusive backups
+  * is that the backup label file is not written to disk. Instead, its would-be
+  * contents are returned in *labelfile, and the caller is responsible for
+  * including it in the backup archive as 'backup_label'. There can be many
+  * non-exclusive backups active at the same time, and they don't conflict
+  * with an exclusive backup either.
+  *
+  * Every successfully started non-exclusive backup must be stopped by calling
+  * do_pg_stop_backup() or do_pg_abort_backup().
+  */
  XLogRecPtr
! do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
  {
+ 	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	checkpointloc;
  	XLogRecPtr	startpoint;
  	pg_time_t	stamp_time;
***************
*** 8371,8381 **** do_pg_start_backup(const char *backupidstr, bool fast)
  	uint32		_logSeg;
  	struct stat stat_buf;
  	FILE	   *fp;
  
! 	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 errmsg("must be superuser or replication role to run a backup")));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
--- 8400,8411 ----
  	uint32		_logSeg;
  	struct stat stat_buf;
  	FILE	   *fp;
+ 	StringInfoData labelfbuf;
  
! 	if (exclusive && !superuser())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 errmsg("must be superuser to run a backup")));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
***************
*** 8389,8394 **** do_pg_start_backup(const char *backupidstr, bool fast)
--- 8419,8430 ----
  			  errmsg("WAL level not sufficient for making an online backup"),
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
+ 	if (strlen(backupidstr) > MAXPGPATH)
+ 		ereport(ERROR,
+ 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ 				 errmsg("backup label too long (max %d bytes)",
+ 						MAXPGPATH)));
+ 
  	/*
  	 * Mark backup active in shared memory.  We must do full-page WAL writes
  	 * during an on-line backup even if not doing so at other times, because
***************
*** 8407,8420 **** do_pg_start_backup(const char *backupidstr, bool fast)
  	 * ensure adequate interlocking against XLogInsert().
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (XLogCtl->Insert.forcePageWrites)
  	{
! 		LWLockRelease(WALInsertLock);
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("a backup is already in progress"),
! 				 errhint("Run pg_stop_backup() and try again.")));
  	}
  	XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
--- 8443,8462 ----
  	 * ensure adequate interlocking against XLogInsert().
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (exclusive)
  	{
! 		if (XLogCtl->Insert.exclusiveBackup)
! 		{
! 			LWLockRelease(WALInsertLock);
! 			ereport(ERROR,
! 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 					 errmsg("a backup is already in progress"),
! 					 errhint("Run pg_stop_backup() and try again.")));
! 		}
! 		XLogCtl->Insert.exclusiveBackup = true;
  	}
+ 	else
+ 		XLogCtl->Insert.nonExclusiveBackups++;
  	XLogCtl->Insert.forcePageWrites = true;
  	LWLockRelease(WALInsertLock);
  
***************
*** 8432,8438 **** do_pg_start_backup(const char *backupidstr, bool fast)
  	RequestXLogSwitch();
  
  	/* Ensure we release forcePageWrites if fail below */
! 	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
  	{
  		/*
  		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
--- 8474,8480 ----
  	RequestXLogSwitch();
  
  	/* Ensure we release forcePageWrites if fail below */
! 	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
  	{
  		/*
  		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
***************
*** 8459,8512 **** do_pg_start_backup(const char *backupidstr, bool fast)
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
  		/* Use the log timezone here, not the session timezone */
  		stamp_time = (pg_time_t) time(NULL);
  		pg_strftime(strfbuf, sizeof(strfbuf),
  					"%Y-%m-%d %H:%M:%S %Z",
  					pg_localtime(&stamp_time, log_timezone));
  
  		/*
! 		 * Check for existing backup label --- implies a backup is already
! 		 * running.  (XXX given that we checked forcePageWrites above, maybe
! 		 * it would be OK to just unlink any such label file?)
  		 */
! 		if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
  		{
! 			if (errno != ENOENT)
  				ereport(ERROR,
  						(errcode_for_file_access(),
! 						 errmsg("could not stat file \"%s\": %m",
  								BACKUP_LABEL_FILE)));
  		}
  		else
! 			ereport(ERROR,
! 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 					 errmsg("a backup is already in progress"),
! 					 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
! 							 BACKUP_LABEL_FILE)));
! 
! 		/*
! 		 * Okay, write the file
! 		 */
! 		fp = AllocateFile(BACKUP_LABEL_FILE, "w");
! 		if (!fp)
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not create file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
! 		fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! 				startpoint.xlogid, startpoint.xrecoff, xlogfilename);
! 		fprintf(fp, "CHECKPOINT LOCATION: %X/%X\n",
! 				checkpointloc.xlogid, checkpointloc.xrecoff);
! 		fprintf(fp, "START TIME: %s\n", strfbuf);
! 		fprintf(fp, "LABEL: %s\n", backupidstr);
! 		if (fflush(fp) || ferror(fp) || FreeFile(fp))
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not write file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
  	}
! 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) 0);
  
  	/*
  	 * We're done.  As a convenience, return the starting WAL location.
--- 8501,8567 ----
  		XLByteToSeg(startpoint, _logId, _logSeg);
  		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
  
+ 		/*
+ 		 * Construct backup label file 
+ 		 */
+ 		initStringInfo(&labelfbuf);
+ 
  		/* Use the log timezone here, not the session timezone */
  		stamp_time = (pg_time_t) time(NULL);
  		pg_strftime(strfbuf, sizeof(strfbuf),
  					"%Y-%m-%d %H:%M:%S %Z",
  					pg_localtime(&stamp_time, log_timezone));
+ 		appendStringInfo(&labelfbuf, "START WAL LOCATION: %X/%X (file %s)\n",
+ 						 startpoint.xlogid, startpoint.xrecoff, xlogfilename);
+ 		appendStringInfo(&labelfbuf, "CHECKPOINT LOCATION: %X/%X\n",
+ 						 checkpointloc.xlogid, checkpointloc.xrecoff);
+ 		appendStringInfo(&labelfbuf, "START TIME: %s\n", strfbuf);
+ 		appendStringInfo(&labelfbuf, "LABEL: %s\n", backupidstr);
  
  		/*
! 		 * Okay, write the file, or return its contents to caller.
  		 */
! 		if (exclusive)
  		{
! 			/*
! 			 * Check for existing backup label --- implies a backup is already
! 			 * running.  (XXX given that we checked exclusiveBackup above, maybe
! 			 * it would be OK to just unlink any such label file?)
! 			 */
! 			if (stat(BACKUP_LABEL_FILE, &stat_buf) != 0)
! 			{
! 				if (errno != ENOENT)
! 					ereport(ERROR,
! 							(errcode_for_file_access(),
! 							 errmsg("could not stat file \"%s\": %m",
! 									BACKUP_LABEL_FILE)));
! 			}
! 			else
! 				ereport(ERROR,
! 						(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 						 errmsg("a backup is already in progress"),
! 						 errhint("If you're sure there is no backup in progress, remove file \"%s\" and try again.",
! 								 BACKUP_LABEL_FILE)));
! 
! 			fp = AllocateFile(BACKUP_LABEL_FILE, "w");
! 
! 			if (!fp)
  				ereport(ERROR,
  						(errcode_for_file_access(),
! 						 errmsg("could not create file \"%s\": %m",
  								BACKUP_LABEL_FILE)));
+ 			fwrite(labelfbuf.data, labelfbuf.len, 1, fp);
+ 			if (fflush(fp) || ferror(fp) || FreeFile(fp))
+ 				ereport(ERROR,
+ 						(errcode_for_file_access(),
+ 						 errmsg("could not write file \"%s\": %m",
+ 								BACKUP_LABEL_FILE)));
+ 			pfree(labelfbuf.data);
  		}
  		else
! 			*labelfile = labelfbuf.data;
  	}
! 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
  
  	/*
  	 * We're done.  As a convenience, return the starting WAL location.
***************
*** 8518,8526 **** do_pg_start_backup(const char *backupidstr, bool fast)
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
! 	/* Turn off forcePageWrites on failure */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
  }
  
--- 8573,8592 ----
  static void
  pg_start_backup_callback(int code, Datum arg)
  {
! 	bool exclusive = DatumGetBool(arg);
! 
! 	/* Update backup counters and forcePageWrites on failure */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (exclusive)
! 	{
! 		Assert(XLogCtl->Insert.exclusiveBackup);
! 		XLogCtl->Insert.exclusiveBackup = false;
! 	}
! 	else
! 		XLogCtl->Insert.nonExclusiveBackups--;
! 
! 	if (!XLogCtl->Insert.exclusiveBackup && XLogCtl->Insert.nonExclusiveBackups == 0)
! 		XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
  }
  
***************
*** 8543,8558 **** pg_stop_backup(PG_FUNCTION_ARGS)
  	XLogRecPtr	stoppoint;
  	char		stopxlogstr[MAXFNAMELEN];
  
! 	stoppoint = do_pg_stop_backup();
  
  	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
  			 stoppoint.xlogid, stoppoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
  }
  
  XLogRecPtr
! do_pg_stop_backup(void)
  {
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
--- 8609,8632 ----
  	XLogRecPtr	stoppoint;
  	char		stopxlogstr[MAXFNAMELEN];
  
! 	stoppoint = do_pg_stop_backup(NULL);
  
  	snprintf(stopxlogstr, sizeof(stopxlogstr), "%X/%X",
  			 stoppoint.xlogid, stoppoint.xrecoff);
  	PG_RETURN_TEXT_P(cstring_to_text(stopxlogstr));
  }
  
+ /*
+  * do_pg_start_backup is the workhorse of the user-visible pg_stop_backup()
+  * function.
+ 
+  * If labelfile is NULL, this stops an exclusive backup. Otherwise this stops
+  * the non-exclusive backup specified by 'labelfile'.
+  */
  XLogRecPtr
! do_pg_stop_backup(char *labelfile)
  {
+ 	bool		exclusive = (labelfile == NULL);
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
***************
*** 8568,8582 **** do_pg_stop_backup(void)
  	FILE	   *lfp;
  	FILE	   *fp;
  	char		ch;
- 	int			ich;
  	int			seconds_before_warning;
  	int			waits = 0;
  	bool		reported_waiting = false;
  
! 	if (!superuser() && !is_authenticated_user_replication_role())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 (errmsg("must be superuser or replication role to run a backup"))));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
--- 8642,8656 ----
  	FILE	   *lfp;
  	FILE	   *fp;
  	char		ch;
  	int			seconds_before_warning;
  	int			waits = 0;
  	bool		reported_waiting = false;
+ 	char	   *remaining;
  
! 	if (exclusive && !superuser())
  		ereport(ERROR,
  				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
! 				 errmsg("must be superuser to run a backup")));
  
  	if (RecoveryInProgress())
  		ereport(ERROR,
***************
*** 8591,8628 **** do_pg_stop_backup(void)
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
! 	 * OK to clear forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
  
! 	/*
! 	 * Open the existing label file
! 	 */
! 	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
! 	if (!lfp)
  	{
! 		if (errno != ENOENT)
  			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not read file \"%s\": %m",
  							BACKUP_LABEL_FILE)));
! 		ereport(ERROR,
! 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 				 errmsg("a backup is not in progress")));
  	}
  
  	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
! 	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %24s)%c",
  			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
  			   &ch) != 4 || ch != '\n')
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
  
  	/*
  	 * Write the backup-end xlog record
--- 8665,8749 ----
  				 errhint("wal_level must be set to \"archive\" or \"hot_standby\" at server start.")));
  
  	/*
! 	 * OK to update backup counters and forcePageWrites
  	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	if (exclusive)
! 		XLogCtl->Insert.exclusiveBackup = false;
! 	else
! 	{
! 		/*
! 		 * The user-visible pg_start/stop_backup() functions that operate on
! 		 * exclusive backups can be called at any time, but for non-exclusive
! 		 * backups, it is expected that each do_pg_start_backup() call is
! 		 * matched by exactly one do_pg_stop_backup() call.
! 		 */
! 		Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 		XLogCtl->Insert.nonExclusiveBackups--;
! 	}
! 
! 	if (!XLogCtl->Insert.exclusiveBackup && XLogCtl->Insert.nonExclusiveBackups == 0)
! 		XLogCtl->Insert.forcePageWrites = false;
  	LWLockRelease(WALInsertLock);
  
! 	if (exclusive)
  	{
! 		/*
! 		 * Read the existing label file into memory.
! 		 */
! 		struct	stat statbuf;
! 		int		r;
! 
! 		if (stat(BACKUP_LABEL_FILE, &statbuf))
! 		{
! 			if (errno != ENOENT)
! 				ereport(ERROR,
! 						(errcode_for_file_access(),
! 						 errmsg("could not stat file \"%s\": %m",
! 								BACKUP_LABEL_FILE)));
! 			ereport(ERROR,
! 					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
! 					 errmsg("a backup is not in progress")));
! 		}
! 
! 		lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
! 		if (!lfp)
! 		{
  			ereport(ERROR,
  					(errcode_for_file_access(),
  					 errmsg("could not read file \"%s\": %m",
  							BACKUP_LABEL_FILE)));
! 		}
! 		labelfile = palloc(statbuf.st_size + 1);
! 		r = fread(labelfile, statbuf.st_size, 1, lfp);
! 		labelfile[statbuf.st_size] = '\0';
! 
! 		/*
! 		 * Close and remove the backup label file
! 		 */
! 		if (r != 1 || ferror(lfp) || FreeFile(lfp))
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not read file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
! 		if (unlink(BACKUP_LABEL_FILE) != 0)
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not remove file \"%s\": %m",
! 							BACKUP_LABEL_FILE)));
  	}
  
  	/*
  	 * Read and parse the START WAL LOCATION line (this code is pretty crude,
  	 * but we are not expecting any variability in the file format).
  	 */
! 	if (sscanf(labelfile, "START WAL LOCATION: %X/%X (file %24s)%c",
  			   &startpoint.xlogid, &startpoint.xrecoff, startxlogfilename,
  			   &ch) != 4 || ch != '\n')
  		ereport(ERROR,
  				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
+ 	remaining = strchr(labelfile, '\n') + 1; /* %n is not portable enough */
  
  	/*
  	 * Write the backup-end xlog record
***************
*** 8665,8672 **** do_pg_stop_backup(void)
  	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
  			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
  	/* transfer remaining lines from label to history file */
! 	while ((ich = fgetc(lfp)) != EOF)
! 		fputc(ich, fp);
  	fprintf(fp, "STOP TIME: %s\n", strfbuf);
  	if (fflush(fp) || ferror(fp) || FreeFile(fp))
  		ereport(ERROR,
--- 8786,8792 ----
  	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
  			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
  	/* transfer remaining lines from label to history file */
! 	fprintf(fp, "%s", remaining);
  	fprintf(fp, "STOP TIME: %s\n", strfbuf);
  	if (fflush(fp) || ferror(fp) || FreeFile(fp))
  		ereport(ERROR,
***************
*** 8675,8694 **** do_pg_stop_backup(void)
  						histfilepath)));
  
  	/*
- 	 * Close and remove the backup label file
- 	 */
- 	if (ferror(lfp) || FreeFile(lfp))
- 		ereport(ERROR,
- 				(errcode_for_file_access(),
- 				 errmsg("could not read file \"%s\": %m",
- 						BACKUP_LABEL_FILE)));
- 	if (unlink(BACKUP_LABEL_FILE) != 0)
- 		ereport(ERROR,
- 				(errcode_for_file_access(),
- 				 errmsg("could not remove file \"%s\": %m",
- 						BACKUP_LABEL_FILE)));
- 
- 	/*
  	 * Clean out any no-longer-needed history files.  As a side effect, this
  	 * will post a .ready file for the newly created history file, notifying
  	 * the archiver that history file may be archived immediately.
--- 8795,8800 ----
***************
*** 8769,8796 **** do_pg_stop_backup(void)
  /*
   * do_pg_abort_backup: abort a running backup
   *
!  * This does just the most basic steps of pg_stop_backup(), by taking the
   * system out of backup mode, thus making it a lot more safe to call from
   * an error handler.
   */
  void
  do_pg_abort_backup(void)
  {
- 	/*
- 	 * OK to clear forcePageWrites
- 	 */
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	XLogCtl->Insert.forcePageWrites = false;
! 	LWLockRelease(WALInsertLock);
  
! 	/*
! 	 * Remove backup label file
! 	 */
! 	if (unlink(BACKUP_LABEL_FILE) != 0)
! 		ereport(ERROR,
! 				(errcode_for_file_access(),
! 				 errmsg("could not remove file \"%s\": %m",
! 						BACKUP_LABEL_FILE)));
  }
  
  /*
--- 8875,8899 ----
  /*
   * do_pg_abort_backup: abort a running backup
   *
!  * This does just the most basic steps of do_pg_stop_backup(), by taking the
   * system out of backup mode, thus making it a lot more safe to call from
   * an error handler.
+  *
+  * NB: This is only for aborting a non-exclusive backup that doesn't write
+  * backup_label. A backup started with pg_stop_backup() needs to be finished
+  * with pg_stop_backup().
   */
  void
  do_pg_abort_backup(void)
  {
  	LWLockAcquire(WALInsertLock, LW_EXCLUSIVE);
! 	Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
! 	XLogCtl->Insert.nonExclusiveBackups--;
  
! 	if (!XLogCtl->Insert.exclusiveBackup && XLogCtl->Insert.nonExclusiveBackups == 0)
! 		XLogCtl->Insert.forcePageWrites = false;
! 
! 	LWLockRelease(WALInsertLock);
  }
  
  /*
#37Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#34)
Re: Allowing multiple concurrent base backups

On Tue, Jan 25, 2011 at 5:14 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

I'm not entirely sure the replication privilege removal is correct.
Doing that, it's no longer possible to deploy a slave *without* using
pg_basebackup, unless you are superuser. Do we really want to put that
restriction back in?

Hmm, I thought we do, I thought that was changed just to make pg_basebackup
work without superuser privileges.

If we encourage users not to use the "replication" privilege to log in
the database, putting that restriction seems to be reasonable.

Ok, I won't touch that. But then we'll need to decide what to do about
Fujii's observation
(http://archives.postgresql.org/pgsql-hackers/2011-01/msg01934.php):

Yes. If we allow the "replication" users to call pg_start/stop_backup,
we also allow them to connect to the database even during shutdown
in order to cancel the backup.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#38Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#36)
Re: Allowing multiple concurrent base backups

On Tue, Jan 25, 2011 at 6:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, perhaps the code would be more readable if instead of the
forcePageWrites counter that counts exclusive and non-exclusive backups, and
an exclusiveBackup boolean indicating if one of the in-progress backups is
an exclusive one, we had a counter that only counts non-exclusive backups,
plus a boolean indicating if an exclusive backup is in progress in addition
to them.

Attached is a patch for that (against master branch, including only xlog.c).

I read this patch and previous-posted one. Those look good.

Comments:

+ * do_pg_start_backup is the workhorse of the user-visible pg_stop_backup()
+ * function.

Typo: s/do_pg_start_backup/do_pg_stop_backup

It's helpful to explain about this behavior in pg_basebackup.sgml or elsewhere.

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#39Fujii Masao
masao.fujii@gmail.com
In reply to: Fujii Masao (#38)
Re: Allowing multiple concurrent base backups

On Tue, Jan 25, 2011 at 1:02 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Jan 25, 2011 at 6:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, perhaps the code would be more readable if instead of the
forcePageWrites counter that counts exclusive and non-exclusive backups, and
an exclusiveBackup boolean indicating if one of the in-progress backups is
an exclusive one, we had a counter that only counts non-exclusive backups,
plus a boolean indicating if an exclusive backup is in progress in addition
to them.

Attached is a patch for that (against master branch, including only xlog.c).

I read this patch and previous-posted one. Those look good.

Comments:

+ * do_pg_start_backup is the workhorse of the user-visible pg_stop_backup()
+ * function.

Typo: s/do_pg_start_backup/do_pg_stop_backup

It's helpful to explain about this behavior in pg_basebackup.sgml or elsewhere.

When I read the patch, I found that pg_stop_backup removes the backup history
file as soon as it creates the file, if archive_mode is not enabled.
This looks like
oversight. We should prevent pg_stop_backup from removing the fresh history
file? Or we should prevent pg_stop_backup from creating the history
file from the
beginning since it's not used at all if archiving is disabled?
(If archiving is enabled, the history file can be used to clean the
archived files up).

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

#40Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#38)
Re: Allowing multiple concurrent base backups

On 25.01.2011 06:02, Fujii Masao wrote:

On Tue, Jan 25, 2011 at 6:02 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, perhaps the code would be more readable if instead of the
forcePageWrites counter that counts exclusive and non-exclusive backups, and
an exclusiveBackup boolean indicating if one of the in-progress backups is
an exclusive one, we had a counter that only counts non-exclusive backups,
plus a boolean indicating if an exclusive backup is in progress in addition
to them.

Attached is a patch for that (against master branch, including only xlog.c).

I read this patch and previous-posted one. Those look good.

Comments:

+ * do_pg_start_backup is the workhorse of the user-visible pg_stop_backup()
+ * function.

Typo: s/do_pg_start_backup/do_pg_stop_backup

It's helpful to explain about this behavior in pg_basebackup.sgml or elsewhere.

Thanks. I've committed this now, fixing that, and hopefully all the
other issues mentioned in this thread.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#41Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Fujii Masao (#39)
Re: Allowing multiple concurrent base backups

On 27.01.2011 15:15, Fujii Masao wrote:

When I read the patch, I found that pg_stop_backup removes the backup history
file as soon as it creates the file, if archive_mode is not enabled.
This looks like
oversight. We should prevent pg_stop_backup from removing the fresh history
file? Or we should prevent pg_stop_backup from creating the history
file from the
beginning since it's not used at all if archiving is disabled?
(If archiving is enabled, the history file can be used to clean the
archived files up).

Hmm, good point. It's harmless, but creating the history file in the
first place sure seems like a waste of time.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#42Fujii Masao
masao.fujii@gmail.com
In reply to: Heikki Linnakangas (#41)
1 attachment(s)
Re: Allowing multiple concurrent base backups

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, good point. It's harmless, but creating the history file in the first
place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.

--------------------
$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i &
done

$ cat test0/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/000000010000000000000002.000000B0.backup
--------------------

This would cause a serious problem. Because the backup-end record
which indicates the same "START WAL LOCATION" can be written by the
first backup before the other finishes. So we might think wrongly that
we've already reached a consistency state by reading the backup-end
record (written by the first backup) before reading the last required WAL
file.

/*
* Force a CHECKPOINT. Aside from being necessary to prevent torn
* page problems, this guarantees that two successive backup runs will
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*
* We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
* fast = true). Otherwise this can take awhile.
*/
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
(fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
actually doesn't ensure that the concurrent backups have the different
checkpoint locations. ISTM that we should change the above or elsewhere
to ensure that. Or we should include backup label name in the backup-end
record, to prevent a recovery from reading not-its-own backup-end record.

Thought?

Regards,

--
Fujii Masao
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

Attachments:

not_create_histfile_if_not_arch_v1.patchapplication/octet-stream; name=not_create_histfile_if_not_arch_v1.patchDownload
*** a/doc/src/sgml/backup.sgml
--- b/doc/src/sgml/backup.sgml
***************
*** 874,880 **** SELECT pg_stop_backup();
     <para>
      To make use of the backup, you will need to keep all the WAL
      segment files generated during and after the file system backup.
!     To aid you in doing this, the <function>pg_stop_backup</> function
      creates a <firstterm>backup history file</> that is immediately
      stored into the WAL archive area. This file is named after the first
      WAL segment file that you need for the file system backup.
--- 874,881 ----
     <para>
      To make use of the backup, you will need to keep all the WAL
      segment files generated during and after the file system backup.
!     To aid you in doing this, if archiving is enabled,
!     the <function>pg_stop_backup</> function
      creates a <firstterm>backup history file</> that is immediately
      stored into the WAL archive area. This file is named after the first
      WAL segment file that you need for the file system backup.
*** a/doc/src/sgml/func.sgml
--- b/doc/src/sgml/func.sgml
***************
*** 14029,14035 **** postgres=# select pg_start_backup('label_goes_here');
     <para>
      <function>pg_stop_backup</> removes the label file created by
      <function>pg_start_backup</>, and creates a backup history file in
!     the transaction log archive area.  The history file includes the label given to
      <function>pg_start_backup</>, the starting and ending transaction log locations for
      the backup, and the starting and ending times of the backup.  The return
      value is the backup's ending transaction log location (which again
--- 14029,14036 ----
     <para>
      <function>pg_stop_backup</> removes the label file created by
      <function>pg_start_backup</>, and creates a backup history file in
!     the transaction log archive area if archiving is enabled.
!     The history file includes the label given to
      <function>pg_start_backup</>, the starting and ending transaction log locations for
      the backup, and the starting and ending times of the backup.  The return
      value is the backup's ending transaction log location (which again
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 8636,8652 **** do_pg_stop_backup(char *labelfile)
  	XLogRecPtr	startpoint;
  	XLogRecPtr	stoppoint;
  	XLogRecData rdata;
- 	pg_time_t	stamp_time;
- 	char		strfbuf[128];
- 	char		histfilepath[MAXPGPATH];
  	char		startxlogfilename[MAXFNAMELEN];
- 	char		stopxlogfilename[MAXFNAMELEN];
  	char		lastxlogfilename[MAXFNAMELEN];
  	char		histfilename[MAXFNAMELEN];
  	uint32		_logId;
  	uint32		_logSeg;
  	FILE	   *lfp;
- 	FILE	   *fp;
  	char		ch;
  	int			seconds_before_warning;
  	int			waits = 0;
--- 8636,8647 ----
***************
*** 8769,8812 **** do_pg_stop_backup(char *labelfile)
  	 */
  	RequestXLogSwitch();
  
- 	XLByteToPrevSeg(stoppoint, _logId, _logSeg);
- 	XLogFileName(stopxlogfilename, ThisTimeLineID, _logId, _logSeg);
- 
- 	/* Use the log timezone here, not the session timezone */
- 	stamp_time = (pg_time_t) time(NULL);
- 	pg_strftime(strfbuf, sizeof(strfbuf),
- 				"%Y-%m-%d %H:%M:%S %Z",
- 				pg_localtime(&stamp_time, log_timezone));
- 
  	/*
! 	 * Write the backup history file
  	 */
! 	XLByteToSeg(startpoint, _logId, _logSeg);
! 	BackupHistoryFilePath(histfilepath, ThisTimeLineID, _logId, _logSeg,
! 						  startpoint.xrecoff % XLogSegSize);
! 	fp = AllocateFile(histfilepath, "w");
! 	if (!fp)
! 		ereport(ERROR,
! 				(errcode_for_file_access(),
! 				 errmsg("could not create file \"%s\": %m",
! 						histfilepath)));
! 	fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! 			startpoint.xlogid, startpoint.xrecoff, startxlogfilename);
! 	fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
! 			stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
! 	/* transfer remaining lines from label to history file */
! 	fprintf(fp, "%s", remaining);
! 	fprintf(fp, "STOP TIME: %s\n", strfbuf);
! 	if (fflush(fp) || ferror(fp) || FreeFile(fp))
! 		ereport(ERROR,
! 				(errcode_for_file_access(),
! 				 errmsg("could not write file \"%s\": %m",
! 						histfilepath)));
  
  	/*
  	 * Clean out any no-longer-needed history files.  As a side effect, this
  	 * will post a .ready file for the newly created history file, notifying
  	 * the archiver that history file may be archived immediately.
  	 */
  	CleanupBackupHistory();
  
--- 8764,8820 ----
  	 */
  	RequestXLogSwitch();
  
  	/*
! 	 * If archiving is enabled, write the backup history file.
  	 */
! 	if (XLogArchivingActive())
! 	{
! 		pg_time_t	stamp_time;
! 		char		strfbuf[128];
! 		char		histfilepath[MAXPGPATH];
! 		char		stopxlogfilename[MAXFNAMELEN];
! 		FILE	   *fp;
! 
! 		XLByteToPrevSeg(stoppoint, _logId, _logSeg);
! 		XLogFileName(stopxlogfilename, ThisTimeLineID, _logId, _logSeg);
! 
! 		/* Use the log timezone here, not the session timezone */
! 		stamp_time = (pg_time_t) time(NULL);
! 		pg_strftime(strfbuf, sizeof(strfbuf),
! 					"%Y-%m-%d %H:%M:%S %Z",
! 					pg_localtime(&stamp_time, log_timezone));
! 
! 		XLByteToSeg(startpoint, _logId, _logSeg);
! 		BackupHistoryFilePath(histfilepath, ThisTimeLineID, _logId, _logSeg,
! 							  startpoint.xrecoff % XLogSegSize);
! 		fp = AllocateFile(histfilepath, "w");
! 		if (!fp)
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not create file \"%s\": %m",
! 							histfilepath)));
! 		fprintf(fp, "START WAL LOCATION: %X/%X (file %s)\n",
! 				startpoint.xlogid, startpoint.xrecoff, startxlogfilename);
! 		fprintf(fp, "STOP WAL LOCATION: %X/%X (file %s)\n",
! 				stoppoint.xlogid, stoppoint.xrecoff, stopxlogfilename);
! 		/* transfer remaining lines from label to history file */
! 		fprintf(fp, "%s", remaining);
! 		fprintf(fp, "STOP TIME: %s\n", strfbuf);
! 		if (fflush(fp) || ferror(fp) || FreeFile(fp))
! 			ereport(ERROR,
! 					(errcode_for_file_access(),
! 					 errmsg("could not write file \"%s\": %m",
! 							histfilepath)));
! 	}
  
  	/*
  	 * Clean out any no-longer-needed history files.  As a side effect, this
  	 * will post a .ready file for the newly created history file, notifying
  	 * the archiver that history file may be archived immediately.
+ 	 *
+ 	 * We do this even if archiving is not enabled since there might be the
+ 	 * unnecessary history file which was created when the server was running
+ 	 * before with archiving enabled.
  	 */
  	CleanupBackupHistory();
  
#43Robert Haas
robertmhaas@gmail.com
In reply to: Fujii Masao (#42)
Re: Allowing multiple concurrent base backups

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao <masao.fujii@gmail.com> wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, good point. It's harmless, but creating the history file in the first
place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.

--------------------
$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i &
done

$ cat test0/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/000000010000000000000002.000000B0.backup
--------------------

This would cause a serious problem. Because the backup-end record
which indicates the same "START WAL LOCATION" can be written by the
first backup before the other finishes. So we might think wrongly that
we've already reached a consistency state by reading the backup-end
record (written by the first backup) before reading the last required WAL
file.

               /*
                * Force a CHECKPOINT.  Aside from being necessary to prevent torn
                * page problems, this guarantees that two successive backup runs will
                * have different checkpoint positions and hence different history
                * file names, even if nothing happened in between.
                *
                * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
                * fast = true).  Otherwise this can take awhile.
                */
               RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
                                                 (fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
actually doesn't ensure that the concurrent backups have the different
checkpoint locations. ISTM that we should change the above or elsewhere
to ensure that. Or we should include backup label name in the backup-end
record, to prevent a recovery from reading not-its-own backup-end record.

Thought?

This patch is on the 9.1 open items list, but I don't understand it
well enough to know whether it's correct. Can someone else pick it
up?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#44Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Robert Haas (#43)
Re: Allowing multiple concurrent base backups

On 17.03.2011 21:39, Robert Haas wrote:

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao<masao.fujii@gmail.com> wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, good point. It's harmless, but creating the history file in the first
place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.

--------------------
$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i&
done

$ cat test0/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/000000010000000000000002.000000B0.backup
--------------------

This would cause a serious problem. Because the backup-end record
which indicates the same "START WAL LOCATION" can be written by the
first backup before the other finishes. So we might think wrongly that
we've already reached a consistency state by reading the backup-end
record (written by the first backup) before reading the last required WAL
file.

/*
* Force a CHECKPOINT. Aside from being necessary to prevent torn
* page problems, this guarantees that two successive backup runs will
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*
* We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
* fast = true). Otherwise this can take awhile.
*/
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
(fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
actually doesn't ensure that the concurrent backups have the different
checkpoint locations. ISTM that we should change the above or elsewhere
to ensure that.

Yes, good point.

Or we should include backup label name in the backup-end
record, to prevent a recovery from reading not-its-own backup-end record.

Backup labels are not guaranteed to be unique either, so including
backup label in the backup-end-record doesn't solve the problem. But
something else like a backup-start counter in shared memory or process
id would work.

It won't make the history file names unique, though. Now that we use on
the end-of-backup record for detecting end-of-backup, the history files
are just for documenting purposes. Do we want to give up on history
files for backups performed with pg_basebackup? Or we can include the
backup counter or similar in the filename too.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

#45Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#44)
1 attachment(s)
Re: Allowing multiple concurrent base backups

On 18.03.2011 10:48, Heikki Linnakangas wrote:

On 17.03.2011 21:39, Robert Haas wrote:

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao<masao.fujii@gmail.com>
wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, good point. It's harmless, but creating the history file in the
first
place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.

--------------------
$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i&
done

$ cat test0/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/000000010000000000000002.000000B0.backup
--------------------

This would cause a serious problem. Because the backup-end record
which indicates the same "START WAL LOCATION" can be written by the
first backup before the other finishes. So we might think wrongly that
we've already reached a consistency state by reading the backup-end
record (written by the first backup) before reading the last required
WAL
file.

/*
* Force a CHECKPOINT. Aside from being necessary to prevent torn
* page problems, this guarantees that two successive backup runs will
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*
* We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
* fast = true). Otherwise this can take awhile.
*/
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
(fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
actually doesn't ensure that the concurrent backups have the different
checkpoint locations. ISTM that we should change the above or elsewhere
to ensure that.

Yes, good point.

Here's a patch based on that approach, ensuring that each base backup
uses a different checkpoint as the start location. I think I'll commit
this, rather than invent a new unique ID mechanism for backups. The
latter would need changes in recovery and control file too, and I don't
feel like tinkering with that at this stage.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

Attachments:

ensure-unique-backup-start-locations-1.patchtext/x-diff; name=ensure-unique-backup-start-locations-1.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 15af669..570f02b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -355,10 +355,13 @@ typedef struct XLogCtlInsert
 	 * exclusiveBackup is true if a backup started with pg_start_backup() is
 	 * in progress, and nonExclusiveBackups is a counter indicating the number
 	 * of streaming base backups currently in progress. forcePageWrites is
-	 * set to true when either of these is non-zero.
+	 * set to true when either of these is non-zero. lastBackupStart is the
+	 * latest checkpoint redo location used as a starting point for an online
+	 * backup.
 	 */
 	bool		exclusiveBackup;
 	int			nonExclusiveBackups;
+	XLogRecPtr	lastBackupStart;
 } XLogCtlInsert;
 
 /*
@@ -8809,6 +8812,19 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 						MAXPGPATH)));
 
 	/*
+	 * Force an XLOG file switch before the checkpoint, to ensure that the WAL
+	 * segment the checkpoint is written to doesn't contain pages with old
+	 * timeline IDs. That would otherwise happen if you called
+	 * pg_start_backup() right after restoring from a PITR archive: the first
+	 * WAL segment containing the startup checkpoint has pages in the
+	 * beginning with the old timeline ID. That can cause trouble at recovery:
+	 * we won't have a history file covering the old timeline if pg_xlog
+	 * directory was not included in the base backup and the WAL archive was
+	 * cleared too before starting the backup.
+	 */
+	RequestXLogSwitch();
+
+	/*
 	 * Mark backup active in shared memory.  We must do full-page WAL writes
 	 * during an on-line backup even if not doing so at other times, because
 	 * it's quite possible for the backup dump to obtain a "torn" (partially
@@ -8843,43 +8859,54 @@ do_pg_start_backup(const char *backupidstr, bool fast, char **labelfile)
 	XLogCtl->Insert.forcePageWrites = true;
 	LWLockRelease(WALInsertLock);
 
-	/*
-	 * Force an XLOG file switch before the checkpoint, to ensure that the WAL
-	 * segment the checkpoint is written to doesn't contain pages with old
-	 * timeline IDs. That would otherwise happen if you called
-	 * pg_start_backup() right after restoring from a PITR archive: the first
-	 * WAL segment containing the startup checkpoint has pages in the
-	 * beginning with the old timeline ID. That can cause trouble at recovery:
-	 * we won't have a history file covering the old timeline if pg_xlog
-	 * directory was not included in the base backup and the WAL archive was
-	 * cleared too before starting the backup.
-	 */
-	RequestXLogSwitch();
-
 	/* Ensure we release forcePageWrites if fail below */
 	PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 	{
-		/*
-		 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
-		 * page problems, this guarantees that two successive backup runs will
-		 * have different checkpoint positions and hence different history
-		 * file names, even if nothing happened in between.
-		 *
-		 * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
-		 * fast = true).  Otherwise this can take awhile.
-		 */
-		RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
-						  (fast ? CHECKPOINT_IMMEDIATE : 0));
+		bool gotUniqueStartpoint = false;
+		do
+		{
+			/*
+			 * Force a CHECKPOINT.	Aside from being necessary to prevent torn
+			 * page problems, this guarantees that two successive backup runs will
+			 * have different checkpoint positions and hence different history
+			 * file names, even if nothing happened in between.
+			 *
+			 * We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
+			 * fast = true).  Otherwise this can take awhile.
+			 */
+			RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
+							  (fast ? CHECKPOINT_IMMEDIATE : 0));
 
-		/*
-		 * Now we need to fetch the checkpoint record location, and also its
-		 * REDO pointer.  The oldest point in WAL that would be needed to
-		 * restore starting from the checkpoint is precisely the REDO pointer.
-		 */
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		checkpointloc = ControlFile->checkPoint;
-		startpoint = ControlFile->checkPointCopy.redo;
-		LWLockRelease(ControlFileLock);
+			/*
+			 * Now we need to fetch the checkpoint record location, and also its
+			 * REDO pointer.  The oldest point in WAL that would be needed to
+			 * restore starting from the checkpoint is precisely the REDO pointer.
+			 */
+			LWLockAcquire(ControlFileLock, LW_SHARED);
+			checkpointloc = ControlFile->checkPoint;
+			startpoint = ControlFile->checkPointCopy.redo;
+			LWLockRelease(ControlFileLock);
+
+			/*
+			 * If two base backups are started at the same time (in WAL
+			 * sender processes), we need to make sure that they use
+			 * different checkpoints as starting locations, because we use
+			 * the starting WAL location as a unique identifier for the base
+			 * backup in the end-of-backup WAL record and when we write the
+			 * backup history file. Perhaps it would be better generate a
+			 * separate unique ID for each backup instead of forcing another
+			 * checkpoint, but taking a checkpoint right after another is
+			 * not that expensive either because only few buffers have been
+			 * dirtied yet.
+			 */
+			LWLockAcquire(WALInsertLock, LW_SHARED);
+			if (XLByteLT(XLogCtl->Insert.lastBackupStart, startpoint))
+			{
+				XLogCtl->Insert.lastBackupStart = startpoint;
+				gotUniqueStartpoint = true;
+			}
+			LWLockRelease(WALInsertLock);
+		} while(!gotUniqueStartpoint);
 
 		XLByteToSeg(startpoint, _logId, _logSeg);
 		XLogFileName(xlogfilename, ThisTimeLineID, _logId, _logSeg);
#46Heikki Linnakangas
heikki.linnakangas@enterprisedb.com
In reply to: Heikki Linnakangas (#45)
Re: Allowing multiple concurrent base backups

On 18.03.2011 13:56, Heikki Linnakangas wrote:

On 18.03.2011 10:48, Heikki Linnakangas wrote:

On 17.03.2011 21:39, Robert Haas wrote:

On Mon, Jan 31, 2011 at 10:45 PM, Fujii Masao<masao.fujii@gmail.com>
wrote:

On Tue, Feb 1, 2011 at 1:31 AM, Heikki Linnakangas
<heikki.linnakangas@enterprisedb.com> wrote:

Hmm, good point. It's harmless, but creating the history file in the
first
place sure seems like a waste of time.

The attached patch changes pg_stop_backup so that it doesn't create
the backup history file if archiving is not enabled.

When I tested the multiple backups, I found that they can have the same
checkpoint location and the same history file name.

--------------------
$ for ((i=0; i<4; i++)); do
pg_basebackup -D test$i -c fast -x -l test$i&
done

$ cat test0/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test0

$ cat test1/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test1

$ cat test2/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test2

$ cat test3/backup_label
START WAL LOCATION: 0/20000B0 (file 000000010000000000000002)
CHECKPOINT LOCATION: 0/20000E8
START TIME: 2011-02-01 12:12:31 JST
LABEL: test3

$ ls archive/*.backup
archive/000000010000000000000002.000000B0.backup
--------------------

This would cause a serious problem. Because the backup-end record
which indicates the same "START WAL LOCATION" can be written by the
first backup before the other finishes. So we might think wrongly that
we've already reached a consistency state by reading the backup-end
record (written by the first backup) before reading the last required
WAL
file.

/*
* Force a CHECKPOINT. Aside from being necessary to prevent torn
* page problems, this guarantees that two successive backup runs will
* have different checkpoint positions and hence different history
* file names, even if nothing happened in between.
*
* We use CHECKPOINT_IMMEDIATE only if requested by user (via passing
* fast = true). Otherwise this can take awhile.
*/
RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT |
(fast ? CHECKPOINT_IMMEDIATE : 0));

This problem happens because the above code (in do_pg_start_backup)
actually doesn't ensure that the concurrent backups have the different
checkpoint locations. ISTM that we should change the above or elsewhere
to ensure that.

Yes, good point.

Here's a patch based on that approach, ensuring that each base backup
uses a different checkpoint as the start location. I think I'll commit
this, rather than invent a new unique ID mechanism for backups. The
latter would need changes in recovery and control file too, and I don't
feel like tinkering with that at this stage.

Ok, committed this.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com