Add recovery to pg_control and remove backup_label

david@pgmasters.net

about 2 years ago

In reply to: David G. Johnston (#2)

Re: Add recovery to pg_control and remove backup_label

On 10/26/23 17:27, David G. Johnston wrote:

On Thu, Oct 26, 2023 at 2:02 PM David Steele <david@pgmasters.net> wrote:

Are we planning on dealing with torn writes in the back branches in some
way or are we just throwing in the towel and saying the old method is
too error-prone to exist/retain

We are still planning to address this issue in the back branches.

and therefore the goal of the v17
changes is to not only provide a better way but also to ensure the old
way no longer works? It seems sufficient to change the output signature
of pg_backup_stop to accomplish that goal though I am pondering whether
an explicit check and error for seeing the backup_label file would be
warranted.

Well, if the backup tool is just copying the second column of output to
the backup_label, then it won't break. Of course in that case, restores
won't work correctly but you would not get an error. Testing would show
that it is not working properly and backup tools should certainly be tested.

Even so, I'm OK with an explicit check for backup_label. Let's see what
others think.

If we are going to solve the torn writes problem completely then while I
agree the new way is superior, implementing it doesn't have to mean
existing tools built to produce backup_label and rely upon the
pg_control in the data directory need to be forcibly broken.

It is a pretty easy update to any backup software that supports
non-exclusive backup. I was able to make the changes to pgBackRest in
less than an hour. We've made major changes to backup and restore in
almost every major version of PostgreSQL for a while: non-exclusive
backup in 9.6, dir renames in 10, variable WAL size in 11, new recovery
location in 12, hard recovery target errors in 13, and changes to
non-exclusive backup and removal of exclusive backup in 15. In 17 we are
already looking at new page and segment sizes.

I know that outputting pg_control as bytea is going to be a bit
controversial. Software that is using psql to run pg_backup_stop()
could use encode() to get pg_control as text and then decode it later.
Alternately, we could update ReadControlFile() to recognize a
base64-encoded pg_control file. I'm not sure dealing with binary data is
that much of a problem, though, and if the backup software gets it wrong
then recovery will fail on an invalid pg_control file.
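
As a rough illustration of the encode()/decode round trip described above,
here is a minimal Python sketch of the client side. It assumes the client
asked the server for encode(..., 'base64') output, which may contain line
breaks. The helper names (decode_pg_control, crc_matches) and the crc_offset
argument are hypothetical: the real offset is offsetof(ControlFileData, crc)
and depends on the server version, and the "<I" unpack assumes a
little-endian server. The CRC is CRC-32C computed over the bytes that precede
the crc field, as the patch in this thread does.

```python
import base64
import struct


def _build_crc32c_table():
    """Precompute the byte-wise lookup table for CRC-32C (Castagnoli)."""
    table = []
    for value in range(256):
        crc = value
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_CRC32C_TABLE = _build_crc32c_table()


def crc32c(data: bytes) -> int:
    """Return the CRC-32C of data (initial value and final XOR 0xFFFFFFFF)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def decode_pg_control(encoded: str) -> bytes:
    """Turn base64 text back into the raw bytes; embedded newlines are ignored."""
    return base64.b64decode(encoded)


def crc_matches(control: bytes, crc_offset: int) -> bool:
    """Compare the stored CRC with one computed over the bytes before it.

    crc_offset is an assumption supplied by the caller, not a known constant.
    """
    (stored,) = struct.unpack_from("<I", control, crc_offset)
    return crc32c(control[:crc_offset]) == stored


# Simulated round trip with a fake 64-byte payload followed by its CRC.
body = bytes(range(64))
blob = body + struct.pack("<I", crc32c(body)) + bytes(16)
wire = base64.encodebytes(blob).decode("ascii")  # line-wrapped base64 text
restored = decode_pg_control(wire)
print("round trip intact:", restored == blob)
print("crc ok:", crc_matches(restored, crc_offset=64))
```

A check like this would let a backup tool notice text-mode damage when the
backup is taken, rather than at restore time.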

Can we not figure out some way to place the relevant files onto the
server somewhere so that a simple "cp" command would work? Have
pg_backup_stop return paths instead of contents, those paths being
"$TEMP_DIR"/<random unique new directory>/pg_control.conf (and
tablespace_map)

Nobody has been able to figure this out, and some of us have been
thinking about it for years. It just doesn't seem possible to reliably
tell the difference between a cluster that was copied and one that
simply crashed.

If cp is really the backup tool being employed, I would recommend using
pg_basebackup. cp has flaws that could lead to corruption, and of course
does not at all take into account the archive required to make a backup
consistent, directories to be excluded, the order of copying pg_control
on backup from standby, etc., etc.

Backup/restore is not a simple endeavor and we don't do anyone favors
pretending that it is.

Regards,
-David

David G. Johnston

david.g.johnston@gmail.com

about 2 years ago

In reply to: David Steele (#3)

Re: Add recovery to pg_control and remove backup_label

On Fri, Oct 27, 2023 at 7:10 AM David Steele <david@pgmasters.net> wrote:

On 10/26/23 17:27, David G. Johnston wrote:

Can we not figure out some way to place the relevant files onto the
server somewhere so that a simple "cp" command would work? Have
pg_backup_stop return paths instead of contents, those paths being
"$TEMP_DIR"/<random unique new directory>/pg_control.conf (and
tablespace_map)

Nobody has been able to figure this out, and some of us have been
thinking about it for years. It just doesn't seem possible to reliably
tell the difference between a cluster that was copied and one that
simply crashed.

If cp is really the backup tool being employed, I would recommend using
pg_basebackup. cp has flaws that could lead to corruption, and of course
does not at all take into account the archive required to make a backup
consistent, directories to be excluded, the order of copying pg_control
on backup from standby, etc., etc.

Let me modify that to make it a bit more clear: I actually wouldn't care if
pg_backup_stop outputs an entire binary pg_control file as part of the SQL
resultset.

My proposal would be to, in addition, place in the temporary directory on
the server, Postgres-written versions of pg_control and tablespace_map as
part of the pg_backup_stop processing. The client software would then have
a choice. Write the contents of the SQL resultset to newly created binary
mode files in the destination, or, copy the server-written files from the
temporary directory to the destination.
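
A minimal Python sketch of the first option, writing the result set contents
to binary-mode files in the destination, might look like the following. The
function name and its arguments are hypothetical. The layout (pg_control
under global/, tablespace_map in the backup root unless it is empty) follows
the documentation change in the patch attached to this thread. It assumes the
caller already holds both values as bytes, with no newline translation
applied to the tablespace map text.

```python
import tempfile
from pathlib import Path
from typing import List


def write_backup_stop_files(backup_root: Path,
                            pg_control: bytes,
                            tablespace_map: bytes) -> List[Path]:
    """Store the pg_backup_stop() file contents in a backup directory.

    Both files are written byte for byte. tablespace_map is skipped when
    the server returned an empty value.
    """
    written = []

    # pg_control belongs in the global/ directory of the backup.
    global_dir = backup_root / "global"
    global_dir.mkdir(parents=True, exist_ok=True)
    control_path = global_dir / "pg_control"
    control_path.write_bytes(pg_control)
    written.append(control_path)

    # tablespace_map belongs in the root of the backup, if present at all.
    if tablespace_map:
        map_path = backup_root / "tablespace_map"
        map_path.write_bytes(tablespace_map)
        written.append(map_path)

    return written


# Usage with placeholder contents in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for path in write_backup_stop_files(Path(tmp), b"\x00\x01\r\n\xff", b""):
        print(path.relative_to(tmp))
```

A real tool would also need to fsync the files and keep the live
global/pg_control out of the copied data directory.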

That said, I'm starting to dislike that idea myself. It only really makes
sense if the files could be placed in the data directory but that isn't
doable given concurrent backups and not wanting to place the source server
into an inconsistent state.

David J.

david@pgmasters.net

about 2 years ago

In reply to: David G. Johnston (#4)

Re: Add recovery to pg_control and remove backup_label

On 10/27/23 13:45, David G. Johnston wrote:

Let me modify that to make it a bit more clear: I actually wouldn't care
if pg_backup_stop outputs an entire binary pg_control file as part of the
SQL resultset.

My proposal would be to, in addition, place in the temporary directory
on the server, Postgres-written versions of pg_control and
tablespace_map as part of the pg_backup_stop processing. The client
software would then have a choice. Write the contents of the SQL
resultset to newly created binary mode files in the destination, or,
copy the server-written files from the temporary directory to the
destination.

That said, I'm starting to dislike that idea myself. It only really
makes sense if the files could be placed in the data directory but that
isn't doable given concurrent backups and not wanting to place the
source server into an inconsistent state.

Pretty much the conclusion I have come to myself over the years.

Regards,
-David

david@pgmasters.net

about 2 years ago

In reply to: David Steele (#1)

1 attachment(s)

Re: Add recovery to pg_control and remove backup_label

Rebased on 151ffcf6.

Attachments:

v02-recovery-in-pgcontrol-remove-backuplabel.patch (text/plain; charset=UTF-8)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..6be8fb902c5 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      ready to archive.
     </para>
     <para>
-     <function>pg_backup_stop</function> will return one row with three
-     values. The second of these fields should be written to a file named
-     <filename>backup_label</filename> in the root directory of the backup. The
-     third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <function>pg_backup_stop</function> returns the
+     <filename>pg_control</filename> file, which must be stored in the
+     <filename>global</filename> directory of the backup. It also returns the
+     <filename>tablespace_map</filename> file, which should be written in the
+     root directory of the backup unless the field is empty. These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
     <para>
      Once the WAL segment files active during the backup are archived, you are
-     done.  The file identified by <function>pg_backup_stop</function>'s first return
+     done.  The file identified by <function>pg_backup_stop</function>'s
+     <parameter>lsn</parameter> return
      value is the last segment that is required to form a complete set of
      backup files.  On a primary, if <varname>archive_mode</varname> is enabled and the
      <literal>wait_for_archive</literal> parameter is <literal>true</literal>,
@@ -1013,7 +1014,15 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You must exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>pg_control_file</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This file contains the information
+    required to safely recover.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
@@ -1062,12 +1071,7 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
-    as well as the time at which <function>pg_backup_start</function> was run, and
-    the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
-    backup session the dump file came from.  The tablespace map file includes
+    The tablespace map file includes
     the symbolic link names as they exist in the directory
     <filename>pg_tblspc/</filename> and the full path of each symbolic link.
     These files are not merely for your information; their presence and
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index a6fcac0824a..6fd2eb8cc50 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26815,7 +26815,10 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <parameter>label</parameter> <type>text</type>
           <optional>, <parameter>fast</parameter> <type>boolean</type>
           </optional> )
-        <returnvalue>pg_lsn</returnvalue>
+        <returnvalue>record</returnvalue>
+        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>start</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Prepares the server to begin an on-line backup.  The only required
@@ -26827,6 +26830,13 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         as possible.  This forces an immediate checkpoint which will cause a
         spike in I/O operations, slowing any concurrently executing queries.
        </para>
+       <para>
+        The result columns contain information about the start of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        starting write-ahead log location, the
+        <parameter>timeline_id</parameter> column holds the starting timeline,
+        and the <parameter>start</parameter> column holds the starting timestamp.
+       </para>
        <para>
         This function is restricted to superusers by default, but other users
         can be granted EXECUTE to run the function.
@@ -26842,13 +26852,15 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <optional><parameter>wait_for_archive</parameter> <type>boolean</type>
           </optional> )
         <returnvalue>record</returnvalue>
-        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
-        <parameter>labelfile</parameter> <type>text</type>,
-        <parameter>spcmapfile</parameter> <type>text</type> )
+        ( <parameter>pg_control_file</parameter> <type>bytea</type>,
+        <parameter>tablespace_map_file</parameter> <type>text</type>,
+        <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>stop</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Finishes performing an on-line backup.  The desired contents of the
-        backup label file and the tablespace map file are returned as part of
+        pg_control file and the tablespace map file are returned as part of
         the result of the function and must be written to files in the
         backup area.  These files must not be written to the live data directory
         (doing so will cause PostgreSQL to fail to restart in the event of a
@@ -26880,13 +26892,16 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         backup.
        </para>
        <para>
-        The result of the function is a single record.
-        The <parameter>lsn</parameter> column holds the backup's ending
-        write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
-        stored as part of the backup and are required as part of the restore
-        process.
+        The result of the function is a single record. The first column returns
+        the contents of the <filename>pg_control</filename> file and the
+        second column returns the contents of the
+        <filename>tablespace_map</filename> file.  These must be stored as part
+        of the backup and are required as part of the restore process. The
+        remainder of the columns contain information about the end of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        ending write-ahead log location, the <parameter>timeline_id</parameter>
+        column holds the ending timeline, and the <parameter>stop</parameter>
+        column holds the ending timestamp.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec2..fa7816cf8ea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -74,6 +74,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_crc32c.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
@@ -5116,7 +5117,6 @@ StartupXLOG(void)
 	bool		wasShutdown;
 	bool		didCrash;
 	bool		haveTblspcMap;
-	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
 	TimeLineID	EndOfLogTLI;
 	TimeLineID	newTLI;
@@ -5240,13 +5240,14 @@ StartupXLOG(void)
 	/*
 	 * Prepare for WAL recovery if needed.
 	 *
-	 * InitWalRecovery analyzes the control file and the backup label file, if
-	 * any.  It updates the in-memory ControlFile buffer according to the
-	 * starting checkpoint, and sets InRecovery and ArchiveRecoveryRequested.
+	 * InitWalRecovery analyzes the control file and checks if backup recovery
+	 * has been requested.  It updates the in-memory ControlFile buffer
+	 * according to the starting checkpoint, and sets InRecovery and
+	 * ArchiveRecoveryRequested.
+	 *
 	 * It also applies the tablespace map file, if any.
 	 */
-	InitWalRecovery(ControlFile, &wasShutdown,
-					&haveBackupLabel, &haveTblspcMap);
+	InitWalRecovery(ControlFile, &wasShutdown, &haveTblspcMap);
 	checkPoint = ControlFile->checkPointCopy;
 
 	/* initialize shared memory variables from the checkpoint record */
@@ -5389,20 +5390,6 @@ StartupXLOG(void)
 		 */
 		UpdateControlFile();
 
-		/*
-		 * If there was a backup label file, it's done its job and the info
-		 * has now been propagated into pg_control.  We must get rid of the
-		 * label file so that if we crash during recovery, we'll pick up at
-		 * the latest recovery restartpoint instead of going all the way back
-		 * to the backup start point.  It seems prudent though to just rename
-		 * the file out of the way rather than delete it completely.
-		 */
-		if (haveBackupLabel)
-		{
-			unlink(BACKUP_LABEL_OLD);
-			durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
-		}
-
 		/*
 		 * If there was a tablespace_map file, it's done its job and the
 		 * symlinks have been created.  We must get rid of the map file so
@@ -5552,10 +5539,8 @@ StartupXLOG(void)
 	 * (at which point we reset backupStartPoint to be Invalid), for
 	 * backup-from-replica (which can't inject records into the WAL stream),
 	 * that point is when we reach the minRecoveryPoint in pg_control (which
-	 * we purposefully copy last when backing up from a replica).  For
-	 * pg_rewind (which creates a backup_label with a method of "pg_rewind")
-	 * or snapshot-style backups (which don't), backupEndRequired will be set
-	 * to false.
+	 * we purposefully copy last when backing up).  For pg_rewind or
+	 * snapshot-style backups, backupEndRequired will be set to false.
 	 *
 	 * Note: it is indeed okay to look at the local variable
 	 * LocalMinRecoveryPoint here, even though ControlFile->minRecoveryPoint
@@ -8725,11 +8710,33 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	ControlFileData *controlFileCopy = (ControlFileData *)state->controlFile;
 
 	Assert(state != NULL);
 
 	backup_stopped_in_recovery = RecoveryInProgress();
 
+	/*
+	 * Create a copy of control data and update it with fields required for
+	 * recovery. Also recalculate the CRC.
+	 */
+	memset(controlFileCopy, 0, PG_CONTROL_MAX_SAFE_SIZE);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFileCopy, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlFileCopy->backupRecoveryRequired = true;
+	controlFileCopy->backupFromStandby = backup_stopped_in_recovery;
+	controlFileCopy->backupEndRequired = true;
+	controlFileCopy->backupCheckPoint = state->checkpointloc;
+	controlFileCopy->backupStartPoint = state->startpoint;
+	controlFileCopy->backupStartPointTLI = state->starttli;
+
+	INIT_CRC32C(controlFileCopy->crc);
+	COMP_CRC32C(controlFileCopy->crc, controlFileCopy, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlFileCopy->crc);
+
 	/*
 	 * During recovery, we don't need to check WAL level. Because, if WAL
 	 * level is not sufficient, it's impossible to get here during recovery.
@@ -8831,11 +8838,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							 "Enable full_page_writes and run CHECKPOINT on the primary, "
 							 "and then try an online backup again.")));
 
-
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		state->stoppoint = ControlFile->minRecoveryPoint;
-		state->stoptli = ControlFile->minRecoveryPointTLI;
-		LWLockRelease(ControlFileLock);
+		state->stoppoint = controlFileCopy->minRecoveryPoint;
+		state->stoptli = controlFileCopy->minRecoveryPointTLI;
 	}
 	else
 	{
@@ -8877,7 +8881,7 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							histfilepath)));
 
 		/* Build and save the contents of the backup history file */
-		history_file = build_backup_content(state, true);
+		history_file = build_backup_content(state);
 		fprintf(fp, "%s", history_file);
 		pfree(history_file);
 
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae1..b61ed02bbbe 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -18,19 +18,19 @@
 #include "access/xlogbackup.h"
 
 /*
- * Build contents for backup_label or backup history file.
- *
- * When ishistoryfile is true, it creates the contents for a backup history
- * file, otherwise it creates contents for a backup_label file.
+ * Build contents for backup history file.
  *
  * Returns the result generated as a palloc'd string.
  */
 char *
-build_backup_content(BackupState *state, bool ishistoryfile)
+build_backup_content(BackupState *state)
 {
 	char		startstrbuf[128];
+	char		stopstrfbuf[128];
 	char		startxlogfile[MAXFNAMELEN]; /* backup start WAL file */
+	char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
 	XLogSegNo	startsegno;
+	XLogSegNo	stopsegno;
 	StringInfo	result = makeStringInfo();
 	char	   *data;
 
@@ -45,16 +45,10 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "START WAL LOCATION: %X/%X (file %s)\n",
 					 LSN_FORMAT_ARGS(state->startpoint), startxlogfile);
 
-	if (ishistoryfile)
-	{
-		char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
-		XLogSegNo	stopsegno;
-
-		XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
-		XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
-		appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
-						 LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
-	}
+	XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
+	XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
+	appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
+						LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
 
 	appendStringInfo(result, "CHECKPOINT LOCATION: %X/%X\n",
 					 LSN_FORMAT_ARGS(state->checkpointloc));
@@ -65,17 +59,12 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "LABEL: %s\n", state->name);
 	appendStringInfo(result, "START TIMELINE: %u\n", state->starttli);
 
-	if (ishistoryfile)
-	{
-		char		stopstrfbuf[128];
-
-		/* Use the log timezone here, not the session timezone */
-		pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
-					pg_localtime(&state->stoptime, log_timezone));
+	/* Use the log timezone here, not the session timezone */
+	pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
+				pg_localtime(&state->stoptime, log_timezone));
 
-		appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
-		appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
-	}
+	appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
+	appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
 
 	data = result->data;
 	pfree(result);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 45a70668b1c..2388a60a5e5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -53,7 +53,7 @@ static MemoryContext backupcontext = NULL;
  * pg_backup_start: set up for taking an on-line backup dump
  *
  * Essentially what this does is to create the contents required for the
- * backup_label file and the tablespace map.
+ * tablespace map.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -61,6 +61,10 @@ static MemoryContext backupcontext = NULL;
 Datum
 pg_backup_start(PG_FUNCTION_ARGS)
 {
+#define PG_BACKUP_START_V2_COLS 3
+	TupleDesc	tupdesc;
+	Datum		values[PG_BACKUP_START_V2_COLS] = {0};
+	bool		nulls[PG_BACKUP_START_V2_COLS] = {0};
 	text	   *backupid = PG_GETARG_TEXT_PP(0);
 	bool		fast = PG_GETARG_BOOL(1);
 	char	   *backupidstr;
@@ -69,6 +73,10 @@ pg_backup_start(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
+	/* Initialize attributes information in the tuple descriptor */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
 	if (status == SESSION_BACKUP_RUNNING)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -102,7 +110,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 	register_persistent_abort_backup_handler();
 	do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);
 
-	PG_RETURN_LSN(backup_state->startpoint);
+	values[0] = LSNGetDatum(backup_state->startpoint);
+	values[1] = Int64GetDatum(backup_state->starttli);
+	values[2] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->starttime));
+
+	/* Returns the record as Datum */
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
 
@@ -113,14 +126,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
  * allows the user to choose if they want to wait for the WAL to be archived
  * or if we should just return as soon as the WAL record is written.
  *
- * This function stops an in-progress backup, creates backup_label contents and
- * it returns the backup stop LSN, backup_label and tablespace_map contents.
+ * This function stops an in-progress backup and returns the backup stop LSN,
+ * pg_control and tablespace_map contents.
  *
- * The backup_label contains the user-supplied label string (typically this
- * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * The pg_control file contains the recovery information for the backup.  It is
+ * the caller's responsibility to write the pg_control and tablespace_map files
+ * in the data folder that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -128,12 +139,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 5
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
-	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -152,15 +163,16 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Stop the backup */
 	do_pg_backup_stop(backup_state, waitforarchive);
 
-	/* Build the contents of backup_label */
-	backup_label = build_backup_content(backup_state, false);
-
-	values[0] = LSNGetDatum(backup_state->stoppoint);
-	values[1] = CStringGetTextDatum(backup_label);
-	values[2] = CStringGetTextDatum(tablespace_map->data);
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_MAX_SAFE_SIZE);
 
-	/* Deallocate backup-related variables */
-	pfree(backup_label);
+	values[0] = PointerGetDatum(pg_control_bytea);
+	values[1] = CStringGetTextDatum(tablespace_map->data);
+	values[2] = LSNGetDatum(backup_state->stoppoint);
+	values[3] = Int64GetDatum(backup_state->stoptli);
+	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->stoptime));
 
 	/* Clean up the session-level state and its memory context */
 	backup_state = NULL;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666aa..b12d351ed23 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -6,7 +6,7 @@
  * This source file contains functions controlling WAL recovery.
  * InitWalRecovery() initializes the system for crash or archive recovery,
  * or standby mode, depending on configuration options and the state of
- * the control file and possible backup label file.  PerformWalRecovery()
+ * the control file and possible backup recovery.  PerformWalRecovery()
  * performs the actual WAL replay, calling the rmgr-specific redo routines.
  * FinishWalRecovery() performs end-of-recovery checks and cleanup actions,
  * and prepares information needed to initialize the WAL for writes.  In
@@ -152,11 +152,12 @@ static bool recovery_signal_file_found = false;
 
 /*
  * CheckPointLoc is the position of the checkpoint record that determines
- * where to start the replay.  It comes from the backup label file or the
- * control file.
+ * where to start the replay.  It comes from the control file, either from the
+ * default location or from a backup recovery field.
  *
- * RedoStartLSN is the checkpoint's REDO location, also from the backup label
- * file or the control file.  In standby mode, XLOG streaming usually starts
+ * RedoStartLSN is the checkpoint's REDO location, also from the default
+ * control file location or from a backup recovery field.  In standby mode,
+ * XLOG streaming usually starts
  * from the position where an invalid record was found.  But if we fail to
  * read even the initial checkpoint record, we use the REDO location instead
  * of the checkpoint location as the start position of XLOG streaming.
@@ -388,9 +389,6 @@ static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, Time
 static void EnableStandbyMode(void);
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
-static bool read_backup_label(XLogRecPtr *checkPointLoc,
-							  TimeLineID *backupLabelTLI,
-							  bool *backupEndRequired, bool *backupFromStandby);
 static bool read_tablespace_map(List **tablespaces);
 
 static void xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI);
@@ -492,8 +490,8 @@ EnableStandbyMode(void)
  * Prepare the system for WAL recovery, if needed.
  *
  * This is called by StartupXLOG() which coordinates the server startup
- * sequence.  This function analyzes the control file and the backup label
- * file, if any, and figures out whether we need to perform crash recovery or
+ * sequence.  This function analyzes the control file and backup recovery
+ * info, if any, and figures out whether we need to perform crash recovery or
  * archive recovery, and how far we need to replay the WAL to reach a
  * consistent state.
  *
@@ -510,7 +508,7 @@ EnableStandbyMode(void)
  */
 void
 InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
-				bool *haveBackupLabel_ptr, bool *haveTblspcMap_ptr)
+				bool *haveTblspcMap_ptr)
 {
 	XLogPageReadPrivate *private;
 	struct stat st;
@@ -518,7 +516,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	XLogRecord *record;
 	DBState		dbstate_at_startup;
 	bool		haveTblspcMap = false;
-	bool		haveBackupLabel = false;
+	bool		backupRecoveryRequired = false;
 	CheckPoint	checkPoint;
 	bool		backupFromStandby = false;
 
@@ -585,18 +583,34 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	primary_image_masked = (char *) palloc(BLCKSZ);
 
 	/*
-	 * Read the backup_label file.  We want to run this part of the recovery
-	 * process after checking for signal files and after performing validation
-	 * of the recovery parameters.
+	 * Load recovery settings from pg_control.  We want to run this part of the
+	 * recovery process after checking for signal files and after performing
+	 * validation of the recovery parameters.
 	 */
-	if (read_backup_label(&CheckPointLoc, &CheckPointTLI, &backupEndRequired,
-						  &backupFromStandby))
+	if (ControlFile->backupRecoveryRequired)
 	{
 		List	   *tablespaces = NIL;
 
+		/* Initialize recovery from fields stored in pg_control */
+		CheckPointLoc = ControlFile->backupCheckPoint;
+		CheckPointTLI = ControlFile->backupStartPointTLI;
+		RedoStartLSN = ControlFile->backupStartPoint;
+		RedoStartTLI = ControlFile->backupStartPointTLI;
+		backupEndRequired = ControlFile->backupEndRequired;
+		backupFromStandby = ControlFile->backupFromStandby;
+
+		/* Clear fields used to initialize recovery */
+		ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+		ControlFile->backupStartPointTLI = 0;
+		ControlFile->backupRecoveryRequired = false;
+		ControlFile->backupFromStandby = false;
+
+		/* Indicate that recovery was requested */
+		backupRecoveryRequired = true;
+
 		/*
-		 * Archive recovery was requested, and thanks to the backup label
-		 * file, we know how far we need to replay to reach consistency. Enter
+		 * Archive recovery was requested, and thanks to the recovery
+		 * info, we know how far we need to replay to reach consistency. Enter
 		 * archive recovery directly.
 		 */
 		InArchiveRecovery = true;
@@ -604,8 +618,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			EnableStandbyMode();
 
 		/*
-		 * When a backup_label file is present, we want to roll forward from
-		 * the checkpoint it identifies, rather than using pg_control.
+		 * When backup recovery is requested, we want to roll forward from
+		 * the checkpoint it identifies, rather than using the default
+		 * checkpoint.
 		 */
 		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc,
 									  CheckPointTLI);
@@ -620,9 +635,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 			/*
 			 * Make sure that REDO location exists. This may not be the case
-			 * if there was a crash during an online backup, which left a
-			 * backup_label around that references a WAL segment that's
-			 * already been archived.
+			 * if recovery.signal is missing and the WAL has already been
+			 * archived.
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
@@ -631,20 +645,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
-							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-									 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-									 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-									 DataDir, DataDir, DataDir, DataDir)));
+							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n",
+									 DataDir, DataDir)));
 			}
 		}
 		else
 		{
 			ereport(FATAL,
 					(errmsg("could not locate required checkpoint record"),
-					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-							 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-							 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-							 DataDir, DataDir, DataDir, DataDir)));
+					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n",
+							 DataDir, DataDir)));
 			wasShutdown = false;	/* keep compiler quiet */
 		}
 
@@ -679,37 +689,32 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			/* tell the caller to delete it later */
 			haveTblspcMap = true;
 		}
-
-		/* tell the caller to delete it later */
-		haveBackupLabel = true;
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
-
 		/*
-		 * If tablespace_map file is present without backup_label file, there
-		 * is no use of such file.  There is no harm in retaining it, but it
-		 * is better to get rid of the map file so that we don't have any
+		 * If tablespace_map file is present without backup recovery requested,
+		 * there is no use of such file.  There is no harm in retaining it, but
+		 * it is better to get rid of the map file so that we don't have any
 		 * redundant file in data directory and it will avoid any sort of
 		 * confusion.  It seems prudent though to just rename the file out of
 		 * the way rather than delete it completely, also we ignore any error
 		 * that occurs in rename operation as even if map file is present
-		 * without backup_label file, it is harmless.
+		 * without backup recovery requested, it is harmless.
 		 */
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
 			if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("File \"%s\" was renamed to \"%s\".",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 			else
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
@@ -943,7 +948,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * Any other state indicates that the backup somehow became corrupted
 		 * and we can't sensibly continue with recovery.
 		 */
-		if (haveBackupLabel)
+		if (backupRecoveryRequired)
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
@@ -953,7 +958,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY &&
 					dbstate_at_startup != DB_SHUTDOWNED_IN_RECOVERY)
 					ereport(FATAL,
-							(errmsg("backup_label contains data inconsistent with control file"),
+							(errmsg("pg_control contains inconsistent data for standby backup"),
 							 errhint("This means that the backup is corrupted and you will "
 									 "have to use another backup for recovery.")));
 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
@@ -983,7 +988,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	missingContrecPtr = InvalidXLogRecPtr;
 
 	*wasShutdown_ptr = wasShutdown;
-	*haveBackupLabel_ptr = haveBackupLabel;
 	*haveTblspcMap_ptr = haveTblspcMap;
 }
 
@@ -1156,154 +1160,6 @@ validateRecoveryParameters(void)
 	}
 }
 
-/*
- * read_backup_label: check to see if a backup_label file is present
- *
- * If we see a backup_label during recovery, we assume that we are recovering
- * from a backup dump file, and we therefore roll forward from the checkpoint
- * identified by the label file, NOT what pg_control says.  This avoids the
- * problem that pg_control might have been archived one or more checkpoints
- * later than the start of the dump, and so if we rely on it as the start
- * point, we will fail to restore a consistent database state.
- *
- * Returns true if a backup_label was found (and fills the checkpoint
- * location and TLI into *checkPointLoc and *backupLabelTLI, respectively);
- * returns false if not. If this backup_label came from a streamed backup,
- * *backupEndRequired is set to true. If this backup_label was created during
- * recovery, *backupFromStandby is set to true.
- *
- * Also sets the global variables RedoStartLSN and RedoStartTLI with the LSN
- * and TLI read from the backup file.
- */
-static bool
-read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
-				  bool *backupEndRequired, bool *backupFromStandby)
-{
-	char		startxlogfilename[MAXFNAMELEN];
-	TimeLineID	tli_from_walseg,
-				tli_from_file;
-	FILE	   *lfp;
-	char		ch;
-	char		backuptype[20];
-	char		backupfrom[20];
-	char		backuplabel[MAXPGPATH];
-	char		backuptime[128];
-	uint32		hi,
-				lo;
-
-	/* suppress possible uninitialized-variable warnings */
-	*checkPointLoc = InvalidXLogRecPtr;
-	*backupLabelTLI = 0;
-	*backupEndRequired = false;
-	*backupFromStandby = false;
-
-	/*
-	 * See if label file is present
-	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
-	if (!lfp)
-	{
-		if (errno != ENOENT)
-			ereport(FATAL,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
-		return false;			/* it's not there, all is fine */
-	}
-
-	/*
-	 * Read and parse the START WAL LOCATION and CHECKPOINT lines (this code
-	 * is pretty crude, but we are not expecting any variability in the file
-	 * format).
-	 */
-	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
-			   &hi, &lo, &tli_from_walseg, startxlogfilename, &ch) != 5 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	RedoStartLSN = ((uint64) hi) << 32 | lo;
-	RedoStartTLI = tli_from_walseg;
-	if (fscanf(lfp, "CHECKPOINT LOCATION: %X/%X%c",
-			   &hi, &lo, &ch) != 3 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	*checkPointLoc = ((uint64) hi) << 32 | lo;
-	*backupLabelTLI = tli_from_walseg;
-
-	/*
-	 * BACKUP METHOD lets us know if this was a typical backup ("streamed",
-	 * which could mean either pg_basebackup or the pg_backup_start/stop
-	 * method was used) or if this label came from somewhere else (the only
-	 * other option today being from pg_rewind).  If this was a streamed
-	 * backup then we know that we need to play through until we get to the
-	 * end of the WAL which was generated during the backup (at which point we
-	 * will have reached consistency and backupEndRequired will be reset to be
-	 * false).
-	 */
-	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
-	{
-		if (strcmp(backuptype, "streamed") == 0)
-			*backupEndRequired = true;
-	}
-
-	/*
-	 * BACKUP FROM lets us know if this was from a primary or a standby.  If
-	 * it was from a standby, we'll double-check that the control file state
-	 * matches that of a standby.
-	 */
-	if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)
-	{
-		if (strcmp(backupfrom, "standby") == 0)
-			*backupFromStandby = true;
-	}
-
-	/*
-	 * Parse START TIME and LABEL. Those are not mandatory fields for recovery
-	 * but checking for their presence is useful for debugging and the next
-	 * sanity checks. Cope also with the fact that the result buffers have a
-	 * pre-allocated size, hence if the backup_label file has been generated
-	 * with strings longer than the maximum assumed here an incorrect parsing
-	 * happens. That's fine as only minor consistency checks are done
-	 * afterwards.
-	 */
-	if (fscanf(lfp, "START TIME: %127[^\n]\n", backuptime) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup time %s in file \"%s\"",
-								 backuptime, BACKUP_LABEL_FILE)));
-
-	if (fscanf(lfp, "LABEL: %1023[^\n]\n", backuplabel) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup label %s in file \"%s\"",
-								 backuplabel, BACKUP_LABEL_FILE)));
-
-	/*
-	 * START TIMELINE is new as of 11. Its parsing is not mandatory, still use
-	 * it as a sanity check if present.
-	 */
-	if (fscanf(lfp, "START TIMELINE: %u\n", &tli_from_file) == 1)
-	{
-		if (tli_from_walseg != tli_from_file)
-			ereport(FATAL,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE),
-					 errdetail("Timeline ID parsed is %u, but expected %u.",
-							   tli_from_file, tli_from_walseg)));
-
-		ereport(DEBUG1,
-				(errmsg_internal("backup timeline %u in file \"%s\"",
-								 tli_from_file, BACKUP_LABEL_FILE)));
-	}
-
-	if (ferror(lfp) || FreeFile(lfp))
-		ereport(FATAL,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-
-	return true;
-}
-
 /*
  * read_tablespace_map: check to see if a tablespace_map file is present
  *
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f462197..01d09dbdd21 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -22,6 +22,7 @@
 #include "backup/basebackup.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
 #include "common/file_perm.h"
@@ -94,7 +95,7 @@ static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
 								 BlockNumber blkno,
 								 uint16 *expected_checksum);
 static void sendFileWithContent(bbsink *sink, const char *filename,
-								const char *content,
+								const char *content, int len,
 								backup_manifest_info *manifest);
 static int64 _tarWriteHeader(bbsink *sink, const char *filename,
 							 const char *linktarget, struct stat *statbuf,
@@ -192,10 +193,9 @@ static const struct exclude_list_item excludeFiles[] =
 	{RELCACHE_INIT_FILENAME, true},
 
 	/*
-	 * backup_label and tablespace_map should not exist in a running cluster
-	 * capable of doing an online backup, but exclude them just in case.
+	 * tablespace_map should not exist in a running cluster capable of doing
+	 * an online backup, but exclude it just in case.
 	 */
-	{BACKUP_LABEL_FILE, false},
 	{TABLESPACE_MAP, false},
 
 	/*
@@ -325,23 +325,15 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
-				char	   *backup_label;
 
 				bbsink_begin_archive(sink, "base.tar");
 
-				/* In the main tar, include the backup_label first... */
-				backup_label = build_backup_content(backup_state, false);
-				sendFileWithContent(sink, BACKUP_LABEL_FILE,
-									backup_label, &manifest);
-				pfree(backup_label);
-
-				/* Then the tablespace_map file, if required... */
+				/* Send the tablespace_map file, if required... */
 				if (opt->sendtblspcmapfile)
 				{
 					sendFileWithContent(sink, TABLESPACE_MAP,
-										tablespace_map->data, &manifest);
+										tablespace_map->data, -1, &manifest);
 					sendtblspclinks = false;
 				}
 
@@ -349,14 +341,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 				sendDir(sink, ".", 1, false, state.tablespaces,
 						sendtblspclinks, &manifest, InvalidOid);
 
-				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid, &manifest);
+				/* End the backup before sending pg_control */
+				basebackup_progress_wait_wal_archive(&state);
+				do_pg_backup_stop(backup_state, !opt->nowait);
+
+				/* Send copy of pg_control containing recovery info */
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+								    (char *)backup_state->controlFile,
+									PG_CONTROL_MAX_SAFE_SIZE, &manifest);
 			}
 			else
 			{
@@ -390,9 +382,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			}
 		}
 
-		basebackup_progress_wait_wal_archive(&state);
-		do_pg_backup_stop(backup_state, !opt->nowait);
-
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
@@ -601,7 +590,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			 * complete segment.
 			 */
 			StatusFilePath(pathbuf, walFileName, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/*
@@ -629,7 +618,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			/* unconditionally mark file as archived */
 			StatusFilePath(pathbuf, fname, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/* Properly terminate the tar file. */
@@ -1040,22 +1029,21 @@ SendBaseBackup(BaseBackupCmd *cmd)
  */
 static void
 sendFileWithContent(bbsink *sink, const char *filename, const char *content,
-					backup_manifest_info *manifest)
+					int len, backup_manifest_info *manifest)
 {
 	struct stat statbuf;
-	int			bytes_done = 0,
-				len;
+	int			bytes_done = 0;
 	pg_checksum_context checksum_ctx;
 
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 filename);
 
-	len = strlen(content);
+	if (len < 0)
+		len = strlen(content);
 
 	/*
-	 * Construct a stat struct for the backup_label file we're injecting in
-	 * the tar.
+	 * Construct a stat struct for the file we're injecting in the tar.
 	 */
 	/* Windows doesn't have the concept of uid and gid */
 #ifdef WIN32
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 35d738d5763..24bf34b45eb 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -384,13 +384,15 @@ BEGIN ATOMIC
 END;
 
 CREATE OR REPLACE FUNCTION
-  pg_backup_start(label text, fast boolean DEFAULT false)
-  RETURNS pg_lsn STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
+  pg_backup_start(label text, fast boolean DEFAULT false, OUT lsn pg_lsn,
+        OUT timeline_id int8, OUT start timestamptz)
+  RETURNS record STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
   PARALLEL RESTRICTED;
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
-        wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        wait_for_archive boolean DEFAULT true, OUT pg_control_file bytea,
+        OUT tablespace_map_file text, OUT lsn pg_lsn, OUT timeline_id int8,
+        OUT stop timestamptz)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b4..c655cb03352 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -171,8 +171,8 @@ SKIP:
 
 # Write some files to test that they are not copied.
 foreach my $filename (
-	qw(backup_label tablespace_map postgresql.auto.conf.tmp
-	current_logfiles.tmp global/pg_internal.init.123))
+	qw(tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp
+	   global/pg_internal.init.123))
 {
 	open my $file, '>>', "$pgdata/$filename";
 	print $file "DONOTCOPY";
@@ -261,14 +261,13 @@ foreach my $filename (@tempRelationFiles)
 		"base/$postgresOid/$filename not copied");
 }
 
-# Make sure existing backup_label was ignored.
-isnt(slurp_file("$tempdir/backup/backup_label"),
-	'DONOTCOPY', 'existing backup_label not copied');
+# Make sure existing tablespace_map was ignored.
+ok(!-f "$tempdir/backup/tablespace_map", 'tablespace_map not in backup');
 rmtree("$tempdir/backup");
 
-# Now delete the bogus backup_label file since it will interfere with startup
-unlink("$pgdata/backup_label")
-  or BAIL_OUT("unable to unlink $pgdata/backup_label");
+# Now delete the bogus tablespace_map file since it will interfere with startup
+unlink("$pgdata/tablespace_map")
+  or BAIL_OUT("unable to unlink $pgdata/tablespace_map");
 
 $node->command_ok(
 	[
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ecadd69dc53..213f4e71b88 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -139,11 +139,10 @@ static const struct exclude_list_item excludeFiles[] =
 	{"pg_internal.init", true}, /* defined as RELCACHE_INIT_FILENAME */
 
 	/*
-	 * If there is a backup_label or tablespace_map file, it indicates that a
-	 * recovery failed and this cluster probably can't be rewound, but exclude
-	 * them anyway if they are found.
+	 * If there is a tablespace_map file, it indicates that a recovery failed
+	 * and this cluster probably can't be rewound, but exclude it anyway if it
+	 * is found.
 	 */
-	{"backup_label", false},	/* defined as BACKUP_LABEL_FILE */
 	{"tablespace_map", false},	/* defined as TABLESPACE_MAP */
 
 	/*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index bfd44a284e2..f42782e2eab 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -39,9 +39,6 @@ static void perform_rewind(filemap_t *filemap, rewind_source *source,
 						   TimeLineID chkpttli,
 						   XLogRecPtr chkptredo);
 
-static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
-							  XLogRecPtr checkpointloc);
-
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
@@ -654,7 +651,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		pg_log_info("creating backup label and updating control file");
 
 	/*
-	 * Create a backup label file, to tell the target where to begin the WAL
+	 * Get recovery fields to tell the target where to begin the WAL
 	 * replay. Normally, from the last common checkpoint between the source
 	 * and the target. But if the source is a standby server, it's possible
 	 * that the last common checkpoint is *after* the standby's restartpoint.
@@ -672,7 +669,6 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		chkpttli = ControlFile_source.checkPointCopy.ThisTimeLineID;
 		chkptrec = ControlFile_source.checkPoint;
 	}
-	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
 	/*
 	 * Update control file of target, to tell the target how far it must
@@ -722,6 +718,12 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupRecoveryRequired = true;
+	ControlFile_new.backupFromStandby = true;
+	ControlFile_new.backupEndRequired = false;
+	ControlFile_new.backupCheckPoint = chkptrec;
+	ControlFile_new.backupStartPoint = chkptredo;
+	ControlFile_new.backupStartPointTLI = chkpttli;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
@@ -729,7 +731,10 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 static void
 sanityChecks(void)
 {
-	/* TODO Check that there's no backup_label in either cluster */
+	/*
+	 * TODO Check that neither cluster has backupRecoveryRequired set in
+	 * pg_control.
+	 */
 
 	/* Check system_identifier match */
 	if (ControlFile_target.system_identifier != ControlFile_source.system_identifier)
@@ -951,51 +956,6 @@ findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, int a_nentries,
 	}
 }
 
-
-/*
- * Create a backup_label file that forces recovery to begin at the last common
- * checkpoint.
- */
-static void
-createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpointloc)
-{
-	XLogSegNo	startsegno;
-	time_t		stamp_time;
-	char		strfbuf[128];
-	char		xlogfilename[MAXFNAMELEN];
-	struct tm  *tmp;
-	char		buf[1000];
-	int			len;
-
-	XLByteToSeg(startpoint, startsegno, WalSegSz);
-	XLogFileName(xlogfilename, starttli, startsegno, WalSegSz);
-
-	/*
-	 * Construct backup label file
-	 */
-	stamp_time = time(NULL);
-	tmp = localtime(&stamp_time);
-	strftime(strfbuf, sizeof(strfbuf), "%Y-%m-%d %H:%M:%S %Z", tmp);
-
-	len = snprintf(buf, sizeof(buf),
-				   "START WAL LOCATION: %X/%X (file %s)\n"
-				   "CHECKPOINT LOCATION: %X/%X\n"
-				   "BACKUP METHOD: pg_rewind\n"
-				   "BACKUP FROM: standby\n"
-				   "START TIME: %s\n",
-	/* omit LABEL: line */
-				   LSN_FORMAT_ARGS(startpoint), xlogfilename,
-				   LSN_FORMAT_ARGS(checkpointloc),
-				   strfbuf);
-	if (len >= sizeof(buf))
-		pg_fatal("backup label buffer too small");	/* shouldn't happen */
-
-	/* TODO: move old file out of the way, if any. */
-	open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
-	write_target_range(buf, 0, len);
-	close_target_file();
-}
-
 /*
  * Check CRC of control file
  */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164f..3aac6839a70 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -293,8 +293,6 @@ extern SessionBackupState get_backup_status(void);
 /* File path names (all relative to $PGDATA) */
 #define RECOVERY_SIGNAL_FILE	"recovery.signal"
 #define STANDBY_SIGNAL_FILE		"standby.signal"
-#define BACKUP_LABEL_FILE		"backup_label"
-#define BACKUP_LABEL_OLD		"backup_label.old"
 
 #define TABLESPACE_MAP			"tablespace_map"
 #define TABLESPACE_MAP_OLD		"tablespace_map.old"
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137b..b75411b7c3d 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -15,6 +15,7 @@
 #define XLOG_BACKUP_H
 
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "pgtime.h"
 
 /* Structure to hold backup state. */
@@ -33,9 +34,18 @@ typedef struct BackupState
 	XLogRecPtr	stoppoint;		/* backup stop WAL location */
 	TimeLineID	stoptli;		/* backup stop TLI */
 	pg_time_t	stoptime;		/* backup stop time */
+
+	/*
+	 * After pg_backup_stop() returns this field will contain a copy of
+	 * pg_control that should be stored with the backup. Fields have been
+	 * updated for recovery and the CRC has been recalculated. The buffer
+	 * is padded to PG_CONTROL_MAX_SAFE_SIZE so that pg_control is always
+	 * a consistent size but smaller (and hopefully easier to handle) than
+	 * PG_CONTROL_FILE_SIZE. Bytes after sizeof(ControlFileData) are zeroed.
+	 */
+	uint8_t controlFile[PG_CONTROL_MAX_SAFE_SIZE];
 } BackupState;
 
-extern char *build_backup_content(BackupState *state,
-								  bool ishistoryfile);
+extern char *build_backup_content(BackupState *state);
 
 #endif							/* XLOG_BACKUP_H */
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index ee0bc742782..981266f7340 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -80,8 +80,7 @@ extern Size XLogRecoveryShmemSize(void);
 extern void XLogRecoveryShmemInit(void);
 
 extern void InitWalRecovery(ControlFileData *ControlFile,
-							bool *wasShutdown_ptr, bool *haveBackupLabel_ptr,
-							bool *haveTblspcMap_ptr);
+							bool *wasShutdown_ptr, bool *haveTblspcMap_ptr);
 extern void PerformWalRecovery(void);
 
 /*
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 2ae72e3b266..64bab07e056 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * backupCheckPoint is the backup start checkpoint and is set to zero after
+	 * recovery is initialized.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -160,14 +163,25 @@ typedef struct ControlFileData
 	 * pg_control which was backed up last. It is reset to zero when the end
 	 * of backup is reached, and we mustn't start up before that.
 	 *
+	 * backupRecoveryRequired indicates that the pg_control file was provided
+	 * by a backup or pg_rewind and recovery settings need to be copied. It will
+	 * be set to false when the settings have been copied.
+	 *
+	 * backupFromStandby indicates that the backup was taken on a standby. It is
+	 * required to initialize recovery and set to false afterwards.
+	 *
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogRecPtr	backupCheckPoint;
 	XLogRecPtr	backupStartPoint;
+	TimeLineID	backupStartPointTLI;
 	XLogRecPtr	backupEndPoint;
+	bool 		backupRecoveryRequired;
+	bool 		backupFromStandby;
 	bool		backupEndRequired;
 
 	/*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 091f7e343c3..ff095e93505 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6413,13 +6413,17 @@
   prosrc => 'pg_terminate_backend' },
 { oid => '2172', descr => 'prepare for taking an online backup',
   proname => 'pg_backup_start', provolatile => 'v', proparallel => 'r',
-  prorettype => 'pg_lsn', proargtypes => 'text bool',
+  prorettype => 'record', proargtypes => 'text bool',
+  proallargtypes => '{text,bool,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,i,o,o,o}',
+  proargnames => '{label,fast,lsn,timeline_id,start}',
   prosrc => 'pg_backup_start' },
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,bytea,text,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{wait_for_archive,pg_control_file,tablespace_map_file,lsn,timeline_id,stop}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',

michael@paquier.xyz

about 2 years ago

In reply to: David Steele (#3)

Re: Add recovery to pg_control and remove backup_label

On Fri, Oct 27, 2023 at 10:10:42AM -0400, David Steele wrote:

We are still planning to address this issue in the back branches.

FWIW, redesigning the backend code in charge of doing base backups in
the back branches is out of scope. Based on a read of the proposed
patch, it includes catalog changes which would require a catversion
bump, so that's not going to work anyway.
--
Michael

michael@paquier.xyz

about 2 years ago

In reply to: David Steele (#6)

Re: Add recovery to pg_control and remove backup_label

On Sun, Nov 05, 2023 at 01:45:39PM -0400, David Steele wrote:

Rebased on 151ffcf6.

I like this patch a lot. Even if the backup_label file is removed, we
still have all the debug information from the backup history file,
thanks to its LABEL, BACKUP METHOD and BACKUP FROM, so no information
is lost. It does a 1:1 replacement of the contents parsed from the
backup_label needed by recovery by fetching them from the control
file. Sounds like a straightforward change to me.

The patch is failing the recovery test 039_end_of_wal.pl. Could you
look at the failure?

         /* Build and save the contents of the backup history file */
-        history_file = build_backup_content(state, true);
+        history_file = build_backup_content(state);

build_backup_content() sounds like an incorrect name if it is a
routine only used to build the contents of backup history files.

Why is there nothing updated in src/bin/pg_controldata/?

+        /* Clear fields used to initialize recovery */
+        ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+        ControlFile->backupStartPointTLI = 0;
+        ControlFile->backupRecoveryRequired = false;
+        ControlFile->backupFromStandby = false;

These control file fields are now cleared at the point where the
backup_label file used to be read, whereas backup_label was only
renamed to backup_label.old a bit later than that. Your logic looks
correct from here, but shouldn't these variables be cleared much
later, i.e. just *after* UpdateControlFile()? This gap between the
initialization of the control file and the in-memory reset makes the
code quite brittle, IMO.

- basebackup_progress_wait_wal_archive(&state);
- do_pg_backup_stop(backup_state, !opt->nowait);

Why is that moved?

-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
-    as well as the time at which <function>pg_backup_start</function> was run, and
-    the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
-    backup session the dump file came from.  The tablespace map file includes
+    The tablespace map file includes

It may be worth mentioning that the backup history file holds this
information on the primary's pg_wal, as well.

The changes in sendFileWithContent() may be worth a patch of its own.

--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
@@ -160,14 +163,25 @@ typedef struct ControlFileData
     XLogRecPtr    minRecoveryPoint;
     TimeLineID    minRecoveryPointTLI;
+    XLogRecPtr    backupCheckPoint;
     XLogRecPtr    backupStartPoint;
+    TimeLineID    backupStartPointTLI;
     XLogRecPtr    backupEndPoint;
+    bool         backupRecoveryRequired;
+    bool         backupFromStandby;

This increases the size of the control file from 296B to 312B with an
8-byte alignment, as far as I can see. The size of the control file
has always been a sensitive subject, especially with the hard limit of
PG_CONTROL_MAX_SAFE_SIZE. Well, the point of this patch is that this
is the price to pay to prevent users from doing something stupid with
a removal of a backup_label when they should not. Do others have an
opinion about this increase in size?

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce the size further with some alignment magic, no?

- /*
- * BACKUP METHOD lets us know if this was a typical backup ("streamed",
- * which could mean either pg_basebackup or the pg_backup_start/stop
- * method was used) or if this label came from somewhere else (the only
- * other option today being from pg_rewind). If this was a streamed
- * backup then we know that we need to play through until we get to the
- * end of the WAL which was generated during the backup (at which point we
- * will have reached consistency and backupEndRequired will be reset to be
- * false).
- */
- if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
- {
- if (strcmp(backuptype, "streamed") == 0)
- *backupEndRequired = true;
- }

backupRecoveryRequired in the control file is switched to false for
pg_rewind and true for streamed backups. My gut feeling is telling me
that this should be OK, as out-of-core tools would need an upgrade if
they relied on the backup_label file anyway. I can see that this
change makes us lose some documentation, unfortunately. Shouldn't
these removed lines be moved to pg_control.h instead for the
description of backupEndRequired?

doc/src/sgml/ref/pg_rewind.sgml and
src/backend/access/transam/xlogrecovery.c still include references to
the backup_label file.
--
Michael

david@pgmasters.net

about 2 years ago

In reply to: Michael Paquier (#7)

Re: Add recovery to pg_control and remove backup_label

On 11/6/23 01:05, Michael Paquier wrote:

On Fri, Oct 27, 2023 at 10:10:42AM -0400, David Steele wrote:

We are still planning to address this issue in the back branches.

FWIW, redesigning the backend code in charge of doing base backups in
the back branches is out of scope. Based on a read of the proposed
patch, it includes catalog changes which would require a catversion
bump, so that's not going to work anyway.

I did not mean this patch -- rather some variation of what Thomas has
been working on, more than likely.

Regards,
-David

#10

david@pgmasters.net

about 2 years ago

In reply to: Michael Paquier (#8)

1 attachment(s)

Re: Add recovery to pg_control and remove backup_label

On 11/6/23 02:35, Michael Paquier wrote:

On Sun, Nov 05, 2023 at 01:45:39PM -0400, David Steele wrote:

Rebased on 151ffcf6.

I like this patch a lot. Even if the backup_label file is removed, we
still have all the debug information from the backup history file,
thanks to its LABEL, BACKUP METHOD and BACKUP FROM, so no information
is lost. It does a 1:1 replacement of the contents parsed from the
backup_label needed by recovery by fetching them from the control
file. Sounds like a straightforward change to me.

That's the plan, at least!

The patch is failing the recovery test 039_end_of_wal.pl. Could you
look at the failure?

I'm not seeing this failure, and CI seems happy [1]. Can you give
details of the error message?

/* Build and save the contents of the backup history file */
-        history_file = build_backup_content(state, true);
+        history_file = build_backup_content(state);

build_backup_content() sounds like an incorrect name if it is a
routine only used to build the contents of backup history files.

Good point, I have renamed this to build_backup_history_content().

Why is there nothing updated in src/bin/pg_controldata/?

Oops, added.

+        /* Clear fields used to initialize recovery */
+        ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+        ControlFile->backupStartPointTLI = 0;
+        ControlFile->backupRecoveryRequired = false;
+        ControlFile->backupFromStandby = false;

These variables in the control file are cleaned up when the
backup_label file was read previously, but backup_label is renamed to
backup_label.old a bit later than that. Your logic looks correct seen
from here, but shouldn't these variables be set much later, i.e. just
*after* UpdateControlFile()? This gap between the initialization of
the control file and the in-memory reset makes the code quite brittle,
IMO.

If we set these fields where backup_label was renamed, the logic would
not be exactly the same since pg_control won't be updated until the next
time through the loop. Since the fields should be updated before
UpdateControlFile() I thought it made sense to keep all the updates
together.

Overall I think it is simpler, and we don't need to acquire a lock on
ControlFile.

- basebackup_progress_wait_wal_archive(&state);
- do_pg_backup_stop(backup_state, !opt->nowait);

Why is that moved?

do_pg_backup_stop() generates the updated pg_control so it needs to run
before we transmit pg_control.
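
As a rough illustration of why the copy has to be finalized before it is sent: the backup-specific fields are set on a private copy and the CRC is then recomputed over everything preceding the crc member, so any later change would invalidate the checksum. The sketch below uses a hypothetical DemoControl struct and a simple bitwise CRC-32C; it is not the patch's code, which operates on ControlFileData with the INIT_CRC32C/COMP_CRC32C/FIN_CRC32C macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the recovery fields; laid out without padding. */
typedef struct DemoControl
{
    uint64_t    backup_checkpoint;
    uint64_t    backup_start_point;
    uint32_t    backup_start_tli;
    uint8_t     backup_recovery_required;
    uint8_t     backup_from_standby;
    uint8_t     unused[2];
    uint32_t    crc;            /* covers all bytes before this member */
} DemoControl;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
demo_crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t    crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Copy the live data, set the backup fields, then stamp a fresh CRC. */
void
demo_prepare_backup_copy(DemoControl *copy, const DemoControl *live,
                         uint64_t checkpoint, uint64_t start_point,
                         uint32_t start_tli, bool from_standby)
{
    assert(copy != NULL && live != NULL);

    memcpy(copy, live, sizeof(DemoControl));
    copy->backup_recovery_required = 1;
    copy->backup_from_standby = from_standby ? 1 : 0;
    copy->backup_checkpoint = checkpoint;
    copy->backup_start_point = start_point;
    copy->backup_start_tli = start_tli;

    /* The CRC must be computed last, after every field is final. */
    copy->crc = demo_crc32c(copy, offsetof(DemoControl, crc));
}

/* A reader accepts the copy only if the stored CRC matches. */
bool
demo_control_is_valid(const DemoControl *ctl)
{
    return ctl->crc == demo_crc32c(ctl, offsetof(DemoControl, crc));
}
```

The live structure is never modified here; only the copy that goes into the backup carries the recovery fields.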

-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
-    as well as the time at which <function>pg_backup_start</function> was run, and
-    the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
-    backup session the dump file came from.  The tablespace map file includes
+    The tablespace map file includes

It may be worth mentioning that the backup history file holds this
information on the primary's pg_wal, as well.

OK, reworded.

The changes in sendFileWithContent() may be worth a patch of its own.

Thomas included this change in his pg_basebackup changes so I did the
same. Maybe wait a bit before we split this out? Seems like a pretty
small change...

--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
@@ -160,14 +163,25 @@ typedef struct ControlFileData
XLogRecPtr    minRecoveryPoint;
TimeLineID    minRecoveryPointTLI;
+    XLogRecPtr    backupCheckPoint;
XLogRecPtr    backupStartPoint;
+    TimeLineID    backupStartPointTLI;
XLogRecPtr    backupEndPoint;
+    bool         backupRecoveryRequired;
+    bool         backupFromStandby;

This increases the size of the control file from 296B to 312B with an
8-byte alignment, as far as I can see. The size of the control file
has always been a sensitive subject, especially with the hard limit of
PG_CONTROL_MAX_SAFE_SIZE. Well, the point of this patch is that this
is the price to pay to prevent users from doing something stupid with
a removal of a backup_label when they should not. Do others have an
opinion about this increase in size?

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce the size further with some alignment magic, no?

I thought about this, but it seemed to me that existing fields had been
positioned to make the grouping logical rather than to optimize
alignment, e.g. minRecoveryPointTLI. Ideally that would have been placed
near backupEndRequired (or vice versa). But if the general opinion is to
rearrange for alignment, I'm OK with that.
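
For anyone weighing the two layouts, the trade-off can be sketched with a reduced pair of structs. These are hypothetical and contain only the fields discussed here, not the full ControlFileData, so the sizes will not match the 296B/312B figures above. On a platform where uint64_t is 8-byte aligned I would expect the logically grouped layout to carry 8 more bytes of padding than the size-sorted one, but the exact numbers depend on the ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields in "logical" order, as in the patch: each TLI follows its LSN. */
typedef struct LogicalOrder
{
    uint64_t    min_recovery_point;
    uint32_t    min_recovery_point_tli;     /* padding likely follows */
    uint64_t    backup_checkpoint;
    uint64_t    backup_start_point;
    uint32_t    backup_start_point_tli;     /* padding likely follows */
    uint64_t    backup_end_point;
    bool        backup_recovery_required;
    bool        backup_from_standby;
} LogicalOrder;

/* Same fields sorted by size, so the two 4-byte TLIs share one 8-byte slot. */
typedef struct PackedOrder
{
    uint64_t    min_recovery_point;
    uint64_t    backup_checkpoint;
    uint64_t    backup_start_point;
    uint64_t    backup_end_point;
    uint32_t    min_recovery_point_tli;
    uint32_t    backup_start_point_tli;
    bool        backup_recovery_required;
    bool        backup_from_standby;
} PackedOrder;

/* Sorting by size should never make the struct larger. */
_Static_assert(sizeof(PackedOrder) <= sizeof(LogicalOrder),
               "size-sorted layout should not be larger");

/* Bytes of padding saved by grouping the TLIs (may be 0 on some ABIs). */
size_t
demo_padding_saved(void)
{
    return sizeof(LogicalOrder) - sizeof(PackedOrder);
}
```

Whether the saving is worth losing the logical grouping is the judgment call being discussed above.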

backupRecoveryRequired in the control file is switched to false for
pg_rewind and true for streamed backups. My gut feeling is telling me
that this should be OK, as out-of-core tools would need an upgrade if
they relied on the backup_label file anyway. I can see that this
change makes us lose some documentation, unfortunately. Shouldn't
these removed lines be moved to pg_control.h instead for the
description of backupEndRequired?

Updated description in pg_control.h -- it's a bit vague but not sure it
is a good idea to get into the inner workings of pg_rewind here?

doc/src/sgml/ref/pg_rewind.sgml and
src/backend/access/transam/xlogrecovery.c still include references to
the backup_label file.

Fixed.

Attached is a new patch based on 18b585155.

Regards,
-David

[1]: https://cirrus-ci.com/build/4939808120766464

Attachments:

v03-recovery-in-pgcontrol-remove-backuplabel.patch (text/plain; charset=UTF-8)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..584384875be 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      ready to archive.
     </para>
     <para>
-     <function>pg_backup_stop</function> will return one row with three
-     values. The second of these fields should be written to a file named
-     <filename>backup_label</filename> in the root directory of the backup. The
-     third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <function>pg_backup_stop</function> returns the
+     <filename>pg_control</filename> file, which must be stored in the
+     <filename>global</filename> directory of the backup. It also returns the
+     <filename>tablespace_map</filename> file, which should be written in the
+     root directory of the backup unless the field is empty. These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
     <para>
      Once the WAL segment files active during the backup are archived, you are
-     done.  The file identified by <function>pg_backup_stop</function>'s first return
+     done.  The file identified by <function>pg_backup_stop</function>'s
+     <parameter>lsn</parameter> return
      value is the last segment that is required to form a complete set of
      backup files.  On a primary, if <varname>archive_mode</varname> is enabled and the
      <literal>wait_for_archive</literal> parameter is <literal>true</literal>,
@@ -1013,7 +1014,15 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You must exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>pg_control_file</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This file contains the information
+    required to safely recover.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
@@ -1062,11 +1071,11 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
+    The backup history file (which is archived like WAL) includes the label
+    string you gave to <function>pg_backup_start</function>,
     as well as the time at which <function>pg_backup_start</function> was run, and
     the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
+    possible to look inside a backup history file and determine exactly which
     backup session the dump file came from.  The tablespace map file includes
     the symbolic link names as they exist in the directory
     <filename>pg_tblspc/</filename> and the full path of each symbolic link.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d963f0a0a00..ed3e5b9dce6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26845,7 +26845,10 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <parameter>label</parameter> <type>text</type>
           <optional>, <parameter>fast</parameter> <type>boolean</type>
           </optional> )
-        <returnvalue>pg_lsn</returnvalue>
+        <returnvalue>record</returnvalue>
+        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>start</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Prepares the server to begin an on-line backup.  The only required
@@ -26857,6 +26860,13 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         as possible.  This forces an immediate checkpoint which will cause a
         spike in I/O operations, slowing any concurrently executing queries.
        </para>
+       <para>
+        The result columns contain information about the start of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        starting write-ahead log location, the
+        <parameter>timeline_id</parameter> column holds the starting timeline,
+        and the <parameter>start</parameter> column holds the starting timestamp.
+       </para>
        <para>
         This function is restricted to superusers by default, but other users
         can be granted EXECUTE to run the function.
@@ -26872,13 +26882,15 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <optional><parameter>wait_for_archive</parameter> <type>boolean</type>
           </optional> )
         <returnvalue>record</returnvalue>
-        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
-        <parameter>labelfile</parameter> <type>text</type>,
-        <parameter>spcmapfile</parameter> <type>text</type> )
+        ( <parameter>pg_control_file</parameter> <type>text</type>,
+        <parameter>tablespace_map_file</parameter> <type>text</type>,
+        <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>stop</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Finishes performing an on-line backup.  The desired contents of the
-        backup label file and the tablespace map file are returned as part of
+        pg_control file and the tablespace map file are returned as part of
         the result of the function and must be written to files in the
         backup area.  These files must not be written to the live data directory
         (doing so will cause PostgreSQL to fail to restart in the event of a
@@ -26910,13 +26922,16 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         backup.
        </para>
        <para>
-        The result of the function is a single record.
-        The <parameter>lsn</parameter> column holds the backup's ending
-        write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
-        stored as part of the backup and are required as part of the restore
-        process.
+        The result of the function is a single record. The first column returns
+        the contents of the <filename>pg_control</filename> file and the
+        second column returns the contents of the
+        <filename>tablespace_map</filename> file.  These must be stored as part
+        of the backup and are required as part of the restore process. The
+        remainder of the columns contain information about the end of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        ending write-ahead log location, the <parameter>timeline_id</parameter>
+        column holds the ending timeline, and the <parameter>stop</parameter>
+        column holds the ending timestamp.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 8e0000d39fb..889add4c5e4 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -400,7 +400,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
       <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
       are omitted from the data copied from the source cluster. The files
-      <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
       <filename>postmaster.opts</filename>, and
@@ -410,7 +409,7 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
     </step>
     <step>
      <para>
-      Create a <filename>backup_label</filename> file to begin WAL replay at
+      Update <filename>pg_control</filename> file to begin WAL replay at
       the checkpoint created at failover and configure the
       <filename>pg_control</filename> file with a minimum consistency LSN
       defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec2..34311cfc2b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -74,6 +74,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_crc32c.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
@@ -5116,7 +5117,6 @@ StartupXLOG(void)
 	bool		wasShutdown;
 	bool		didCrash;
 	bool		haveTblspcMap;
-	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
 	TimeLineID	EndOfLogTLI;
 	TimeLineID	newTLI;
@@ -5240,13 +5240,14 @@ StartupXLOG(void)
 	/*
 	 * Prepare for WAL recovery if needed.
 	 *
-	 * InitWalRecovery analyzes the control file and the backup label file, if
-	 * any.  It updates the in-memory ControlFile buffer according to the
-	 * starting checkpoint, and sets InRecovery and ArchiveRecoveryRequested.
+	 * InitWalRecovery analyzes the control file and checks if backup recovery
+	 * has been requested.  It updates the in-memory ControlFile buffer
+	 * according to the starting checkpoint, and sets InRecovery and
+	 * ArchiveRecoveryRequested.
+	 *
 	 * It also applies the tablespace map file, if any.
 	 */
-	InitWalRecovery(ControlFile, &wasShutdown,
-					&haveBackupLabel, &haveTblspcMap);
+	InitWalRecovery(ControlFile, &wasShutdown, &haveTblspcMap);
 	checkPoint = ControlFile->checkPointCopy;
 
 	/* initialize shared memory variables from the checkpoint record */
@@ -5389,20 +5390,6 @@ StartupXLOG(void)
 		 */
 		UpdateControlFile();
 
-		/*
-		 * If there was a backup label file, it's done its job and the info
-		 * has now been propagated into pg_control.  We must get rid of the
-		 * label file so that if we crash during recovery, we'll pick up at
-		 * the latest recovery restartpoint instead of going all the way back
-		 * to the backup start point.  It seems prudent though to just rename
-		 * the file out of the way rather than delete it completely.
-		 */
-		if (haveBackupLabel)
-		{
-			unlink(BACKUP_LABEL_OLD);
-			durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
-		}
-
 		/*
 		 * If there was a tablespace_map file, it's done its job and the
 		 * symlinks have been created.  We must get rid of the map file so
@@ -5552,10 +5539,8 @@ StartupXLOG(void)
 	 * (at which point we reset backupStartPoint to be Invalid), for
 	 * backup-from-replica (which can't inject records into the WAL stream),
 	 * that point is when we reach the minRecoveryPoint in pg_control (which
-	 * we purposefully copy last when backing up from a replica).  For
-	 * pg_rewind (which creates a backup_label with a method of "pg_rewind")
-	 * or snapshot-style backups (which don't), backupEndRequired will be set
-	 * to false.
+	 * we purposefully copy last when backing up).  For pg_rewind or
+	 * snapshot-style backups, backupEndRequired will be set to false.
 	 *
 	 * Note: it is indeed okay to look at the local variable
 	 * LocalMinRecoveryPoint here, even though ControlFile->minRecoveryPoint
@@ -8725,11 +8710,33 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	ControlFileData *controlFileCopy = (ControlFileData *)state->controlFile;
 
 	Assert(state != NULL);
 
 	backup_stopped_in_recovery = RecoveryInProgress();
 
+	/*
+	 * Create a copy of control data and update it with fields required for
+	 * recovery. Also recalculate the CRC.
+	 */
+	memset(controlFileCopy, 0, PG_CONTROL_MAX_SAFE_SIZE);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFileCopy, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlFileCopy->backupRecoveryRequired = true;
+	controlFileCopy->backupFromStandby = backup_stopped_in_recovery;
+	controlFileCopy->backupEndRequired = true;
+	controlFileCopy->backupCheckPoint = state->checkpointloc;
+	controlFileCopy->backupStartPoint = state->startpoint;
+	controlFileCopy->backupStartPointTLI = state->starttli;
+
+	INIT_CRC32C(controlFileCopy->crc);
+	COMP_CRC32C(controlFileCopy->crc, controlFileCopy, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlFileCopy->crc);
+
 	/*
 	 * During recovery, we don't need to check WAL level. Because, if WAL
 	 * level is not sufficient, it's impossible to get here during recovery.
@@ -8831,11 +8838,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							 "Enable full_page_writes and run CHECKPOINT on the primary, "
 							 "and then try an online backup again.")));
 
-
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		state->stoppoint = ControlFile->minRecoveryPoint;
-		state->stoptli = ControlFile->minRecoveryPointTLI;
-		LWLockRelease(ControlFileLock);
+		state->stoppoint = controlFileCopy->minRecoveryPoint;
+		state->stoptli = controlFileCopy->minRecoveryPointTLI;
 	}
 	else
 	{
@@ -8877,7 +8881,7 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							histfilepath)));
 
 		/* Build and save the contents of the backup history file */
-		history_file = build_backup_content(state, true);
+		history_file = build_backup_history_content(state);
 		fprintf(fp, "%s", history_file);
 		pfree(history_file);
 
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae1..22c95f3c4c9 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -18,19 +18,19 @@
 #include "access/xlogbackup.h"
 
 /*
- * Build contents for backup_label or backup history file.
- *
- * When ishistoryfile is true, it creates the contents for a backup history
- * file, otherwise it creates contents for a backup_label file.
+ * Build contents for backup history file.
  *
  * Returns the result generated as a palloc'd string.
  */
 char *
-build_backup_content(BackupState *state, bool ishistoryfile)
+build_backup_history_content(BackupState *state)
 {
 	char		startstrbuf[128];
+	char		stopstrfbuf[128];
 	char		startxlogfile[MAXFNAMELEN]; /* backup start WAL file */
+	char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
 	XLogSegNo	startsegno;
+	XLogSegNo	stopsegno;
 	StringInfo	result = makeStringInfo();
 	char	   *data;
 
@@ -45,16 +45,10 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "START WAL LOCATION: %X/%X (file %s)\n",
 					 LSN_FORMAT_ARGS(state->startpoint), startxlogfile);
 
-	if (ishistoryfile)
-	{
-		char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
-		XLogSegNo	stopsegno;
-
-		XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
-		XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
-		appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
-						 LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
-	}
+	XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
+	XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
+	appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
+						LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
 
 	appendStringInfo(result, "CHECKPOINT LOCATION: %X/%X\n",
 					 LSN_FORMAT_ARGS(state->checkpointloc));
@@ -65,17 +59,12 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "LABEL: %s\n", state->name);
 	appendStringInfo(result, "START TIMELINE: %u\n", state->starttli);
 
-	if (ishistoryfile)
-	{
-		char		stopstrfbuf[128];
-
-		/* Use the log timezone here, not the session timezone */
-		pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
-					pg_localtime(&state->stoptime, log_timezone));
+	/* Use the log timezone here, not the session timezone */
+	pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
+				pg_localtime(&state->stoptime, log_timezone));
 
-		appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
-		appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
-	}
+	appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
+	appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
 
 	data = result->data;
 	pfree(result);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 45a70668b1c..2388a60a5e5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -53,7 +53,7 @@ static MemoryContext backupcontext = NULL;
  * pg_backup_start: set up for taking an on-line backup dump
  *
  * Essentially what this does is to create the contents required for the
- * backup_label file and the tablespace map.
+ * tablespace map.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -61,6 +61,10 @@ static MemoryContext backupcontext = NULL;
 Datum
 pg_backup_start(PG_FUNCTION_ARGS)
 {
+#define PG_BACKUP_START_V2_COLS 3
+	TupleDesc	tupdesc;
+	Datum		values[PG_BACKUP_START_V2_COLS] = {0};
+	bool		nulls[PG_BACKUP_START_V2_COLS] = {0};
 	text	   *backupid = PG_GETARG_TEXT_PP(0);
 	bool		fast = PG_GETARG_BOOL(1);
 	char	   *backupidstr;
@@ -69,6 +73,10 @@ pg_backup_start(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
+	/* Initialize attributes information in the tuple descriptor */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
 	if (status == SESSION_BACKUP_RUNNING)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -102,7 +110,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 	register_persistent_abort_backup_handler();
 	do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);
 
-	PG_RETURN_LSN(backup_state->startpoint);
+	values[0] = LSNGetDatum(backup_state->startpoint);
+	values[1] = Int64GetDatum(backup_state->starttli);
+	values[2] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->starttime));
+
+	/* Returns the record as Datum */
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
 
@@ -113,14 +126,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
  * allows the user to choose if they want to wait for the WAL to be archived
  * or if we should just return as soon as the WAL record is written.
  *
- * This function stops an in-progress backup, creates backup_label contents and
- * it returns the backup stop LSN, backup_label and tablespace_map contents.
+ * This function stops an in-progress backup and returns the backup stop LSN,
+ * pg_control and tablespace_map contents.
  *
- * The backup_label contains the user-supplied label string (typically this
- * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * The pg_control file contains the recovery information for the backup.  It is
+ * the caller's responsibility to write the pg_control and tablespace_map files
+ * in the data folder that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -128,12 +139,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 5
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
-	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -152,15 +163,16 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Stop the backup */
 	do_pg_backup_stop(backup_state, waitforarchive);
 
-	/* Build the contents of backup_label */
-	backup_label = build_backup_content(backup_state, false);
-
-	values[0] = LSNGetDatum(backup_state->stoppoint);
-	values[1] = CStringGetTextDatum(backup_label);
-	values[2] = CStringGetTextDatum(tablespace_map->data);
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_MAX_SAFE_SIZE);
 
-	/* Deallocate backup-related variables */
-	pfree(backup_label);
+	values[0] = PointerGetDatum(pg_control_bytea);
+	values[1] = CStringGetTextDatum(tablespace_map->data);
+	values[2] = LSNGetDatum(backup_state->stoppoint);
+	values[3] = Int64GetDatum(backup_state->stoptli);
+	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->stoptime));
 
 	/* Clean up the session-level state and its memory context */
 	backup_state = NULL;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666aa..f43ea39f963 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -6,7 +6,7 @@
  * This source file contains functions controlling WAL recovery.
  * InitWalRecovery() initializes the system for crash or archive recovery,
  * or standby mode, depending on configuration options and the state of
- * the control file and possible backup label file.  PerformWalRecovery()
+ * the control file and possible backup recovery.  PerformWalRecovery()
  * performs the actual WAL replay, calling the rmgr-specific redo routines.
  * FinishWalRecovery() performs end-of-recovery checks and cleanup actions,
  * and prepares information needed to initialize the WAL for writes.  In
@@ -152,11 +152,12 @@ static bool recovery_signal_file_found = false;
 
 /*
  * CheckPointLoc is the position of the checkpoint record that determines
- * where to start the replay.  It comes from the backup label file or the
- * control file.
+ * where to start the replay.  It comes from the control file, either from the
+ * default location or from a backup recovery field.
  *
- * RedoStartLSN is the checkpoint's REDO location, also from the backup label
- * file or the control file.  In standby mode, XLOG streaming usually starts
+ * RedoStartLSN is the checkpoint's REDO location, also from the default
+ * control file location or from a backup recovery field.  In standby mode,
+ * XLOG streaming usually starts
  * from the position where an invalid record was found.  But if we fail to
  * read even the initial checkpoint record, we use the REDO location instead
  * of the checkpoint location as the start position of XLOG streaming.
@@ -388,9 +389,6 @@ static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, Time
 static void EnableStandbyMode(void);
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
-static bool read_backup_label(XLogRecPtr *checkPointLoc,
-							  TimeLineID *backupLabelTLI,
-							  bool *backupEndRequired, bool *backupFromStandby);
 static bool read_tablespace_map(List **tablespaces);
 
 static void xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI);
@@ -492,8 +490,8 @@ EnableStandbyMode(void)
  * Prepare the system for WAL recovery, if needed.
  *
  * This is called by StartupXLOG() which coordinates the server startup
- * sequence.  This function analyzes the control file and the backup label
- * file, if any, and figures out whether we need to perform crash recovery or
+ * sequence.  This function analyzes the control file and backup recovery
+ * info, if any, and figures out whether we need to perform crash recovery or
  * archive recovery, and how far we need to replay the WAL to reach a
  * consistent state.
  *
@@ -510,7 +508,7 @@ EnableStandbyMode(void)
  */
 void
 InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
-				bool *haveBackupLabel_ptr, bool *haveTblspcMap_ptr)
+				bool *haveTblspcMap_ptr)
 {
 	XLogPageReadPrivate *private;
 	struct stat st;
@@ -518,7 +516,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	XLogRecord *record;
 	DBState		dbstate_at_startup;
 	bool		haveTblspcMap = false;
-	bool		haveBackupLabel = false;
+	bool		backupRecoveryRequired = false;
 	CheckPoint	checkPoint;
 	bool		backupFromStandby = false;
 
@@ -549,7 +547,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 	/*
 	 * Set the WAL reading processor now, as it will be needed when reading
-	 * the checkpoint record required (backup_label or not).
+	 * the checkpoint record required (backup recovery required or not).
 	 */
 	private = palloc0(sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -585,18 +583,34 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	primary_image_masked = (char *) palloc(BLCKSZ);
 
 	/*
-	 * Read the backup_label file.  We want to run this part of the recovery
-	 * process after checking for signal files and after performing validation
-	 * of the recovery parameters.
+	 * Load recovery settings from pg_control.  We want to run this part of the
+	 * recovery process after checking for signal files and after performing
+	 * validation of the recovery parameters.
 	 */
-	if (read_backup_label(&CheckPointLoc, &CheckPointTLI, &backupEndRequired,
-						  &backupFromStandby))
+	if (ControlFile->backupRecoveryRequired)
 	{
 		List	   *tablespaces = NIL;
 
+		/* Initialize recovery from fields stored in pg_control */
+		CheckPointLoc = ControlFile->backupCheckPoint;
+		CheckPointTLI = ControlFile->backupStartPointTLI;
+		RedoStartLSN = ControlFile->backupStartPoint;
+		RedoStartTLI = ControlFile->backupStartPointTLI;
+		backupEndRequired = ControlFile->backupEndRequired;
+		backupFromStandby = ControlFile->backupFromStandby;
+
+		/* Clear fields used to initialize recovery */
+		ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+		ControlFile->backupStartPointTLI = 0;
+		ControlFile->backupRecoveryRequired = false;
+		ControlFile->backupFromStandby = false;
+
+		/* Indicate that recovery was requested */
+		backupRecoveryRequired = true;
+
 		/*
-		 * Archive recovery was requested, and thanks to the backup label
-		 * file, we know how far we need to replay to reach consistency. Enter
+		 * Archive recovery was requested, and thanks to the recovery
+		 * info, we know how far we need to replay to reach consistency. Enter
 		 * archive recovery directly.
 		 */
 		InArchiveRecovery = true;
@@ -604,8 +618,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			EnableStandbyMode();
 
 		/*
-		 * When a backup_label file is present, we want to roll forward from
-		 * the checkpoint it identifies, rather than using pg_control.
+		 * When backup recovery is requested, we want to roll forward from
+		 * the checkpoint it identifies, rather than using the default
+		 * checkpoint.
 		 */
 		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc,
 									  CheckPointTLI);
@@ -620,9 +635,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 			/*
 			 * Make sure that REDO location exists. This may not be the case
-			 * if there was a crash during an online backup, which left a
-			 * backup_label around that references a WAL segment that's
-			 * already been archived.
+			 * if recovery.signal is missing and the WAL has already been
+			 * archived.
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
@@ -631,20 +645,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
-							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-									 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-									 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-									 DataDir, DataDir, DataDir, DataDir)));
+							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+									 DataDir, DataDir)));
 			}
 		}
 		else
 		{
 			ereport(FATAL,
 					(errmsg("could not locate required checkpoint record"),
-					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-							 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-							 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-							 DataDir, DataDir, DataDir, DataDir)));
+					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+							 DataDir, DataDir)));
 			wasShutdown = false;	/* keep compiler quiet */
 		}
 
@@ -679,37 +689,32 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			/* tell the caller to delete it later */
 			haveTblspcMap = true;
 		}
-
-		/* tell the caller to delete it later */
-		haveBackupLabel = true;
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
-
 		/*
-		 * If tablespace_map file is present without backup_label file, there
-		 * is no use of such file.  There is no harm in retaining it, but it
-		 * is better to get rid of the map file so that we don't have any
+		 * If tablespace_map file is present without backup recovery requested,
+		 * there is no use of such file.  There is no harm in retaining it, but
+		 * it is better to get rid of the map file so that we don't have any
 		 * redundant file in data directory and it will avoid any sort of
 		 * confusion.  It seems prudent though to just rename the file out of
 		 * the way rather than delete it completely, also we ignore any error
 		 * that occurs in rename operation as even if map file is present
-		 * without backup_label file, it is harmless.
+		 * without backup recovery requested, it is harmless.
 		 */
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
 			if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("File \"%s\" was renamed to \"%s\".",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 			else
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
@@ -943,7 +948,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * Any other state indicates that the backup somehow became corrupted
 		 * and we can't sensibly continue with recovery.
 		 */
-		if (haveBackupLabel)
+		if (backupRecoveryRequired)
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
@@ -953,7 +958,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY &&
 					dbstate_at_startup != DB_SHUTDOWNED_IN_RECOVERY)
 					ereport(FATAL,
-							(errmsg("backup_label contains data inconsistent with control file"),
+							(errmsg("pg_control contains inconsistent data for standby backup"),
 							 errhint("This means that the backup is corrupted and you will "
 									 "have to use another backup for recovery.")));
 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
@@ -983,7 +988,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	missingContrecPtr = InvalidXLogRecPtr;
 
 	*wasShutdown_ptr = wasShutdown;
-	*haveBackupLabel_ptr = haveBackupLabel;
 	*haveTblspcMap_ptr = haveTblspcMap;
 }
 
@@ -1156,154 +1160,6 @@ validateRecoveryParameters(void)
 	}
 }
 
-/*
- * read_backup_label: check to see if a backup_label file is present
- *
- * If we see a backup_label during recovery, we assume that we are recovering
- * from a backup dump file, and we therefore roll forward from the checkpoint
- * identified by the label file, NOT what pg_control says.  This avoids the
- * problem that pg_control might have been archived one or more checkpoints
- * later than the start of the dump, and so if we rely on it as the start
- * point, we will fail to restore a consistent database state.
- *
- * Returns true if a backup_label was found (and fills the checkpoint
- * location and TLI into *checkPointLoc and *backupLabelTLI, respectively);
- * returns false if not. If this backup_label came from a streamed backup,
- * *backupEndRequired is set to true. If this backup_label was created during
- * recovery, *backupFromStandby is set to true.
- *
- * Also sets the global variables RedoStartLSN and RedoStartTLI with the LSN
- * and TLI read from the backup file.
- */
-static bool
-read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
-				  bool *backupEndRequired, bool *backupFromStandby)
-{
-	char		startxlogfilename[MAXFNAMELEN];
-	TimeLineID	tli_from_walseg,
-				tli_from_file;
-	FILE	   *lfp;
-	char		ch;
-	char		backuptype[20];
-	char		backupfrom[20];
-	char		backuplabel[MAXPGPATH];
-	char		backuptime[128];
-	uint32		hi,
-				lo;
-
-	/* suppress possible uninitialized-variable warnings */
-	*checkPointLoc = InvalidXLogRecPtr;
-	*backupLabelTLI = 0;
-	*backupEndRequired = false;
-	*backupFromStandby = false;
-
-	/*
-	 * See if label file is present
-	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
-	if (!lfp)
-	{
-		if (errno != ENOENT)
-			ereport(FATAL,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
-		return false;			/* it's not there, all is fine */
-	}
-
-	/*
-	 * Read and parse the START WAL LOCATION and CHECKPOINT lines (this code
-	 * is pretty crude, but we are not expecting any variability in the file
-	 * format).
-	 */
-	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
-			   &hi, &lo, &tli_from_walseg, startxlogfilename, &ch) != 5 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	RedoStartLSN = ((uint64) hi) << 32 | lo;
-	RedoStartTLI = tli_from_walseg;
-	if (fscanf(lfp, "CHECKPOINT LOCATION: %X/%X%c",
-			   &hi, &lo, &ch) != 3 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	*checkPointLoc = ((uint64) hi) << 32 | lo;
-	*backupLabelTLI = tli_from_walseg;
-
-	/*
-	 * BACKUP METHOD lets us know if this was a typical backup ("streamed",
-	 * which could mean either pg_basebackup or the pg_backup_start/stop
-	 * method was used) or if this label came from somewhere else (the only
-	 * other option today being from pg_rewind).  If this was a streamed
-	 * backup then we know that we need to play through until we get to the
-	 * end of the WAL which was generated during the backup (at which point we
-	 * will have reached consistency and backupEndRequired will be reset to be
-	 * false).
-	 */
-	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
-	{
-		if (strcmp(backuptype, "streamed") == 0)
-			*backupEndRequired = true;
-	}
-
-	/*
-	 * BACKUP FROM lets us know if this was from a primary or a standby.  If
-	 * it was from a standby, we'll double-check that the control file state
-	 * matches that of a standby.
-	 */
-	if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)
-	{
-		if (strcmp(backupfrom, "standby") == 0)
-			*backupFromStandby = true;
-	}
-
-	/*
-	 * Parse START TIME and LABEL. Those are not mandatory fields for recovery
-	 * but checking for their presence is useful for debugging and the next
-	 * sanity checks. Cope also with the fact that the result buffers have a
-	 * pre-allocated size, hence if the backup_label file has been generated
-	 * with strings longer than the maximum assumed here an incorrect parsing
-	 * happens. That's fine as only minor consistency checks are done
-	 * afterwards.
-	 */
-	if (fscanf(lfp, "START TIME: %127[^\n]\n", backuptime) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup time %s in file \"%s\"",
-								 backuptime, BACKUP_LABEL_FILE)));
-
-	if (fscanf(lfp, "LABEL: %1023[^\n]\n", backuplabel) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup label %s in file \"%s\"",
-								 backuplabel, BACKUP_LABEL_FILE)));
-
-	/*
-	 * START TIMELINE is new as of 11. Its parsing is not mandatory, still use
-	 * it as a sanity check if present.
-	 */
-	if (fscanf(lfp, "START TIMELINE: %u\n", &tli_from_file) == 1)
-	{
-		if (tli_from_walseg != tli_from_file)
-			ereport(FATAL,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE),
-					 errdetail("Timeline ID parsed is %u, but expected %u.",
-							   tli_from_file, tli_from_walseg)));
-
-		ereport(DEBUG1,
-				(errmsg_internal("backup timeline %u in file \"%s\"",
-								 tli_from_file, BACKUP_LABEL_FILE)));
-	}
-
-	if (ferror(lfp) || FreeFile(lfp))
-		ereport(FATAL,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-
-	return true;
-}
-
 /*
  * read_tablespace_map: check to see if a tablespace_map file is present
  *
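The InitWalRecovery() hunks above replace parsing of backup_label with a copy-then-clear of fields in pg_control. A minimal, self-contained sketch of that step follows. `MockControlFile`, `RecoveryStart`, and `init_backup_recovery()` are hypothetical names made up for illustration, not part of the patch, and the integer typedefs only approximate the server's `XLogRecPtr` and `TimeLineID`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the server types (assumption for this sketch). */
typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/* Hypothetical subset of ControlFileData holding the backup fields. */
typedef struct MockControlFile
{
	XLogRecPtr	backupCheckPoint;
	XLogRecPtr	backupStartPoint;
	TimeLineID	backupStartPointTLI;
	bool		backupRecoveryRequired;
	bool		backupFromStandby;
	bool		backupEndRequired;
} MockControlFile;

/* Hypothetical holder for the values recovery starts from. */
typedef struct RecoveryStart
{
	XLogRecPtr	checkPointLoc;
	TimeLineID	checkPointTLI;
	XLogRecPtr	redoStartLSN;
	TimeLineID	redoStartTLI;
	bool		backupEndRequired;
	bool		backupFromStandby;
} RecoveryStart;

/*
 * Copy the recovery fields out of the control file, then clear the ones that
 * are only needed to initialize recovery.  Returns false, leaving *out
 * untouched, when backup recovery was not requested.
 */
bool
init_backup_recovery(MockControlFile *cf, RecoveryStart *out)
{
	assert(cf != NULL && out != NULL);

	if (!cf->backupRecoveryRequired)
		return false;

	out->checkPointLoc = cf->backupCheckPoint;
	out->checkPointTLI = cf->backupStartPointTLI;
	out->redoStartLSN = cf->backupStartPoint;
	out->redoStartTLI = cf->backupStartPointTLI;
	out->backupEndRequired = cf->backupEndRequired;
	out->backupFromStandby = cf->backupFromStandby;

	/*
	 * backupStartPoint and backupEndRequired are left alone here; the patch
	 * assigns them again further down in InitWalRecovery().
	 */
	cf->backupCheckPoint = InvalidXLogRecPtr;
	cf->backupStartPointTLI = 0;
	cf->backupRecoveryRequired = false;
	cf->backupFromStandby = false;

	return true;
}
```

Because backupRecoveryRequired is cleared after the copy, a second call on the same structure returns false. This mirrors the pg_control.h comment in the patch, which says the flag is set to false once the settings have been copied.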
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f462197..01d09dbdd21 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -22,6 +22,7 @@
 #include "backup/basebackup.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
 #include "common/file_perm.h"
@@ -94,7 +95,7 @@ static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
 								 BlockNumber blkno,
 								 uint16 *expected_checksum);
 static void sendFileWithContent(bbsink *sink, const char *filename,
-								const char *content,
+								const char *content, int len,
 								backup_manifest_info *manifest);
 static int64 _tarWriteHeader(bbsink *sink, const char *filename,
 							 const char *linktarget, struct stat *statbuf,
@@ -192,10 +193,9 @@ static const struct exclude_list_item excludeFiles[] =
 	{RELCACHE_INIT_FILENAME, true},
 
 	/*
-	 * backup_label and tablespace_map should not exist in a running cluster
-	 * capable of doing an online backup, but exclude them just in case.
+	 * tablespace_map should not exist in a running cluster capable of doing
+	 * an online backup, but exclude it just in case.
 	 */
-	{BACKUP_LABEL_FILE, false},
 	{TABLESPACE_MAP, false},
 
 	/*
@@ -325,23 +325,15 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
-				char	   *backup_label;
 
 				bbsink_begin_archive(sink, "base.tar");
 
-				/* In the main tar, include the backup_label first... */
-				backup_label = build_backup_content(backup_state, false);
-				sendFileWithContent(sink, BACKUP_LABEL_FILE,
-									backup_label, &manifest);
-				pfree(backup_label);
-
-				/* Then the tablespace_map file, if required... */
+				/* Send the tablespace_map file, if required... */
 				if (opt->sendtblspcmapfile)
 				{
 					sendFileWithContent(sink, TABLESPACE_MAP,
-										tablespace_map->data, &manifest);
+										tablespace_map->data, -1, &manifest);
 					sendtblspclinks = false;
 				}
 
@@ -349,14 +341,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 				sendDir(sink, ".", 1, false, state.tablespaces,
 						sendtblspclinks, &manifest, InvalidOid);
 
-				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid, &manifest);
+				/* End the backup before sending pg_control */
+				basebackup_progress_wait_wal_archive(&state);
+				do_pg_backup_stop(backup_state, !opt->nowait);
+
+				/* Send copy of pg_control containing recovery info */
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *) backup_state->controlFile,
+									PG_CONTROL_MAX_SAFE_SIZE, &manifest);
 			}
 			else
 			{
@@ -390,9 +382,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			}
 		}
 
-		basebackup_progress_wait_wal_archive(&state);
-		do_pg_backup_stop(backup_state, !opt->nowait);
-
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
@@ -601,7 +590,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			 * complete segment.
 			 */
 			StatusFilePath(pathbuf, walFileName, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/*
@@ -629,7 +618,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			/* unconditionally mark file as archived */
 			StatusFilePath(pathbuf, fname, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/* Properly terminate the tar file. */
@@ -1040,22 +1029,21 @@ SendBaseBackup(BaseBackupCmd *cmd)
  */
 static void
 sendFileWithContent(bbsink *sink, const char *filename, const char *content,
-					backup_manifest_info *manifest)
+					int len, backup_manifest_info *manifest)
 {
 	struct stat statbuf;
-	int			bytes_done = 0,
-				len;
+	int			bytes_done = 0;
 	pg_checksum_context checksum_ctx;
 
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 filename);
 
-	len = strlen(content);
+	if (len < 0)
+		len = strlen(content);
 
 	/*
-	 * Construct a stat struct for the backup_label file we're injecting in
-	 * the tar.
+	 * Construct a stat struct for the file we're injecting in the tar.
 	 */
 	/* Windows doesn't have the concept of uid and gid */
 #ifdef WIN32
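The new `len` argument to sendFileWithContent() matters because pg_control is binary: strlen() would stop at the first zero byte. A small sketch of the length convention used above, where a negative value means "measure the string", might look like this. `resolve_content_len()` is a hypothetical helper for illustration only; the patch inlines the check.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Resolve the number of bytes to send.  A negative len means the content is
 * a NUL-terminated string whose length should be measured; otherwise len is
 * used as given, which is what binary content with embedded zero bytes needs.
 */
size_t
resolve_content_len(const char *content, int len)
{
	assert(content != NULL);

	if (len < 0)
		return strlen(content);

	return (size_t) len;
}
```

With this convention the existing text callers (tablespace_map, the empty ".done" files) pass -1, while the pg_control copy passes PG_CONTROL_MAX_SAFE_SIZE explicitly.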
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 35d738d5763..24bf34b45eb 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -384,13 +384,15 @@ BEGIN ATOMIC
 END;
 
 CREATE OR REPLACE FUNCTION
-  pg_backup_start(label text, fast boolean DEFAULT false)
-  RETURNS pg_lsn STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
+  pg_backup_start(label text, fast boolean DEFAULT false, OUT lsn pg_lsn,
+        OUT timeline_id int8, OUT start timestamptz)
+  RETURNS record STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
   PARALLEL RESTRICTED;
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
-        wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        wait_for_archive boolean DEFAULT true, OUT pg_control_file bytea,
+        OUT tablespace_map_file text, OUT lsn pg_lsn, OUT timeline_id int8,
+        OUT stop timestamptz)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b4..c655cb03352 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -171,8 +171,8 @@ SKIP:
 
 # Write some files to test that they are not copied.
 foreach my $filename (
-	qw(backup_label tablespace_map postgresql.auto.conf.tmp
-	current_logfiles.tmp global/pg_internal.init.123))
+	qw(tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp
+	   global/pg_internal.init.123))
 {
 	open my $file, '>>', "$pgdata/$filename";
 	print $file "DONOTCOPY";
@@ -261,14 +261,13 @@ foreach my $filename (@tempRelationFiles)
 		"base/$postgresOid/$filename not copied");
 }
 
-# Make sure existing backup_label was ignored.
-isnt(slurp_file("$tempdir/backup/backup_label"),
-	'DONOTCOPY', 'existing backup_label not copied');
+# Make sure existing tablespace_map was ignored.
+ok(!-f "$tempdir/backup/tablespace_map", 'tablespace_map not in backup');
 rmtree("$tempdir/backup");
 
-# Now delete the bogus backup_label file since it will interfere with startup
-unlink("$pgdata/backup_label")
-  or BAIL_OUT("unable to unlink $pgdata/backup_label");
+# Now delete the bogus tablespace_map file since it will interfere with startup
+unlink("$pgdata/tablespace_map")
+  or BAIL_OUT("unable to unlink $pgdata/tablespace_map");
 
 $node->command_ok(
 	[
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93e0837947c..cc515b622ff 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -277,10 +277,18 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
+	printf(_("Backup checkpoint location:           %X/%X\n"),
+		   LSN_FORMAT_ARGS(ControlFile->backupCheckPoint));
 	printf(_("Backup start location:                %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupStartPoint));
+	printf(_("Backup start location's timeline:     %u\n"),
+		   ControlFile->backupStartPointTLI);
 	printf(_("Backup end location:                  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
+	printf(_("Backup recovery required:             %s\n"),
+		   ControlFile->backupRecoveryRequired ? _("yes") : _("no"));
+	printf(_("Backup from standby:                  %s\n"),
+		   ControlFile->backupFromStandby ? _("yes") : _("no"));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ecadd69dc53..213f4e71b88 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -139,11 +139,10 @@ static const struct exclude_list_item excludeFiles[] =
 	{"pg_internal.init", true}, /* defined as RELCACHE_INIT_FILENAME */
 
 	/*
-	 * If there is a backup_label or tablespace_map file, it indicates that a
-	 * recovery failed and this cluster probably can't be rewound, but exclude
-	 * them anyway if they are found.
+	 * If there is a tablespace_map file, it indicates that a recovery failed
+	 * and this cluster probably can't be rewound, but exclude it anyway if it
+	 * is found.
 	 */
-	{"backup_label", false},	/* defined as BACKUP_LABEL_FILE */
 	{"tablespace_map", false},	/* defined as TABLESPACE_MAP */
 
 	/*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index bfd44a284e2..f42782e2eab 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -39,9 +39,6 @@ static void perform_rewind(filemap_t *filemap, rewind_source *source,
 						   TimeLineID chkpttli,
 						   XLogRecPtr chkptredo);
 
-static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
-							  XLogRecPtr checkpointloc);
-
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
@@ -654,7 +651,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		pg_log_info("creating backup label and updating control file");
 
 	/*
-	 * Create a backup label file, to tell the target where to begin the WAL
+	 * Get recovery fields to tell the target where to begin the WAL
 	 * replay. Normally, from the last common checkpoint between the source
 	 * and the target. But if the source is a standby server, it's possible
 	 * that the last common checkpoint is *after* the standby's restartpoint.
@@ -672,7 +669,6 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		chkpttli = ControlFile_source.checkPointCopy.ThisTimeLineID;
 		chkptrec = ControlFile_source.checkPoint;
 	}
-	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
 	/*
 	 * Update control file of target, to tell the target how far it must
@@ -722,6 +718,12 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupRecoveryRequired = true;
+	ControlFile_new.backupFromStandby = true;
+	ControlFile_new.backupEndRequired = false;
+	ControlFile_new.backupCheckPoint = chkptrec;
+	ControlFile_new.backupStartPoint = chkptredo;
+	ControlFile_new.backupStartPointTLI = chkpttli;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
@@ -729,7 +731,10 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 static void
 sanityChecks(void)
 {
-	/* TODO Check that there's no backup_label in either cluster */
+	/*
+	 * TODO Check that neither cluster has backupRecoveryRequired set in
+	 * pg_control.
+	 */
 
 	/* Check system_identifier match */
 	if (ControlFile_target.system_identifier != ControlFile_source.system_identifier)
@@ -951,51 +956,6 @@ findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, int a_nentries,
 	}
 }
 
-
-/*
- * Create a backup_label file that forces recovery to begin at the last common
- * checkpoint.
- */
-static void
-createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpointloc)
-{
-	XLogSegNo	startsegno;
-	time_t		stamp_time;
-	char		strfbuf[128];
-	char		xlogfilename[MAXFNAMELEN];
-	struct tm  *tmp;
-	char		buf[1000];
-	int			len;
-
-	XLByteToSeg(startpoint, startsegno, WalSegSz);
-	XLogFileName(xlogfilename, starttli, startsegno, WalSegSz);
-
-	/*
-	 * Construct backup label file
-	 */
-	stamp_time = time(NULL);
-	tmp = localtime(&stamp_time);
-	strftime(strfbuf, sizeof(strfbuf), "%Y-%m-%d %H:%M:%S %Z", tmp);
-
-	len = snprintf(buf, sizeof(buf),
-				   "START WAL LOCATION: %X/%X (file %s)\n"
-				   "CHECKPOINT LOCATION: %X/%X\n"
-				   "BACKUP METHOD: pg_rewind\n"
-				   "BACKUP FROM: standby\n"
-				   "START TIME: %s\n",
-	/* omit LABEL: line */
-				   LSN_FORMAT_ARGS(startpoint), xlogfilename,
-				   LSN_FORMAT_ARGS(checkpointloc),
-				   strfbuf);
-	if (len >= sizeof(buf))
-		pg_fatal("backup label buffer too small");	/* shouldn't happen */
-
-	/* TODO: move old file out of the way, if any. */
-	open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
-	write_target_range(buf, 0, len);
-	close_target_file();
-}
-
 /*
  * Check CRC of control file
  */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164f..3aac6839a70 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -293,8 +293,6 @@ extern SessionBackupState get_backup_status(void);
 /* File path names (all relative to $PGDATA) */
 #define RECOVERY_SIGNAL_FILE	"recovery.signal"
 #define STANDBY_SIGNAL_FILE		"standby.signal"
-#define BACKUP_LABEL_FILE		"backup_label"
-#define BACKUP_LABEL_OLD		"backup_label.old"
 
 #define TABLESPACE_MAP			"tablespace_map"
 #define TABLESPACE_MAP_OLD		"tablespace_map.old"
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137b..f2c3672fed6 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -15,6 +15,7 @@
 #define XLOG_BACKUP_H
 
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "pgtime.h"
 
 /* Structure to hold backup state. */
@@ -33,9 +34,18 @@ typedef struct BackupState
 	XLogRecPtr	stoppoint;		/* backup stop WAL location */
 	TimeLineID	stoptli;		/* backup stop TLI */
 	pg_time_t	stoptime;		/* backup stop time */
+
+	/*
+	 * After pg_backup_stop() returns this field will contain a copy of
+	 * pg_control that should be stored with the backup. Fields have been
+	 * updated for recovery and the CRC has been recalculated. The buffer
+	 * is padded to PG_CONTROL_MAX_SAFE_SIZE so that pg_control is always
+	 * a consistent size but smaller (and hopefully easier to handle) than
+	 * PG_CONTROL_FILE_SIZE. Bytes after sizeof(ControlFileData) are zeroed.
+	 */
+	uint8_t controlFile[PG_CONTROL_MAX_SAFE_SIZE];
 } BackupState;
 
-extern char *build_backup_content(BackupState *state,
-								  bool ishistoryfile);
+extern char *build_backup_history_content(BackupState *state);
 
 #endif							/* XLOG_BACKUP_H */
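The BackupState comment above describes a fixed-size buffer: the control data is copied to the front and every byte after it is zero. A minimal sketch of that layout follows, assuming a buffer size of 512 to mirror PG_CONTROL_MAX_SAFE_SIZE in pg_control.h. `CONTROL_BUF_SIZE` and `fill_control_buffer()` are hypothetical names, and the CRC recalculation the comment mentions is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed to match PG_CONTROL_MAX_SAFE_SIZE; hard-coded for this sketch. */
#define CONTROL_BUF_SIZE 512

/*
 * Copy control_size bytes of control data into buf and zero the remainder,
 * so the stored file always has the same size regardless of struct size.
 */
void
fill_control_buffer(uint8_t *buf, const void *control, size_t control_size)
{
	assert(buf != NULL && control != NULL);
	assert(control_size <= CONTROL_BUF_SIZE);

	memset(buf, 0, CONTROL_BUF_SIZE);
	memcpy(buf, control, control_size);
}
```

A backup tool receiving this buffer can write all 512 bytes as pg_control without knowing sizeof(ControlFileData) for the server version it is backing up.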
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index ee0bc742782..981266f7340 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -80,8 +80,7 @@ extern Size XLogRecoveryShmemSize(void);
 extern void XLogRecoveryShmemInit(void);
 
 extern void InitWalRecovery(ControlFileData *ControlFile,
-							bool *wasShutdown_ptr, bool *haveBackupLabel_ptr,
-							bool *haveTblspcMap_ptr);
+							bool *wasShutdown_ptr, bool *haveTblspcMap_ptr);
 extern void PerformWalRecovery(void);
 
 /*
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 2ae72e3b266..258da052563 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * backupCheckPoint is the backup start checkpoint and is set to zero after
+	 * recovery is initialized.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -160,14 +163,27 @@ typedef struct ControlFileData
 	 * pg_control which was backed up last. It is reset to zero when the end
 	 * of backup is reached, and we mustn't start up before that.
 	 *
+	 * backupRecoveryRequired indicates that the pg_control file was provided
+	 * by a backup or pg_rewind and recovery settings need to be copied. It will
+	 * be set to false when the settings have been copied.
+	 *
+	 * backupFromStandby indicates that the backup was taken on a standby. It is
+	 * require to initialize recovery and set to false afterwards.
+	 *
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
-	 * start up.
+	 * start up. Currently backupEndRequired should only be false if recovery
+	 * settings were configured by pg_rewind, which does not require an end
+	 * point.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogRecPtr	backupCheckPoint;
 	XLogRecPtr	backupStartPoint;
+	TimeLineID	backupStartPointTLI;
 	XLogRecPtr	backupEndPoint;
+	bool 		backupRecoveryRequired;
+	bool 		backupFromStandby;
 	bool		backupEndRequired;
 
 	/*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f14aed422a7..cc8156c57e7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6413,13 +6413,17 @@
   prosrc => 'pg_terminate_backend' },
 { oid => '2172', descr => 'prepare for taking an online backup',
   proname => 'pg_backup_start', provolatile => 'v', proparallel => 'r',
-  prorettype => 'pg_lsn', proargtypes => 'text bool',
+  prorettype => 'record', proargtypes => 'text bool',
+  proallargtypes => '{text,bool,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,i,o,o,o}',
+  proargnames => '{label,fast,lsn,timeline_id,start}',
   prosrc => 'pg_backup_start' },
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,bytea,text,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{wait_for_archive,pg_control_file,tablespace_map_file,lsn,timeline_id,stop}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
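
For readers following the BackupState comment in the xlogbackup.h hunk above, here is a minimal standalone sketch of how such a copy could be prepared: copy the in-memory control data, set the recovery fields, recalculate the CRC over everything before the crc member, and zero-pad the result to a fixed size. SketchControlData, the sketch_ functions, and the 512-byte SKETCH_CONTROL_SAFE_SIZE are assumptions made for illustration only. The patch itself works on ControlFileData, PG_CONTROL_MAX_SAFE_SIZE, and the INIT_CRC32C/COMP_CRC32C/FIN_CRC32C macros, and takes ControlFileLock while copying.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PG_CONTROL_MAX_SAFE_SIZE; the value is an assumption. */
#define SKETCH_CONTROL_SAFE_SIZE 512

/* Hypothetical, heavily reduced stand-in for ControlFileData. */
typedef struct SketchControlData
{
	uint64_t	minRecoveryPoint;
	uint64_t	backupCheckPoint;
	uint64_t	backupStartPoint;
	uint32_t	backupStartPointTLI;
	bool		backupRecoveryRequired;
	bool		backupFromStandby;
	bool		backupEndRequired;
	uint32_t	crc;			/* kept last so it covers all other fields */
} SketchControlData;

/* Plain bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
sketch_crc32c(const void *data, size_t len)
{
	const uint8_t *p = data;
	uint32_t	crc = 0xFFFFFFFFu;

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc ^ 0xFFFFFFFFu;
}

/*
 * Build the control data that would be handed to the backup tool:
 * recovery fields set, CRC recalculated, remainder of the buffer zeroed.
 */
void
sketch_build_backup_control(uint8_t out[SKETCH_CONTROL_SAFE_SIZE],
							const SketchControlData *live,
							uint64_t checkpointloc, uint64_t startpoint,
							uint32_t starttli, bool from_standby)
{
	SketchControlData copy;

	assert(sizeof(SketchControlData) <= SKETCH_CONTROL_SAFE_SIZE);

	memset(&copy, 0, sizeof(copy));
	memcpy(&copy, live, sizeof(copy));

	copy.backupRecoveryRequired = true;
	copy.backupFromStandby = from_standby;
	copy.backupEndRequired = true;
	copy.backupCheckPoint = checkpointloc;
	copy.backupStartPoint = startpoint;
	copy.backupStartPointTLI = starttli;

	/* The CRC covers every byte that precedes the crc member. */
	copy.crc = sketch_crc32c(&copy, offsetof(SketchControlData, crc));

	memset(out, 0, SKETCH_CONTROL_SAFE_SIZE);
	memcpy(out, &copy, sizeof(copy));
}
```

A backup tool would then store the whole fixed-size buffer as global/pg_control, byte for byte, without ever reading the live file from PGDATA.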

#11

michael@paquier.xyz

about 2 years ago

In reply to: David Steele (#10)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:

On 11/6/23 02:35, Michael Paquier wrote:

The patch is failing the recovery test 039_end_of_wal.pl. Could you
look at the failure?

I'm not seeing this failure, and CI seems happy [1]. Can you give details of
the error message?

I've retested today and can no longer reproduce the failure. I'll let you
know if I see it again.

+        /* Clear fields used to initialize recovery */
+        ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+        ControlFile->backupStartPointTLI = 0;
+        ControlFile->backupRecoveryRequired = false;
+        ControlFile->backupFromStandby = false;
These variables in the control file were previously cleaned up when the
backup_label file was read, but backup_label is renamed to
backup_label.old a bit later than that. Your logic looks correct seen
from here, but shouldn't these variables be set much later, i.e. just
*after* UpdateControlFile()? This gap between the initialization of
the control file and the in-memory reset makes the code quite brittle,
IMO.

Yeah, sorry, there's a thinko from me here. I meant to reset these
variables just before the UpdateControlFile() that follows
InitWalRecovery() in StartupXLOG(), much closer to it.

If we set these fields where backup_label was renamed, the logic would not
be exactly the same since pg_control won't be updated until the next time
through the loop. Since the fields should be updated before
UpdateControlFile() I thought it made sense to keep all the updates
together.

Overall I think it is simpler, and we don't need to acquire a lock on
ControlFile.

What you are proposing is the same as what we already do for
backupEndRequired or backupStartPoint in the control file when
initializing recovery, so objection withdrawn.
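
The ordering agreed on here can be sketched in isolation as follows. This is only an illustration under assumed names: SketchControl, sketch_disk, and the sketch_ functions are hypothetical stand-ins, not PostgreSQL code. The point is that the init-only fields are cleared in memory immediately before the single control-file update, so the in-memory and on-disk copies do not drift apart in between.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the recovery-related part of the control data. */
typedef struct SketchControl
{
	uint64_t	backupCheckPoint;
	uint64_t	backupStartPoint;
	uint32_t	backupStartPointTLI;
	bool		backupRecoveryRequired;
	bool		backupFromStandby;
	bool		backupEndRequired;
} SketchControl;

/* Stands in for pg_control on disk. */
SketchControl sketch_disk;

/* Stands in for UpdateControlFile(): persist the in-memory state. */
void
sketch_update_control_file(const SketchControl *mem)
{
	sketch_disk = *mem;
}

/*
 * Clear the fields that are only needed to initialize recovery, then
 * persist everything in one step.  backupStartPoint and backupEndRequired
 * are left alone because they stay in use until end of backup is reached.
 */
void
sketch_finish_recovery_init(SketchControl *mem)
{
	mem->backupCheckPoint = 0;
	mem->backupStartPointTLI = 0;
	mem->backupRecoveryRequired = false;
	mem->backupFromStandby = false;

	sketch_update_control_file(mem);

	/* After the update, disk and memory agree on the cleared switch. */
	assert(sketch_disk.backupRecoveryRequired == mem->backupRecoveryRequired);
}
```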

Thomas included this change in his pg_basebackup changes so I did the same.
Maybe wait a bit before we split this out? Seems like a pretty small
change...

Seems like a pretty good argument for refactoring that now, and let
any other patches rely on it. Would you like to send a separate
patch?

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce the size further with some alignment magic, no?

I thought about this, but it seemed to me that existing fields had been
positioned to make the grouping logical rather than to optimize alignment,
e.g. minRecoveryPointTLI. Ideally that would have been placed near
backupEndRequired (or vice versa). But if the general opinion is to
rearrange for alignment, I'm OK with that.

I've not tested, but it looks like moving backupStartPointTLI after
backupEndPoint should shave 8 bytes, if you want to maintain a more
coherent group for the LSNs.
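
The padding arithmetic behind this suggestion can be checked with a small standalone sketch. It assumes a platform where 8-byte integers are 8-byte aligned, and it looks only at this group of fields in isolation; the struct and function names are hypothetical, and the saving inside the full ControlFileData also depends on the neighbouring fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t SketchXLogRecPtr;	/* stand-in for XLogRecPtr */
typedef uint32_t SketchTimeLineID;	/* stand-in for TimeLineID */

/* Field order as posted: the TLI sits between two LSNs. */
struct LayoutAsPosted
{
	SketchXLogRecPtr minRecoveryPoint;
	SketchTimeLineID minRecoveryPointTLI;
	SketchXLogRecPtr backupCheckPoint;
	SketchXLogRecPtr backupStartPoint;
	SketchTimeLineID backupStartPointTLI;	/* followed by 4 bytes of padding */
	SketchXLogRecPtr backupEndPoint;
	bool		backupRecoveryRequired;
	bool		backupFromStandby;
	bool		backupEndRequired;
};

/* Suggested order: the TLI moves after backupEndPoint, next to the bools. */
struct LayoutRearranged
{
	SketchXLogRecPtr minRecoveryPoint;
	SketchTimeLineID minRecoveryPointTLI;
	SketchXLogRecPtr backupCheckPoint;
	SketchXLogRecPtr backupStartPoint;
	SketchXLogRecPtr backupEndPoint;
	SketchTimeLineID backupStartPointTLI;	/* shares a slot with the bools */
	bool		backupRecoveryRequired;
	bool		backupFromStandby;
	bool		backupEndRequired;
};

/* Bytes saved by the rearrangement for this isolated group of fields. */
size_t
layout_bytes_saved(void)
{
	assert(sizeof(struct LayoutRearranged) <= sizeof(struct LayoutAsPosted));
	return sizeof(struct LayoutAsPosted) - sizeof(struct LayoutRearranged);
}
```

With 8-byte alignment the first layout needs padding after backupStartPointTLI, while in the second the timeline ID and the three booleans fit into one 8-byte slot.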
--
Michael

#12

michael@paquier.xyz

about 2 years ago

In reply to: Michael Paquier (#11)

Re: Add recovery to pg_control and remove backup_label

On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:

On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
I've retested today and can no longer reproduce the failure. I'll let you
know if I see it again.

I've done a few dozen more runs, and still nothing. I am wondering
what this disturbance was.

If we set these fields where backup_label was renamed, the logic would not
be exactly the same since pg_control won't be updated until the next time
through the loop. Since the fields should be updated before
UpdateControlFile() I thought it made sense to keep all the updates
together.

Overall I think it is simpler, and we don't need to acquire a lock on
ControlFile.

What you are proposing is the same as what we already do for
backupEndRequired or backupStartPoint in the control file when
initializing recovery, so objection withdrawn.

Thomas included this change in his pg_basebackup changes so I did the same.
Maybe wait a bit before we split this out? Seems like a pretty small
change...

Seems like a pretty good argument for refactoring that now, and let
any other patches rely on it. Would you like to send a separate
patch?

The split still looks worth doing seen from here, so I am switching
the patch to WoA (waiting on author) for now.

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce the size further with some alignment magic, no?

I thought about this, but it seemed to me that existing fields had been
positioned to make the grouping logical rather than to optimize alignment,
e.g. minRecoveryPointTLI. Ideally that would have been placed near
backupEndRequired (or vice versa). But if the general opinion is to
rearrange for alignment, I'm OK with that.

I've not tested, but it looks like moving backupStartPointTLI after
backupEndPoint should shave 8 bytes, if you want to maintain a more
coherent group for the LSNs.

+    * backupFromStandby indicates that the backup was taken on a standby. It is
+    * require to initialize recovery and set to false afterwards.
s/require/required/.

The term "backup recovery" is one that we've never used in the tree
until now, as far as I know. Perhaps this recovery method should just be
referred to as "recovery from backup"?

By the way, there is another thing that this patch has forgotten: the
SQL functions that display data from the control file. Shouldn't
pg_control_recovery() be extended with the new fields? These fields
may be less critical than the other ones related to recovery, but I
suspect that showing them can become handy at least for debugging and
monitoring purposes.

Something else in this area: backupRecoveryRequired is the switch
controlling whether the fields are set by the recovery initialization.
Could it actually be useful to leave the other fields as they are and only
reset backupRecoveryRequired before the first control file update?
This would leave a trace of the backup history directly in the control
file.

What about pg_resetwal and RewriteControlFile()? Shouldn't these
recovery fields be reset as well?

git diff --check is complaining a bit.
--
Michael

#13

david@pgmasters.net

about 2 years ago

In reply to: Michael Paquier (#12)

2 attachment(s)

Re: Add recovery to pg_control and remove backup_label

On 11/10/23 00:37, Michael Paquier wrote:

On Tue, Nov 07, 2023 at 05:20:27PM +0900, Michael Paquier wrote:

On Mon, Nov 06, 2023 at 05:39:02PM -0400, David Steele wrote:
I've retested today and can no longer reproduce the failure. I'll let you
know if I see it again.

I've done a few dozen more runs, and still nothing. I am wondering
what this disturbance was.

OK, hopefully it was just a blip.

If we set these fields where backup_label was renamed, the logic would not
be exactly the same since pg_control won't be updated until the next time
through the loop. Since the fields should be updated before
UpdateControlFile() I thought it made sense to keep all the updates
together.

Overall I think it is simpler, and we don't need to acquire a lock on
ControlFile.

What you are proposing is the same as what we already do for
backupEndRequired or backupStartPoint in the control file when
initializing recovery, so objection withdrawn.

Thomas included this change in his pg_basebackup changes so I did the same.
Maybe wait a bit before we split this out? Seems like a pretty small
change...

Seems like a pretty good argument for refactoring that now, and let
any other patches rely on it. Would you like to send a separate
patch?

The split still looks worth doing seen from here, so I am switching
the patch to WoA (waiting on author) for now.

This has been split out.

Actually, grouping backupStartPointTLI and minRecoveryPointTLI should
reduce the size further with some alignment magic, no?

I thought about this, but it seemed to me that existing fields had been
positioned to make the grouping logical rather than to optimize alignment,
e.g. minRecoveryPointTLI. Ideally that would have been placed near
backupEndRequired (or vice versa). But if the general opinion is to
rearrange for alignment, I'm OK with that.

I've not tested, but it looks like moving backupStartPointTLI after
backupEndPoint should shave 8 bytes, if you want to maintain a more
coherent group for the LSNs.

OK, I have moved backupStartPointTLI.

+    * backupFromStandby indicates that the backup was taken on a standby. It is
+    * require to initialize recovery and set to false afterwards.
s/require/required/.

Fixed.

The term "backup recovery" is one that we've never used in the tree
until now, as far as I know. Perhaps this recovery method should just be
referred to as "recovery from backup"?

Well, "backup recovery" is less awkward, I think. For instance "backup
recovery field" vs "recovery from backup field".

By the way, there is another thing that this patch has forgotten: the
SQL functions that display data from the control file. Shouldn't
pg_control_recovery() be extended with the new fields? These fields
may be less critical than the other ones related to recovery, but I
suspect that showing them can become handy at least for debugging and
monitoring purposes.

I guess that depends on whether we reset them or not (discussion below).
Right now they would not be visible since by the time the user could log
on they would be reset.

Something else in this area: backupRecoveryRequired is the switch
controlling whether the fields are set by the recovery initialization.
Could it actually be useful to leave the other fields as they are and only
reset backupRecoveryRequired before the first control file update?
This would leave a trace of the backup history directly in the control
file.

Since the other recovery fields are cleared in ReachedEndOfBackup() this
would be a change from what we do now.

None of these fields are ever visible (with the exception of
minRecoveryPoint/TLI) since they are reset when the database becomes
consistent and before logons are allowed. Viewing them with
pg_controldata makes sense, but I'm not sure pg_control_recovery() does.

In fact, are backup_start_lsn, backup_end_lsn, and
end_of_backup_record_required ever non-zero when logged onto Postgres?
Maybe I'm missing something?

What about pg_resetwal and RewriteControlFile()? Shouldn't these
recovery fields be reset as well?

Done.

git diff --check is complaining a bit.

Fixed.

New patches attached based on eb81e8e790.

Regards,
-David

Attachments:

recovery-in-pgcontrol-v4-0001-pass-len-to-sendFileWithContent.patch (text/plain; charset=UTF-8)

From d241a0e8aa589b3abf66331f8a3af0aabe87c214 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Fri, 10 Nov 2023 17:50:54 +0000
Subject: Allow content size to be passed to sendFileWithContent().

sendFileWithContent() previously got the content length by using strlen(),
but it is also possible to pass binary content. Use len == -1 to indicate
that strlen() should be used to get the content length, otherwise honor the
value in len.
---
 src/backend/backup/basebackup.c | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index b537f462197..f216b588422 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -94,7 +94,7 @@ static bool verify_page_checksum(Page page, XLogRecPtr start_lsn,
 								 BlockNumber blkno,
 								 uint16 *expected_checksum);
 static void sendFileWithContent(bbsink *sink, const char *filename,
-								const char *content,
+								const char *content, int len,
 								backup_manifest_info *manifest);
 static int64 _tarWriteHeader(bbsink *sink, const char *filename,
 							 const char *linktarget, struct stat *statbuf,
@@ -334,14 +334,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 				/* In the main tar, include the backup_label first... */
 				backup_label = build_backup_content(backup_state, false);
 				sendFileWithContent(sink, BACKUP_LABEL_FILE,
-									backup_label, &manifest);
+									backup_label, -1, &manifest);
 				pfree(backup_label);
 
 				/* Then the tablespace_map file, if required... */
 				if (opt->sendtblspcmapfile)
 				{
 					sendFileWithContent(sink, TABLESPACE_MAP,
-										tablespace_map->data, &manifest);
+										tablespace_map->data, -1, &manifest);
 					sendtblspclinks = false;
 				}
 
@@ -601,7 +601,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			 * complete segment.
 			 */
 			StatusFilePath(pathbuf, walFileName, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/*
@@ -629,7 +629,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			/* unconditionally mark file as archived */
 			StatusFilePath(pathbuf, fname, ".done");
-			sendFileWithContent(sink, pathbuf, "", &manifest);
+			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
 
 		/* Properly terminate the tar file. */
@@ -1040,22 +1040,22 @@ SendBaseBackup(BaseBackupCmd *cmd)
  */
 static void
 sendFileWithContent(bbsink *sink, const char *filename, const char *content,
-					backup_manifest_info *manifest)
+					int len, backup_manifest_info *manifest)
 {
 	struct stat statbuf;
-	int			bytes_done = 0,
-				len;
+	int			bytes_done = 0;
 	pg_checksum_context checksum_ctx;
 
 	if (pg_checksum_init(&checksum_ctx, manifest->checksum_type) < 0)
 		elog(ERROR, "could not initialize checksum of file \"%s\"",
 			 filename);
 
-	len = strlen(content);
+	/* If len less than zero treat content as a string */
+	if (len < 0)
+		len = strlen(content);
 
 	/*
-	 * Construct a stat struct for the backup_label file we're injecting in
-	 * the tar.
+	 * Construct a stat struct for the file we're injecting in the tar.
 	 */
 	/* Windows doesn't have the concept of uid and gid */
 #ifdef WIN32
-- 
2.34.1
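
As a standalone illustration of the length convention this patch introduces, the sketch below isolates the rule: a negative len means the content is a NUL-terminated string, while any other value is taken as the exact byte count, which allows binary content with embedded zero bytes. sketch_content_length() is a hypothetical helper written for this example; in the patch the check lives inline in sendFileWithContent().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Resolve the number of bytes to send.
 *
 * len < 0:  content is a NUL-terminated string, measure it with strlen().
 * len >= 0: content may be binary, so trust the caller's length.
 */
size_t
sketch_content_length(const char *content, int len)
{
	assert(content != NULL);

	if (len < 0)
		return strlen(content);

	return (size_t) len;
}
```

Existing string callers keep their behavior by passing -1, while a caller sending a binary buffer such as a control file copy passes the buffer size explicitly, since strlen() would stop at the first zero byte.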

recovery-in-pgcontrol-v4-0002-remove-backuplabel.patch (text/plain; charset=UTF-8)

From 79c33f300dd5bb9aabd08f27d2e6bb5857190524 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Fri, 10 Nov 2023 17:50:55 +0000
Subject: Add recovery to pg_control and remove backup_label.

Simplify and harden recovery by getting rid of backup_label and storing
recovery information directly in pg_control. Instead of backup software copying
pg_control from PGDATA, it stores an updated version that is returned from
pg_backup_stop(). This is better for the following reasons:

* The user can no longer remove backup_label and get what looks like a
successful restore (while almost certainly causing corruption). If pg_control
is removed the cluster will not start. The user may try pg_resetwal, but I
think that tool makes it pretty clear that corruption will result from its use.
We could also modify pg_resetwal to complain if recovery info is present in
pg_control.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves [2] without needing to write pg_control via a temp
file, which may affect performance on a standby. Unfortunately, this solution
cannot be back patched.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

Since backup_label is now gone, the fields that used to be in backup_label are
now provided as columns returned from pg_backup_start() and pg_backup_stop() and
the backup history file is still written to the archive.
---
 doc/src/sgml/backup.sgml                     |  31 ++-
 doc/src/sgml/func.sgml                       |  39 ++-
 doc/src/sgml/ref/pg_rewind.sgml              |   3 +-
 src/backend/access/transam/xlog.c            |  64 ++---
 src/backend/access/transam/xlogbackup.c      |  39 ++-
 src/backend/access/transam/xlogfuncs.c       |  50 ++--
 src/backend/access/transam/xlogrecovery.c    | 250 ++++---------------
 src/backend/backup/basebackup.c              |  35 +--
 src/backend/catalog/system_functions.sql     |  10 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  15 +-
 src/bin/pg_controldata/pg_controldata.c      |   8 +
 src/bin/pg_resetwal/pg_resetwal.c            |   4 +
 src/bin/pg_rewind/filemap.c                  |   7 +-
 src/bin/pg_rewind/pg_rewind.c                |  62 +----
 src/include/access/xlog.h                    |   2 -
 src/include/access/xlogbackup.h              |  14 +-
 src/include/access/xlogrecovery.h            |   3 +-
 src/include/catalog/pg_control.h             |  18 +-
 src/include/catalog/pg_proc.dat              |  10 +-
 19 files changed, 268 insertions(+), 396 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae54..584384875be 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      ready to archive.
     </para>
     <para>
-     <function>pg_backup_stop</function> will return one row with three
-     values. The second of these fields should be written to a file named
-     <filename>backup_label</filename> in the root directory of the backup. The
-     third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <function>pg_backup_stop</function> returns the
+     <filename>pg_control</filename> file, which must be stored in the
+     <filename>global</filename> directory of the backup. It also returns the
+     <filename>tablespace_map</filename> file, which should be written in the
+     root directory of the backup unless the field is empty. These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
     <para>
      Once the WAL segment files active during the backup are archived, you are
-     done.  The file identified by <function>pg_backup_stop</function>'s first return
+     done.  The file identified by <function>pg_backup_stop</function>'s
+     <parameter>lsn</parameter> return
      value is the last segment that is required to form a complete set of
      backup files.  On a primary, if <varname>archive_mode</varname> is enabled and the
      <literal>wait_for_archive</literal> parameter is <literal>true</literal>,
@@ -1013,7 +1014,15 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You must exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>pg_control_file</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This file contains the information
+    required to safely recover.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
@@ -1062,11 +1071,11 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
+    The backup history file (which is archived like WAL) includes the label
+    string you gave to <function>pg_backup_start</function>,
     as well as the time at which <function>pg_backup_start</function> was run, and
     the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
+    possible to look inside a backup history file and determine exactly which
     backup session the dump file came from.  The tablespace map file includes
     the symbolic link names as they exist in the directory
     <filename>pg_tblspc/</filename> and the full path of each symbolic link.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d963f0a0a00..ed3e5b9dce6 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26845,7 +26845,10 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <parameter>label</parameter> <type>text</type>
           <optional>, <parameter>fast</parameter> <type>boolean</type>
           </optional> )
-        <returnvalue>pg_lsn</returnvalue>
+        <returnvalue>record</returnvalue>
+        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>start</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Prepares the server to begin an on-line backup.  The only required
@@ -26857,6 +26860,13 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         as possible.  This forces an immediate checkpoint which will cause a
         spike in I/O operations, slowing any concurrently executing queries.
        </para>
+       <para>
+        The result columns contain information about the start of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        starting write-ahead log location, the
+        <parameter>timeline_id</parameter> column holds the starting timeline,
+        and the <parameter>start</parameter> column holds the starting timestamp.
+       </para>
        <para>
         This function is restricted to superusers by default, but other users
         can be granted EXECUTE to run the function.
@@ -26872,13 +26882,15 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <optional><parameter>wait_for_archive</parameter> <type>boolean</type>
           </optional> )
         <returnvalue>record</returnvalue>
-        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
-        <parameter>labelfile</parameter> <type>text</type>,
-        <parameter>spcmapfile</parameter> <type>text</type> )
+        ( <parameter>pg_control_file</parameter> <type>bytea</type>,
+        <parameter>tablespace_map_file</parameter> <type>text</type>,
+        <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>stop</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Finishes performing an on-line backup.  The desired contents of the
-        backup label file and the tablespace map file are returned as part of
+        pg_control file and the tablespace map file are returned as part of
         the result of the function and must be written to files in the
         backup area.  These files must not be written to the live data directory
         (doing so will cause PostgreSQL to fail to restart in the event of a
@@ -26910,13 +26922,16 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         backup.
        </para>
        <para>
-        The result of the function is a single record.
-        The <parameter>lsn</parameter> column holds the backup's ending
-        write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
-        stored as part of the backup and are required as part of the restore
-        process.
+        The result of the function is a single record. The first column returns
+        the contents of the <filename>pg_control</filename> file and the
+        second column returns the contents of the
+        <filename>tablespace_map</filename> file.  These must be stored as part
+        of the backup and are required as part of the restore process. The
+        remainder of the columns contain information about the end of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        ending write-ahead log location, the <parameter>timeline_id</parameter>
+        column holds the ending timeline, and the <parameter>stop</parameter>
+        column holds the ending timestamp.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 8e0000d39fb..889add4c5e4 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -400,7 +400,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
       <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
       are omitted from the data copied from the source cluster. The files
-      <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
       <filename>postmaster.opts</filename>, and
@@ -410,7 +409,7 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
     </step>
     <step>
      <para>
-      Create a <filename>backup_label</filename> file to begin WAL replay at
+      Update <filename>pg_control</filename> file to begin WAL replay at
       the checkpoint created at failover and configure the
       <filename>pg_control</filename> file with a minimum consistency LSN
       defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a69..1689bc7d3a7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -74,6 +74,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_crc32c.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
@@ -5135,7 +5136,6 @@ StartupXLOG(void)
 	bool		wasShutdown;
 	bool		didCrash;
 	bool		haveTblspcMap;
-	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
 	TimeLineID	EndOfLogTLI;
 	TimeLineID	newTLI;
@@ -5259,13 +5259,14 @@ StartupXLOG(void)
 	/*
 	 * Prepare for WAL recovery if needed.
 	 *
-	 * InitWalRecovery analyzes the control file and the backup label file, if
-	 * any.  It updates the in-memory ControlFile buffer according to the
-	 * starting checkpoint, and sets InRecovery and ArchiveRecoveryRequested.
+	 * InitWalRecovery analyzes the control file and checks if backup recovery
+	 * has been requested.  It updates the in-memory ControlFile buffer
+	 * according to the starting checkpoint, and sets InRecovery and
+	 * ArchiveRecoveryRequested.
+	 *
 	 * It also applies the tablespace map file, if any.
 	 */
-	InitWalRecovery(ControlFile, &wasShutdown,
-					&haveBackupLabel, &haveTblspcMap);
+	InitWalRecovery(ControlFile, &wasShutdown, &haveTblspcMap);
 	checkPoint = ControlFile->checkPointCopy;
 
 	/* initialize shared memory variables from the checkpoint record */
@@ -5408,20 +5409,6 @@ StartupXLOG(void)
 		 */
 		UpdateControlFile();
 
-		/*
-		 * If there was a backup label file, it's done its job and the info
-		 * has now been propagated into pg_control.  We must get rid of the
-		 * label file so that if we crash during recovery, we'll pick up at
-		 * the latest recovery restartpoint instead of going all the way back
-		 * to the backup start point.  It seems prudent though to just rename
-		 * the file out of the way rather than delete it completely.
-		 */
-		if (haveBackupLabel)
-		{
-			unlink(BACKUP_LABEL_OLD);
-			durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
-		}
-
 		/*
 		 * If there was a tablespace_map file, it's done its job and the
 		 * symlinks have been created.  We must get rid of the map file so
@@ -5571,10 +5558,8 @@ StartupXLOG(void)
 	 * (at which point we reset backupStartPoint to be Invalid), for
 	 * backup-from-replica (which can't inject records into the WAL stream),
 	 * that point is when we reach the minRecoveryPoint in pg_control (which
-	 * we purposefully copy last when backing up from a replica).  For
-	 * pg_rewind (which creates a backup_label with a method of "pg_rewind")
-	 * or snapshot-style backups (which don't), backupEndRequired will be set
-	 * to false.
+	 * we purposefully copy last when backing up).  For pg_rewind or
+	 * snapshot-style backups, backupEndRequired will be set to false.
 	 *
 	 * Note: it is indeed okay to look at the local variable
 	 * LocalMinRecoveryPoint here, even though ControlFile->minRecoveryPoint
@@ -8744,11 +8729,33 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	ControlFileData *controlFileCopy = (ControlFileData *)state->controlFile;
 
 	Assert(state != NULL);
 
 	backup_stopped_in_recovery = RecoveryInProgress();
 
+	/*
+	 * Create a copy of control data and update it with fields required for
+	 * recovery. Also recalculate the CRC.
+	 */
+	memset(controlFileCopy, 0, PG_CONTROL_MAX_SAFE_SIZE);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFileCopy, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlFileCopy->backupRecoveryRequired = true;
+	controlFileCopy->backupFromStandby = backup_stopped_in_recovery;
+	controlFileCopy->backupEndRequired = true;
+	controlFileCopy->backupCheckPoint = state->checkpointloc;
+	controlFileCopy->backupStartPoint = state->startpoint;
+	controlFileCopy->backupStartPointTLI = state->starttli;
+
+	INIT_CRC32C(controlFileCopy->crc);
+	COMP_CRC32C(controlFileCopy->crc, controlFileCopy, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlFileCopy->crc);
+
 	/*
 	 * During recovery, we don't need to check WAL level. Because, if WAL
 	 * level is not sufficient, it's impossible to get here during recovery.
@@ -8850,11 +8857,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							 "Enable full_page_writes and run CHECKPOINT on the primary, "
 							 "and then try an online backup again.")));
 
-
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		state->stoppoint = ControlFile->minRecoveryPoint;
-		state->stoptli = ControlFile->minRecoveryPointTLI;
-		LWLockRelease(ControlFileLock);
+		state->stoppoint = controlFileCopy->minRecoveryPoint;
+		state->stoptli = controlFileCopy->minRecoveryPointTLI;
 	}
 	else
 	{
@@ -8896,7 +8900,7 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							histfilepath)));
 
 		/* Build and save the contents of the backup history file */
-		history_file = build_backup_content(state, true);
+		history_file = build_backup_history_content(state);
 		fprintf(fp, "%s", history_file);
 		pfree(history_file);
 
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae1..22c95f3c4c9 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -18,19 +18,19 @@
 #include "access/xlogbackup.h"
 
 /*
- * Build contents for backup_label or backup history file.
- *
- * When ishistoryfile is true, it creates the contents for a backup history
- * file, otherwise it creates contents for a backup_label file.
+ * Build contents for backup history file.
  *
  * Returns the result generated as a palloc'd string.
  */
 char *
-build_backup_content(BackupState *state, bool ishistoryfile)
+build_backup_history_content(BackupState *state)
 {
 	char		startstrbuf[128];
+	char		stopstrfbuf[128];
 	char		startxlogfile[MAXFNAMELEN]; /* backup start WAL file */
+	char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
 	XLogSegNo	startsegno;
+	XLogSegNo	stopsegno;
 	StringInfo	result = makeStringInfo();
 	char	   *data;
 
@@ -45,16 +45,10 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "START WAL LOCATION: %X/%X (file %s)\n",
 					 LSN_FORMAT_ARGS(state->startpoint), startxlogfile);
 
-	if (ishistoryfile)
-	{
-		char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
-		XLogSegNo	stopsegno;
-
-		XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
-		XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
-		appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
-						 LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
-	}
+	XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
+	XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
+	appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
+					 LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
 
 	appendStringInfo(result, "CHECKPOINT LOCATION: %X/%X\n",
 					 LSN_FORMAT_ARGS(state->checkpointloc));
@@ -65,17 +59,12 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "LABEL: %s\n", state->name);
 	appendStringInfo(result, "START TIMELINE: %u\n", state->starttli);
 
-	if (ishistoryfile)
-	{
-		char		stopstrfbuf[128];
-
-		/* Use the log timezone here, not the session timezone */
-		pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
-					pg_localtime(&state->stoptime, log_timezone));
+	/* Use the log timezone here, not the session timezone */
+	pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
+				pg_localtime(&state->stoptime, log_timezone));
 
-		appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
-		appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
-	}
+	appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
+	appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
 
 	data = result->data;
 	pfree(result);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 45a70668b1c..2388a60a5e5 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -53,7 +53,7 @@ static MemoryContext backupcontext = NULL;
  * pg_backup_start: set up for taking an on-line backup dump
  *
  * Essentially what this does is to create the contents required for the
- * backup_label file and the tablespace map.
+ * tablespace map.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -61,6 +61,10 @@ static MemoryContext backupcontext = NULL;
 Datum
 pg_backup_start(PG_FUNCTION_ARGS)
 {
+#define PG_BACKUP_START_V2_COLS 3
+	TupleDesc	tupdesc;
+	Datum		values[PG_BACKUP_START_V2_COLS] = {0};
+	bool		nulls[PG_BACKUP_START_V2_COLS] = {0};
 	text	   *backupid = PG_GETARG_TEXT_PP(0);
 	bool		fast = PG_GETARG_BOOL(1);
 	char	   *backupidstr;
@@ -69,6 +73,10 @@ pg_backup_start(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
+	/* Initialize attributes information in the tuple descriptor */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
 	if (status == SESSION_BACKUP_RUNNING)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -102,7 +110,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 	register_persistent_abort_backup_handler();
 	do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);
 
-	PG_RETURN_LSN(backup_state->startpoint);
+	values[0] = LSNGetDatum(backup_state->startpoint);
+	values[1] = Int64GetDatum(backup_state->starttli);
+	values[2] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->starttime));
+
+	/* Returns the record as Datum */
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
 
@@ -113,14 +126,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
  * allows the user to choose if they want to wait for the WAL to be archived
  * or if we should just return as soon as the WAL record is written.
  *
- * This function stops an in-progress backup, creates backup_label contents and
- * it returns the backup stop LSN, backup_label and tablespace_map contents.
+ * This function stops an in-progress backup and returns the backup stop LSN,
+ * pg_control and tablespace_map contents.
  *
- * The backup_label contains the user-supplied label string (typically this
- * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * The pg_control file contains the recovery information for the backup.  It is
+ * the caller's responsibility to write the pg_control and tablespace_map files
+ * in the data folder that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -128,12 +139,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 5
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
-	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -152,15 +163,16 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Stop the backup */
 	do_pg_backup_stop(backup_state, waitforarchive);
 
-	/* Build the contents of backup_label */
-	backup_label = build_backup_content(backup_state, false);
-
-	values[0] = LSNGetDatum(backup_state->stoppoint);
-	values[1] = CStringGetTextDatum(backup_label);
-	values[2] = CStringGetTextDatum(tablespace_map->data);
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_MAX_SAFE_SIZE);
 
-	/* Deallocate backup-related variables */
-	pfree(backup_label);
+	values[0] = PointerGetDatum(pg_control_bytea);
+	values[1] = CStringGetTextDatum(tablespace_map->data);
+	values[2] = LSNGetDatum(backup_state->stoppoint);
+	values[3] = Int64GetDatum(backup_state->stoptli);
+	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->stoptime));
 
 	/* Clean up the session-level state and its memory context */
 	backup_state = NULL;
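
The pg_control image handed back by pg_backup_stop() is built in do_pg_backup_stop(): the live control data is copied, the backup fields are set, and the CRC is recomputed over everything up to the crc member. A minimal, self-contained sketch of that copy-and-restamp pattern follows. DemoControlFile, demo_crc32c(), demo_prepare_backup_copy() and demo_crc_is_valid() are hypothetical stand-ins invented for illustration (they are not part of the patch), and the bitwise CRC-32C replaces the INIT_CRC32C/COMP_CRC32C/FIN_CRC32C macros.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for ControlFileData. */
typedef struct DemoControlFile
{
	uint64_t	backupStartPoint;
	uint64_t	backupCheckPoint;
	uint32_t	backupStartPointTLI;
	bool		backupRecoveryRequired;
	bool		backupEndRequired;
	uint32_t	crc;			/* kept last so it can cover all prior bytes */
} DemoControlFile;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t
demo_crc32c(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint32_t	crc = 0xFFFFFFFF;

	while (len--)
	{
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78 & (0U - (crc & 1)));
	}
	return crc ^ 0xFFFFFFFF;
}

/*
 * Copy the live control data, set the fields recovery will need, and
 * recompute the CRC over every byte that precedes the crc member.
 */
void
demo_prepare_backup_copy(DemoControlFile *copy, const DemoControlFile *live,
						 uint64_t startpoint, uint64_t checkpointloc,
						 uint32_t starttli)
{
	memcpy(copy, live, sizeof(DemoControlFile));

	copy->backupRecoveryRequired = true;
	copy->backupEndRequired = true;
	copy->backupCheckPoint = checkpointloc;
	copy->backupStartPoint = startpoint;
	copy->backupStartPointTLI = starttli;

	copy->crc = demo_crc32c(copy, offsetof(DemoControlFile, crc));
}

/* Return true if the stored CRC matches the bytes it is meant to cover. */
bool
demo_crc_is_valid(const DemoControlFile *cf)
{
	return cf->crc == demo_crc32c(cf, offsetof(DemoControlFile, crc));
}
```

A backup tool could run the equivalent of demo_crc_is_valid() on the bytes it received before storing them as global/pg_control, which would catch an image that was truncated or altered in transit.
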
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666aa..f43ea39f963 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -6,7 +6,7 @@
  * This source file contains functions controlling WAL recovery.
  * InitWalRecovery() initializes the system for crash or archive recovery,
  * or standby mode, depending on configuration options and the state of
- * the control file and possible backup label file.  PerformWalRecovery()
+ * the control file and any backup recovery info.  PerformWalRecovery()
  * performs the actual WAL replay, calling the rmgr-specific redo routines.
  * FinishWalRecovery() performs end-of-recovery checks and cleanup actions,
  * and prepares information needed to initialize the WAL for writes.  In
@@ -152,11 +152,12 @@ static bool recovery_signal_file_found = false;
 
 /*
  * CheckPointLoc is the position of the checkpoint record that determines
- * where to start the replay.  It comes from the backup label file or the
- * control file.
+ * where to start the replay.  It comes from the control file, either from the
+ * default location or from a backup recovery field.
  *
- * RedoStartLSN is the checkpoint's REDO location, also from the backup label
- * file or the control file.  In standby mode, XLOG streaming usually starts
+ * RedoStartLSN is the checkpoint's REDO location, also from the default
+ * control file location or from a backup recovery field.  In standby mode,
+ * XLOG streaming usually starts
  * from the position where an invalid record was found.  But if we fail to
  * read even the initial checkpoint record, we use the REDO location instead
  * of the checkpoint location as the start position of XLOG streaming.
@@ -388,9 +389,6 @@ static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, Time
 static void EnableStandbyMode(void);
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
-static bool read_backup_label(XLogRecPtr *checkPointLoc,
-							  TimeLineID *backupLabelTLI,
-							  bool *backupEndRequired, bool *backupFromStandby);
 static bool read_tablespace_map(List **tablespaces);
 
 static void xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI);
@@ -492,8 +490,8 @@ EnableStandbyMode(void)
  * Prepare the system for WAL recovery, if needed.
  *
  * This is called by StartupXLOG() which coordinates the server startup
- * sequence.  This function analyzes the control file and the backup label
- * file, if any, and figures out whether we need to perform crash recovery or
+ * sequence.  This function analyzes the control file and backup recovery
+ * info, if any, and figures out whether we need to perform crash recovery or
  * archive recovery, and how far we need to replay the WAL to reach a
  * consistent state.
  *
@@ -510,7 +508,7 @@ EnableStandbyMode(void)
  */
 void
 InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
-				bool *haveBackupLabel_ptr, bool *haveTblspcMap_ptr)
+				bool *haveTblspcMap_ptr)
 {
 	XLogPageReadPrivate *private;
 	struct stat st;
@@ -518,7 +516,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	XLogRecord *record;
 	DBState		dbstate_at_startup;
 	bool		haveTblspcMap = false;
-	bool		haveBackupLabel = false;
+	bool		backupRecoveryRequired = false;
 	CheckPoint	checkPoint;
 	bool		backupFromStandby = false;
 
@@ -549,7 +547,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 	/*
 	 * Set the WAL reading processor now, as it will be needed when reading
-	 * the checkpoint record required (backup_label or not).
+	 * the checkpoint record required (backup recovery required or not).
 	 */
 	private = palloc0(sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -585,18 +583,34 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	primary_image_masked = (char *) palloc(BLCKSZ);
 
 	/*
-	 * Read the backup_label file.  We want to run this part of the recovery
-	 * process after checking for signal files and after performing validation
-	 * of the recovery parameters.
+	 * Load recovery settings from pg_control.  We want to run this part of the
+	 * recovery process after checking for signal files and after performing
+	 * validation of the recovery parameters.
 	 */
-	if (read_backup_label(&CheckPointLoc, &CheckPointTLI, &backupEndRequired,
-						  &backupFromStandby))
+	if (ControlFile->backupRecoveryRequired)
 	{
 		List	   *tablespaces = NIL;
 
+		/* Initialize recovery from fields stored in pg_control */
+		CheckPointLoc = ControlFile->backupCheckPoint;
+		CheckPointTLI = ControlFile->backupStartPointTLI;
+		RedoStartLSN = ControlFile->backupStartPoint;
+		RedoStartTLI = ControlFile->backupStartPointTLI;
+		backupEndRequired = ControlFile->backupEndRequired;
+		backupFromStandby = ControlFile->backupFromStandby;
+
+		/* Clear fields used to initialize recovery */
+		ControlFile->backupCheckPoint = InvalidXLogRecPtr;
+		ControlFile->backupStartPointTLI = 0;
+		ControlFile->backupRecoveryRequired = false;
+		ControlFile->backupFromStandby = false;
+
+		/* Indicate that recovery was requested */
+		backupRecoveryRequired = true;
+
 		/*
-		 * Archive recovery was requested, and thanks to the backup label
-		 * file, we know how far we need to replay to reach consistency. Enter
+		 * Archive recovery was requested, and thanks to the recovery
+		 * info, we know how far we need to replay to reach consistency. Enter
 		 * archive recovery directly.
 		 */
 		InArchiveRecovery = true;
@@ -604,8 +618,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			EnableStandbyMode();
 
 		/*
-		 * When a backup_label file is present, we want to roll forward from
-		 * the checkpoint it identifies, rather than using pg_control.
+		 * When backup recovery is requested, we want to roll forward from
+		 * the checkpoint it identifies, rather than using the default
+		 * checkpoint.
 		 */
 		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc,
 									  CheckPointTLI);
@@ -620,9 +635,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 			/*
 			 * Make sure that REDO location exists. This may not be the case
-			 * if there was a crash during an online backup, which left a
-			 * backup_label around that references a WAL segment that's
-			 * already been archived.
+			 * if recovery.signal is missing and the WAL has already been
+			 * archived.
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
@@ -631,20 +645,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
-							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-									 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-									 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-									 DataDir, DataDir, DataDir, DataDir)));
+							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+									 DataDir, DataDir)));
 			}
 		}
 		else
 		{
 			ereport(FATAL,
 					(errmsg("could not locate required checkpoint record"),
-					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-							 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-							 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-							 DataDir, DataDir, DataDir, DataDir)));
+					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+							 DataDir, DataDir)));
 			wasShutdown = false;	/* keep compiler quiet */
 		}
 
@@ -679,37 +689,32 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			/* tell the caller to delete it later */
 			haveTblspcMap = true;
 		}
-
-		/* tell the caller to delete it later */
-		haveBackupLabel = true;
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
-
 		/*
-		 * If tablespace_map file is present without backup_label file, there
-		 * is no use of such file.  There is no harm in retaining it, but it
-		 * is better to get rid of the map file so that we don't have any
+		 * If tablespace_map file is present without backup recovery requested,
+		 * there is no use of such file.  There is no harm in retaining it, but
+		 * it is better to get rid of the map file so that we don't have any
 		 * redundant file in data directory and it will avoid any sort of
 		 * confusion.  It seems prudent though to just rename the file out of
 		 * the way rather than delete it completely, also we ignore any error
 		 * that occurs in rename operation as even if map file is present
-		 * without backup_label file, it is harmless.
+		 * without backup recovery requested, it is harmless.
 		 */
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
 			if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("File \"%s\" was renamed to \"%s\".",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 			else
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
@@ -943,7 +948,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * Any other state indicates that the backup somehow became corrupted
 		 * and we can't sensibly continue with recovery.
 		 */
-		if (haveBackupLabel)
+		if (backupRecoveryRequired)
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
@@ -953,7 +958,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY &&
 					dbstate_at_startup != DB_SHUTDOWNED_IN_RECOVERY)
 					ereport(FATAL,
-							(errmsg("backup_label contains data inconsistent with control file"),
+							(errmsg("pg_control contains inconsistent data for standby backup"),
 							 errhint("This means that the backup is corrupted and you will "
 									 "have to use another backup for recovery.")));
 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
@@ -983,7 +988,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	missingContrecPtr = InvalidXLogRecPtr;
 
 	*wasShutdown_ptr = wasShutdown;
-	*haveBackupLabel_ptr = haveBackupLabel;
 	*haveTblspcMap_ptr = haveTblspcMap;
 }
 
@@ -1156,154 +1160,6 @@ validateRecoveryParameters(void)
 	}
 }
 
-/*
- * read_backup_label: check to see if a backup_label file is present
- *
- * If we see a backup_label during recovery, we assume that we are recovering
- * from a backup dump file, and we therefore roll forward from the checkpoint
- * identified by the label file, NOT what pg_control says.  This avoids the
- * problem that pg_control might have been archived one or more checkpoints
- * later than the start of the dump, and so if we rely on it as the start
- * point, we will fail to restore a consistent database state.
- *
- * Returns true if a backup_label was found (and fills the checkpoint
- * location and TLI into *checkPointLoc and *backupLabelTLI, respectively);
- * returns false if not. If this backup_label came from a streamed backup,
- * *backupEndRequired is set to true. If this backup_label was created during
- * recovery, *backupFromStandby is set to true.
- *
- * Also sets the global variables RedoStartLSN and RedoStartTLI with the LSN
- * and TLI read from the backup file.
- */
-static bool
-read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
-				  bool *backupEndRequired, bool *backupFromStandby)
-{
-	char		startxlogfilename[MAXFNAMELEN];
-	TimeLineID	tli_from_walseg,
-				tli_from_file;
-	FILE	   *lfp;
-	char		ch;
-	char		backuptype[20];
-	char		backupfrom[20];
-	char		backuplabel[MAXPGPATH];
-	char		backuptime[128];
-	uint32		hi,
-				lo;
-
-	/* suppress possible uninitialized-variable warnings */
-	*checkPointLoc = InvalidXLogRecPtr;
-	*backupLabelTLI = 0;
-	*backupEndRequired = false;
-	*backupFromStandby = false;
-
-	/*
-	 * See if label file is present
-	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
-	if (!lfp)
-	{
-		if (errno != ENOENT)
-			ereport(FATAL,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
-		return false;			/* it's not there, all is fine */
-	}
-
-	/*
-	 * Read and parse the START WAL LOCATION and CHECKPOINT lines (this code
-	 * is pretty crude, but we are not expecting any variability in the file
-	 * format).
-	 */
-	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
-			   &hi, &lo, &tli_from_walseg, startxlogfilename, &ch) != 5 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	RedoStartLSN = ((uint64) hi) << 32 | lo;
-	RedoStartTLI = tli_from_walseg;
-	if (fscanf(lfp, "CHECKPOINT LOCATION: %X/%X%c",
-			   &hi, &lo, &ch) != 3 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	*checkPointLoc = ((uint64) hi) << 32 | lo;
-	*backupLabelTLI = tli_from_walseg;
-
-	/*
-	 * BACKUP METHOD lets us know if this was a typical backup ("streamed",
-	 * which could mean either pg_basebackup or the pg_backup_start/stop
-	 * method was used) or if this label came from somewhere else (the only
-	 * other option today being from pg_rewind).  If this was a streamed
-	 * backup then we know that we need to play through until we get to the
-	 * end of the WAL which was generated during the backup (at which point we
-	 * will have reached consistency and backupEndRequired will be reset to be
-	 * false).
-	 */
-	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
-	{
-		if (strcmp(backuptype, "streamed") == 0)
-			*backupEndRequired = true;
-	}
-
-	/*
-	 * BACKUP FROM lets us know if this was from a primary or a standby.  If
-	 * it was from a standby, we'll double-check that the control file state
-	 * matches that of a standby.
-	 */
-	if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)
-	{
-		if (strcmp(backupfrom, "standby") == 0)
-			*backupFromStandby = true;
-	}
-
-	/*
-	 * Parse START TIME and LABEL. Those are not mandatory fields for recovery
-	 * but checking for their presence is useful for debugging and the next
-	 * sanity checks. Cope also with the fact that the result buffers have a
-	 * pre-allocated size, hence if the backup_label file has been generated
-	 * with strings longer than the maximum assumed here an incorrect parsing
-	 * happens. That's fine as only minor consistency checks are done
-	 * afterwards.
-	 */
-	if (fscanf(lfp, "START TIME: %127[^\n]\n", backuptime) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup time %s in file \"%s\"",
-								 backuptime, BACKUP_LABEL_FILE)));
-
-	if (fscanf(lfp, "LABEL: %1023[^\n]\n", backuplabel) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup label %s in file \"%s\"",
-								 backuplabel, BACKUP_LABEL_FILE)));
-
-	/*
-	 * START TIMELINE is new as of 11. Its parsing is not mandatory, still use
-	 * it as a sanity check if present.
-	 */
-	if (fscanf(lfp, "START TIMELINE: %u\n", &tli_from_file) == 1)
-	{
-		if (tli_from_walseg != tli_from_file)
-			ereport(FATAL,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE),
-					 errdetail("Timeline ID parsed is %u, but expected %u.",
-							   tli_from_file, tli_from_walseg)));
-
-		ereport(DEBUG1,
-				(errmsg_internal("backup timeline %u in file \"%s\"",
-								 tli_from_file, BACKUP_LABEL_FILE)));
-	}
-
-	if (ferror(lfp) || FreeFile(lfp))
-		ereport(FATAL,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-
-	return true;
-}
-
 /*
  * read_tablespace_map: check to see if a tablespace_map file is present
  *
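
With read_backup_label() gone, InitWalRecovery() takes its starting point from the backup fields in pg_control and then clears most of them, which appears to take over the job the backup_label rename used to do: a crash during recovery should not send the next startup back to the backup start point. The sketch below illustrates that one-shot handoff under simplifying assumptions. DemoControl, DemoRecoveryStart and demo_choose_recovery_start() are hypothetical names, and the else branch is only a guess at the non-backup path, which this part of the patch does not show.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduced control data; field names follow the patch. */
typedef struct DemoControl
{
	uint64_t	checkPoint;		/* latest checkpoint (default start) */
	uint32_t	checkPointTLI;
	uint64_t	backupCheckPoint;
	uint64_t	backupStartPoint;
	uint32_t	backupStartPointTLI;
	bool		backupRecoveryRequired;
	bool		backupEndRequired;
	bool		backupFromStandby;
} DemoControl;

/* Hypothetical result: where replay should begin and why. */
typedef struct DemoRecoveryStart
{
	uint64_t	checkPointLoc;
	uint32_t	checkPointTLI;
	uint64_t	redoStartLSN;
	bool		backupEndRequired;
	bool		backupFromStandby;
	bool		fromBackup;
} DemoRecoveryStart;

DemoRecoveryStart
demo_choose_recovery_start(DemoControl *cf)
{
	DemoRecoveryStart start = {0};

	if (cf->backupRecoveryRequired)
	{
		/* Initialize recovery from the backup fields. */
		start.checkPointLoc = cf->backupCheckPoint;
		start.checkPointTLI = cf->backupStartPointTLI;
		start.redoStartLSN = cf->backupStartPoint;
		start.backupEndRequired = cf->backupEndRequired;
		start.backupFromStandby = cf->backupFromStandby;
		start.fromBackup = true;

		/* Clear the same fields the patch clears after reading them. */
		cf->backupCheckPoint = 0;
		cf->backupStartPointTLI = 0;
		cf->backupRecoveryRequired = false;
		cf->backupFromStandby = false;
	}
	else
	{
		/* Assumed default path: start from the latest checkpoint. */
		start.checkPointLoc = cf->checkPoint;
		start.checkPointTLI = cf->checkPointTLI;
	}

	return start;
}
```

Calling demo_choose_recovery_start() twice on the same structure takes the backup path only the first time. In the patch the cleared fields only take effect on disk once the control file is written back, so the sketch models the in-memory behaviour only.
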
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index f216b588422..a4bd79447fd 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -22,6 +22,7 @@
 #include "backup/basebackup.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
 #include "common/file_perm.h"
@@ -192,10 +193,9 @@ static const struct exclude_list_item excludeFiles[] =
 	{RELCACHE_INIT_FILENAME, true},
 
 	/*
-	 * backup_label and tablespace_map should not exist in a running cluster
-	 * capable of doing an online backup, but exclude them just in case.
+	 * tablespace_map should not exist in a running cluster capable of doing
+	 * an online backup, but exclude it just in case.
 	 */
-	{BACKUP_LABEL_FILE, false},
 	{TABLESPACE_MAP, false},
 
 	/*
@@ -325,19 +325,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
-				char	   *backup_label;
 
 				bbsink_begin_archive(sink, "base.tar");
 
-				/* In the main tar, include the backup_label first... */
-				backup_label = build_backup_content(backup_state, false);
-				sendFileWithContent(sink, BACKUP_LABEL_FILE,
-									backup_label, -1, &manifest);
-				pfree(backup_label);
-
-				/* Then the tablespace_map file, if required... */
+				/* Send the tablespace_map file, if required... */
 				if (opt->sendtblspcmapfile)
 				{
 					sendFileWithContent(sink, TABLESPACE_MAP,
@@ -349,14 +341,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 				sendDir(sink, ".", 1, false, state.tablespaces,
 						sendtblspclinks, &manifest, InvalidOid);
 
-				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid, &manifest);
+				/* End the backup before sending pg_control */
+				basebackup_progress_wait_wal_archive(&state);
+				do_pg_backup_stop(backup_state, !opt->nowait);
+
+				/* Send copy of pg_control containing recovery info */
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *) backup_state->controlFile,
+									PG_CONTROL_MAX_SAFE_SIZE, &manifest);
 			}
 			else
 			{
@@ -390,9 +382,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			}
 		}
 
-		basebackup_progress_wait_wal_archive(&state);
-		do_pg_backup_stop(backup_state, !opt->nowait);
-
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 35d738d5763..24bf34b45eb 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -384,13 +384,15 @@ BEGIN ATOMIC
 END;
 
 CREATE OR REPLACE FUNCTION
-  pg_backup_start(label text, fast boolean DEFAULT false)
-  RETURNS pg_lsn STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
+  pg_backup_start(label text, fast boolean DEFAULT false, OUT lsn pg_lsn,
+        OUT timeline_id int8, OUT start timestamptz)
+  RETURNS record STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
   PARALLEL RESTRICTED;
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
-        wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        wait_for_archive boolean DEFAULT true, OUT pg_control_file bytea,
+        OUT tablespace_map_file text, OUT lsn pg_lsn, OUT timeline_id int8,
+        OUT stop timestamptz)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b4..c655cb03352 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -171,8 +171,8 @@ SKIP:
 
 # Write some files to test that they are not copied.
 foreach my $filename (
-	qw(backup_label tablespace_map postgresql.auto.conf.tmp
-	current_logfiles.tmp global/pg_internal.init.123))
+	qw(tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp
+	   global/pg_internal.init.123))
 {
 	open my $file, '>>', "$pgdata/$filename";
 	print $file "DONOTCOPY";
@@ -261,14 +261,13 @@ foreach my $filename (@tempRelationFiles)
 		"base/$postgresOid/$filename not copied");
 }
 
-# Make sure existing backup_label was ignored.
-isnt(slurp_file("$tempdir/backup/backup_label"),
-	'DONOTCOPY', 'existing backup_label not copied');
+# Make sure existing tablespace_map was ignored.
+ok(!-f "$tempdir/backup/tablespace_map", 'tablespace_map not in backup');
 rmtree("$tempdir/backup");
 
-# Now delete the bogus backup_label file since it will interfere with startup
-unlink("$pgdata/backup_label")
-  or BAIL_OUT("unable to unlink $pgdata/backup_label");
+# Now delete the bogus tablespace_map file since it will interfere with startup
+unlink("$pgdata/tablespace_map")
+  or BAIL_OUT("unable to unlink $pgdata/tablespace_map");
 
 $node->command_ok(
 	[
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93e0837947c..cc515b622ff 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -277,10 +277,18 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
+	printf(_("Backup checkpoint location:           %X/%X\n"),
+		   LSN_FORMAT_ARGS(ControlFile->backupCheckPoint));
 	printf(_("Backup start location:                %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupStartPoint));
+	printf(_("Backup start location's timeline:     %u\n"),
+		   ControlFile->backupStartPointTLI);
 	printf(_("Backup end location:                  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
+	printf(_("Backup recovery required:             %s\n"),
+		   ControlFile->backupRecoveryRequired ? _("yes") : _("no"));
+	printf(_("Backup from standby:                  %s\n"),
+		   ControlFile->backupFromStandby ? _("yes") : _("no"));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df2..255101ff3a1 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -870,8 +870,12 @@ RewriteControlFile(void)
 	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
 	ControlFile.minRecoveryPoint = 0;
 	ControlFile.minRecoveryPointTLI = 0;
+	ControlFile.backupCheckPoint = 0;
 	ControlFile.backupStartPoint = 0;
+	ControlFile.backupStartPointTLI = 0;
 	ControlFile.backupEndPoint = 0;
+	ControlFile.backupRecoveryRequired = false;
+	ControlFile.backupFromStandby = false;
 	ControlFile.backupEndRequired = false;
 
 	/*
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ecadd69dc53..213f4e71b88 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -139,11 +139,10 @@ static const struct exclude_list_item excludeFiles[] =
 	{"pg_internal.init", true}, /* defined as RELCACHE_INIT_FILENAME */
 
 	/*
-	 * If there is a backup_label or tablespace_map file, it indicates that a
-	 * recovery failed and this cluster probably can't be rewound, but exclude
-	 * them anyway if they are found.
+	 * If there is a tablespace_map file, it indicates that a recovery failed
+	 * and this cluster probably can't be rewound, but exclude it anyway if it
+	 * is found.
 	 */
-	{"backup_label", false},	/* defined as BACKUP_LABEL_FILE */
 	{"tablespace_map", false},	/* defined as TABLESPACE_MAP */
 
 	/*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index bfd44a284e2..f42782e2eab 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -39,9 +39,6 @@ static void perform_rewind(filemap_t *filemap, rewind_source *source,
 						   TimeLineID chkpttli,
 						   XLogRecPtr chkptredo);
 
-static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
-							  XLogRecPtr checkpointloc);
-
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
@@ -654,7 +651,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		pg_log_info("creating backup label and updating control file");
 
 	/*
-	 * Create a backup label file, to tell the target where to begin the WAL
+	 * Get recovery fields to tell the target where to begin the WAL
 	 * replay. Normally, from the last common checkpoint between the source
 	 * and the target. But if the source is a standby server, it's possible
 	 * that the last common checkpoint is *after* the standby's restartpoint.
@@ -672,7 +669,6 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		chkpttli = ControlFile_source.checkPointCopy.ThisTimeLineID;
 		chkptrec = ControlFile_source.checkPoint;
 	}
-	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
 	/*
 	 * Update control file of target, to tell the target how far it must
@@ -722,6 +718,12 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupRecoveryRequired = true;
+	ControlFile_new.backupFromStandby = true;
+	ControlFile_new.backupEndRequired = false;
+	ControlFile_new.backupCheckPoint = chkptrec;
+	ControlFile_new.backupStartPoint = chkptredo;
+	ControlFile_new.backupStartPointTLI = chkpttli;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
@@ -729,7 +731,10 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 static void
 sanityChecks(void)
 {
-	/* TODO Check that there's no backup_label in either cluster */
+	/*
+	 * TODO Check that neither cluster has backupRecoveryRequired set in
+	 * pg_control.
+	 */
 
 	/* Check system_identifier match */
 	if (ControlFile_target.system_identifier != ControlFile_source.system_identifier)
@@ -951,51 +956,6 @@ findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, int a_nentries,
 	}
 }
 
-
-/*
- * Create a backup_label file that forces recovery to begin at the last common
- * checkpoint.
- */
-static void
-createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpointloc)
-{
-	XLogSegNo	startsegno;
-	time_t		stamp_time;
-	char		strfbuf[128];
-	char		xlogfilename[MAXFNAMELEN];
-	struct tm  *tmp;
-	char		buf[1000];
-	int			len;
-
-	XLByteToSeg(startpoint, startsegno, WalSegSz);
-	XLogFileName(xlogfilename, starttli, startsegno, WalSegSz);
-
-	/*
-	 * Construct backup label file
-	 */
-	stamp_time = time(NULL);
-	tmp = localtime(&stamp_time);
-	strftime(strfbuf, sizeof(strfbuf), "%Y-%m-%d %H:%M:%S %Z", tmp);
-
-	len = snprintf(buf, sizeof(buf),
-				   "START WAL LOCATION: %X/%X (file %s)\n"
-				   "CHECKPOINT LOCATION: %X/%X\n"
-				   "BACKUP METHOD: pg_rewind\n"
-				   "BACKUP FROM: standby\n"
-				   "START TIME: %s\n",
-	/* omit LABEL: line */
-				   LSN_FORMAT_ARGS(startpoint), xlogfilename,
-				   LSN_FORMAT_ARGS(checkpointloc),
-				   strfbuf);
-	if (len >= sizeof(buf))
-		pg_fatal("backup label buffer too small");	/* shouldn't happen */
-
-	/* TODO: move old file out of the way, if any. */
-	open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
-	write_target_range(buf, 0, len);
-	close_target_file();
-}
-
 /*
  * Check CRC of control file
  */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164f..3aac6839a70 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -293,8 +293,6 @@ extern SessionBackupState get_backup_status(void);
 /* File path names (all relative to $PGDATA) */
 #define RECOVERY_SIGNAL_FILE	"recovery.signal"
 #define STANDBY_SIGNAL_FILE		"standby.signal"
-#define BACKUP_LABEL_FILE		"backup_label"
-#define BACKUP_LABEL_OLD		"backup_label.old"
 
 #define TABLESPACE_MAP			"tablespace_map"
 #define TABLESPACE_MAP_OLD		"tablespace_map.old"
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137b..f2c3672fed6 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -15,6 +15,7 @@
 #define XLOG_BACKUP_H
 
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "pgtime.h"
 
 /* Structure to hold backup state. */
@@ -33,9 +34,18 @@ typedef struct BackupState
 	XLogRecPtr	stoppoint;		/* backup stop WAL location */
 	TimeLineID	stoptli;		/* backup stop TLI */
 	pg_time_t	stoptime;		/* backup stop time */
+
+	/*
+	 * After pg_backup_stop() returns this field will contain a copy of
+	 * pg_control that should be stored with the backup. Fields have been
+	 * updated for recovery and the CRC has been recalculated. The buffer
+	 * is padded to PG_CONTROL_MAX_SAFE_SIZE so that pg_control is always
+	 * a consistent size but smaller (and hopefully easier to handle) than
+	 * PG_CONTROL_FILE_SIZE. Bytes after sizeof(ControlFileData) are zeroed.
+	 */
+	uint8_t controlFile[PG_CONTROL_MAX_SAFE_SIZE];
 } BackupState;
 
-extern char *build_backup_content(BackupState *state,
-								  bool ishistoryfile);
+extern char *build_backup_history_content(BackupState *state);
 
 #endif							/* XLOG_BACKUP_H */
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index ee0bc742782..981266f7340 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -80,8 +80,7 @@ extern Size XLogRecoveryShmemSize(void);
 extern void XLogRecoveryShmemInit(void);
 
 extern void InitWalRecovery(ControlFileData *ControlFile,
-							bool *wasShutdown_ptr, bool *haveBackupLabel_ptr,
-							bool *haveTblspcMap_ptr);
+							bool *wasShutdown_ptr, bool *haveTblspcMap_ptr);
 extern void PerformWalRecovery(void);
 
 /*
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 2ae72e3b266..8144c972ec1 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * backupCheckPoint is the backup start checkpoint and is set to zero after
+	 * recovery is initialized.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -160,14 +163,27 @@ typedef struct ControlFileData
 	 * pg_control which was backed up last. It is reset to zero when the end
 	 * of backup is reached, and we mustn't start up before that.
 	 *
+	 * backupRecoveryRequired indicates that the pg_control file was provided
+	 * by a backup or pg_rewind and recovery settings need to be copied. It will
+	 * be set to false when the settings have been copied.
+	 *
+	 * backupFromStandby indicates that the backup was taken on a standby. It is
+	 * required to initialize recovery and set to false afterwards.
+	 *
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
-	 * start up.
+	 * start up. Currently backupEndRequired should only be false if recovery
+	 * settings were configured by pg_rewind, which does not require an end
+	 * point.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogRecPtr	backupCheckPoint;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
+	TimeLineID	backupStartPointTLI;
+	bool 		backupRecoveryRequired;
+	bool 		backupFromStandby;
 	bool		backupEndRequired;
 
 	/*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f14aed422a7..cc8156c57e7 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6413,13 +6413,17 @@
   prosrc => 'pg_terminate_backend' },
 { oid => '2172', descr => 'prepare for taking an online backup',
   proname => 'pg_backup_start', provolatile => 'v', proparallel => 'r',
-  prorettype => 'pg_lsn', proargtypes => 'text bool',
+  prorettype => 'record', proargtypes => 'text bool',
+  proallargtypes => '{text,bool,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,i,o,o,o}',
+  proargnames => '{label,fast,lsn,timeline_id,start}',
   prosrc => 'pg_backup_start' },
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,bytea,text,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{wait_for_archive,pg_control_file,tablespace_map_file,lsn,timeline_id,stop}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
-- 
2.34.1

#14

michael@paquier.xyz

about 2 years ago

In reply to: David Steele (#13)

Re: Add recovery to pg_control and remove backup_label

On Fri, Nov 10, 2023 at 02:55:19PM -0400, David Steele wrote:

On 11/10/23 00:37, Michael Paquier wrote:

I've done a few more dozen runs, and still nothing. I am wondering
what this disturbance was.

OK, hopefully it was just a blip.

Still nothing on this side. So that seems really like a random blip
in the matrix.

This has been split out.

Thanks, applied 0001.

The term "backup recovery" is one that we've never used in the tree
until now, as far as I know. Perhaps this recovery method should just be
referred to as "recovery from backup"?

Well, "backup recovery" is less awkward, I think. For instance "backup
recovery field" vs "recovery from backup field".

Not sure. I've never used this term when referring to recovery from a
backup. Perhaps I'm just not used to it, still that sounds a bit
confusing here.

Something in this area is that backupRecoveryRequired is the switch
controlling whether the fields are set by the recovery initialization. Could
it be actually useful to leave the other fields as they are and only
reset backupRecoveryRequired before the first control file update?
This would leave a trace of the backup history directly in the control
file.
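
To make the two options concrete, here is a minimal sketch that is not taken from the patch. The struct is a hypothetical stand-in that holds only the backup-related fields named in this thread, with simplified types (the real definitions live in ControlFileData), and both function names are invented for illustration. It contrasts clearing every field with clearing only backupRecoveryRequired:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the backup-related part of ControlFileData. */
typedef struct BackupFieldsSketch
{
    uint64_t    backupCheckPoint;
    uint64_t    backupStartPoint;
    uint64_t    backupEndPoint;
    uint32_t    backupStartPointTLI;
    bool        backupRecoveryRequired;
    bool        backupFromStandby;
    bool        backupEndRequired;
} BackupFieldsSketch;

/* Reset-everything style: no trace of the backup remains afterwards. */
void
reset_all_backup_fields(BackupFieldsSketch *cf)
{
    assert(cf != NULL);

    cf->backupCheckPoint = 0;
    cf->backupStartPoint = 0;
    cf->backupEndPoint = 0;
    cf->backupStartPointTLI = 0;
    cf->backupRecoveryRequired = false;
    cf->backupFromStandby = false;
    cf->backupEndRequired = false;
}

/* Alternative: clear only the switch and keep the other fields as a trace. */
void
reset_recovery_switch_only(BackupFieldsSketch *cf)
{
    assert(cf != NULL);

    cf->backupRecoveryRequired = false;
}
```

With the second variant, the LSNs and timeline would stay readable after recovery, and only the switch would tell recovery whether to act on them.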

Since the other recovery fields are cleared in ReachedEndOfBackup() this
would be a change from what we do now.

None of these fields are ever visible (with the exception of
minRecoveryPoint/TLI) since they are reset when the database becomes
consistent and before logons are allowed. Viewing them with pg_controldata
makes sense, but I'm not sure pg_control_recovery() does.

In fact, are backup_start_lsn, backup_end_lsn, and
end_of_backup_record_required ever non-zero when logged onto Postgres? Maybe
I'm missing something?

Yeah, but custom backup/restore tools may want to manipulate the contents
of the control file for their own work, so at least for the sake of
visibility it sounds important to me to show all the information at
hand, and that there is no need to.

-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
+    The backup history file (which is archived like WAL) includes the label
+    string you gave to <function>pg_backup_start</function>,
     as well as the time at which <function>pg_backup_start</function> was run, and
     the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
+    possible to look inside a backup history file and determine exactly which

As a side note, it is a bit disappointing that we lose the backup
label from the backup itself, even if the patch correctly changes the
documentation to reflect the new behavior. It is in the backup
history file on the node from where the base backup has been taken or
in the archives, hopefully. However there is nothing that remains in
the base backup itself, and backups can be self-contained (easy with
pg_basebackup --wal-method=stream). I think that we should retain a
minimum amount of information as a replacement for the backup_label,
at least. With this state, the current patch slightly reduces the
debuggability of deployments. That could be annoying for some users.

New patches attached based on eb81e8e790.

Diving into the code for references about the backup label file, I
have spotted this log in pg_rewind that is now incorrect:
if (showprogress)
pg_log_info("creating backup label and updating control file");

+    printf(_("Backup start location's timeline:     %u\n"),
+           ControlFile->backupStartPointTLI);
     printf(_("Backup end location:                  %X/%X\n"),
            LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
Perhaps these two should be reversed to match with the header file.

+    /*
+     * After pg_backup_stop() returns this field will contain a copy of
+     * pg_control that should be stored with the backup. Fields have been
+     * updated for recovery and the CRC has been recalculated. The buffer
+     * is padded to PG_CONTROL_MAX_SAFE_SIZE so that pg_control is always
+     * a consistent size but smaller (and hopefully easier to handle) than
+     * PG_CONTROL_FILE_SIZE. Bytes after sizeof(ControlFileData) are zeroed.
+     */
+    uint8_t controlFile[PG_CONTROL_MAX_SAFE_SIZE];

I don't mind the addition of a control file with the max safe size,
because it will never be higher than that. However:

+                /* End the backup before sending pg_control */
+                basebackup_progress_wait_wal_archive(&state);
+                do_pg_backup_stop(backup_state, !opt->nowait);
+
+                /* Send copy of pg_control containing recovery info */
+                sendFileWithContent(sink, XLOG_CONTROL_FILE,
+                                    (char *)backup_state->controlFile,
+                                    PG_CONTROL_MAX_SAFE_SIZE, &manifest);

It seems to me that the base backup protocol should always send an 8k
file for the control file so that we maintain consistency with the
on-disk format. Currently, a base backup taken with this patch
results in a control file of size 512B.
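
The size concern can be illustrated with a minimal sketch, which is not part of the patch. The helper name pad_control_file is hypothetical; the two constants mirror the values defined in src/include/catalog/pg_control.h. The idea is simply to copy the meaningful bytes and zero-fill the rest up to the on-disk size:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in src/include/catalog/pg_control.h. */
#define PG_CONTROL_MAX_SAFE_SIZE    512
#define PG_CONTROL_FILE_SIZE        8192

/*
 * Hypothetical helper: copy up to PG_CONTROL_MAX_SAFE_SIZE bytes of control
 * data into a buffer of the full on-disk size and zero the remainder.
 */
void
pad_control_file(uint8_t out[PG_CONTROL_FILE_SIZE],
                 const uint8_t *data, size_t len)
{
    assert(out != NULL && data != NULL);
    assert(len <= PG_CONTROL_MAX_SAFE_SIZE);

    memset(out, 0, PG_CONTROL_FILE_SIZE);
    memcpy(out, data, len);
}
```

A buffer produced this way would have the same size as the pg_control written by the server, so tools would not need to special-case the file that comes out of a backup.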

+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_MAX_SAFE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_MAX_SAFE_SIZE);

Similar comment for the control file returned by pg_backup_stop(),
which could just be made an 8k field?

+     <function>pg_backup_stop</function> returns the
+     <filename>pg_control</filename> file, which must be stored in the
+     <filename>global</filename> directory of the backup. It also returns the

And perhaps emphasize that this file should be an 8kB file in the
paragraph mentioning the data returned by pg_backup_stop()?

-      Create a <filename>backup_label</filename> file to begin WAL replay at
+      Update <filename>pg_control</filename> file to begin WAL replay at
       the checkpoint created at failover and configure the
       <filename>pg_control</filename> file with a minimum consistency LSN

pg_control is mentioned twice, so perhaps this could be worded better?

It is important not to forget about PG_CONTROL_VERSION. Perhaps this
should be noted somewhere, or just changed in the patch itself.
Contrary to catalog changes, we do few of these in the control file so
there is close to zero risk of conflicts with other patches in the CF
app.
--
Michael

#15

michael@paquier.xyz

about 2 years ago

In reply to: David Steele (#10)

2 attachment(s)

Re: Add recovery to pg_control and remove backup_label

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5. Now added back in CC with the two latest patches
you've proposed attached.)

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create, show in the fields
returned by the SQL functions, or stream control files with a size of
512B, and should just stick to 8kB. If this is worth changing, it should
be applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery. I've suggested to not do that to keep a trace of what
was happening during recovery. The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good. Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups. The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do in the end to not
lose any data related to the removal of the backup_label file.
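
As a rough illustration of the time fields listed above, here is a minimal sketch in standard C. It is not the patch's code: the patch uses pg_strftime() and pg_gmtime() with a %Z format, whereas this sketch uses gmtime() and strftime() with a literal "GMT" suffix, and the function name format_time_range is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/*
 * Hypothetical helper: write a "Time-Range" JSON fragment into buf.
 * Returns the number of characters written, or -1 on failure.
 */
int
format_time_range(char *buf, size_t buflen, time_t start, time_t end)
{
    char        startbuf[32];
    char        endbuf[32];
    struct tm  *tm;
    int         n;

    assert(buf != NULL);

    /*
     * gmtime() returns a pointer to static storage, so format each value
     * before converting the next one.
     */
    tm = gmtime(&start);
    if (tm == NULL ||
        strftime(startbuf, sizeof(startbuf), "%Y-%m-%d %H:%M:%S GMT", tm) == 0)
        return -1;

    tm = gmtime(&end);
    if (tm == NULL ||
        strftime(endbuf, sizeof(endbuf), "%Y-%m-%d %H:%M:%S GMT", tm) == 0)
        return -1;

    n = snprintf(buf, buflen,
                 "\"Time-Range\": { \"Start-Time\": \"%s\", \"End-Time\": \"%s\" }",
                 startbuf, endbuf);
    if (n < 0 || (size_t) n >= buflen)
        return -1;              /* output error or truncated */

    return n;
}
```

Always formatting in GMT follows the reasoning in the patch comment: the time zone to use is otherwise unclear, and zone definitions can change.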

On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to
pg_verifybackup is not really needed because it is used nowhere in the
validation process, so keeping the logic simpler would be the way to
go IMO. This is contrary to the WAL range for example, where start
and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?

I'm sure Robert will comment on this when he gets the time, but for now I
have backed off on passing the new info to pg_verifybackup and added
start/stop time.

FWIW, I'm OK with the bits for the backup manifest as presented. So
if there are no remarks and/or no objections, I'd like to apply it, but
let's give some room to others to comment on that as there's been a gap
in the emails exchanged on pgsql-hackers. I hope that the summary
I've posted above covers everything. So let's see about doing
something around the middle of next week. With Thanksgiving in the
US, a lot of folks will not have the time to monitor what's happening
on this thread.

+      The end time for the backup. This is when the backup was stopped in
+      <productname>PostgreSQL</productname> and represents the earliest time
+      that can be used for time-based Point-In-Time Recovery.

This one is actually a very good point. With the backup_label file gone,
we'd lose this capability without the end timestamps in the control
file.

New patches attached based on b218fbb7.

I've noticed on the other thread the remark about being less
aggressive with the fields related to recovery in the control file, so
I assume that this patch should leave the fields be after the end of
recovery from the start and only rely on backupRecoveryRequired to
decide if the recovery should use the fields or not:
/messages/by-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@=pgmasters.net

+	ControlFile->backupCheckPoint = InvalidXLogRecPtr;
 	ControlFile->backupStartPoint = InvalidXLogRecPtr;
+	ControlFile->backupStartPointTLI = 0;
 	ControlFile->backupEndPoint = InvalidXLogRecPtr;
+	ControlFile->backupFromStandby = false;
 	ControlFile->backupEndRequired = false;

Still, I get the temptation of being consistent with the current style
on HEAD to reset everything, as well.
--
Michael

Attachments:

recovery-in-pgcontrol-v7-0001-add-info-to-manifest.patch (text/x-diff; charset=us-ascii)

From 97bb113b5bf5427449b748c3ee25b647a2c5fef5 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Sun, 19 Nov 2023 16:54:36 +0000
Subject: Add label and start/stop time to backup manifest.

Add label passed by the user to pg_basebackup and backup start/stop time to
the backup_manifest file. Currently these fields are purely for informational
purposes.
---
 doc/src/sgml/backup-manifest.sgml        | 51 +++++++++++++++++++++
 src/backend/backup/backup_manifest.c     | 56 +++++++++++++++++++++++-
 src/backend/backup/basebackup.c          |  9 +++-
 src/bin/pg_verifybackup/parse_manifest.c | 40 +++++++++++++++++
 src/include/backup/backup_manifest.h     |  5 ++-
 5 files changed, 157 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/backup-manifest.sgml b/doc/src/sgml/backup-manifest.sgml
index 771be1310a..a80b79e587 100644
--- a/doc/src/sgml/backup-manifest.sgml
+++ b/doc/src/sgml/backup-manifest.sgml
@@ -42,6 +42,16 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>Backup-Label</literal></term>
+    <listitem>
+     <para>
+      Backup label specified by the user. This will be set to a default value
+      if no backup label was specified.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>Files</literal></term>
     <listitem>
@@ -67,6 +77,17 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term><literal>Time-Range</literal></term>
+    <listitem>
+     <para>
+      The associated value is always an object describing the start/stop time
+      of the backup. The structure of this object is further described in
+      <xref linkend="backup-manifest-time-range" />.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term><literal>Manifest-Checksum</literal></term>
     <listitem>
@@ -213,4 +234,34 @@
    for the same timeline.
   </para>
  </sect1>
+
+ <sect1 id="backup-manifest-time-range">
+  <title>Backup Manifest Time Range Object</title>
+
+  <para>
+   The object which describes the time range always has two keys:
+  </para>
+
+  <variablelist>
+   <varlistentry>
+    <term><literal>Start-Time</literal></term>
+    <listitem>
+     <para>
+      The start time for the backup.
+     </para>
+    </listitem>
+   </varlistentry>
+
+   <varlistentry>
+    <term><literal>End-Time</literal></term>
+    <listitem>
+     <para>
+      The end time for the backup. This is when the backup was stopped in
+      <productname>PostgreSQL</productname> and represents the earliest time
+      that can be used for time-based Point-In-Time Recovery.
+     </para>
+    </listitem>
+   </varlistentry>
+  </variablelist>
+ </sect1>
 </chapter>
diff --git a/src/backend/backup/backup_manifest.c b/src/backend/backup/backup_manifest.c
index aeed362a9a..9540f05964 100644
--- a/src/backend/backup/backup_manifest.c
+++ b/src/backend/backup/backup_manifest.c
@@ -56,7 +56,8 @@ IsManifestEnabled(backup_manifest_info *manifest)
 void
 InitializeBackupManifest(backup_manifest_info *manifest,
 						 backup_manifest_option want_manifest,
-						 pg_checksum_type manifest_checksum_type)
+						 pg_checksum_type manifest_checksum_type,
+						 const char *label)
 {
 	memset(manifest, 0, sizeof(backup_manifest_info));
 	manifest->checksum_type = manifest_checksum_type;
@@ -78,9 +79,21 @@ InitializeBackupManifest(backup_manifest_info *manifest,
 	manifest->still_checksumming = true;
 
 	if (want_manifest != MANIFEST_OPTION_NO)
+	{
+		StringInfoData buf;
+
 		AppendToManifest(manifest,
 						 "{ \"PostgreSQL-Backup-Manifest-Version\": 1,\n"
-						 "\"Files\": [");
+						 "\"Backup-Label\": ");
+
+		/* JSON encode label and add it to manifest */
+		initStringInfo(&buf);
+		escape_json(&buf, label);
+		AppendStringToManifest(manifest, buf.data);
+		pfree(buf.data);
+
+		AppendToManifest(manifest, ",\n\"Files\": [");
+	}
 }
 
 /*
@@ -308,6 +321,45 @@ AddWALInfoToBackupManifest(backup_manifest_info *manifest, XLogRecPtr startptr,
 	AppendStringToManifest(manifest, "\n],\n");
 }
 
+/*
+ * Add backup start/end time information to the manifest.
+ */
+void
+AddTimeInfoToBackupManifest(backup_manifest_info *manifest, pg_time_t starttime,
+							pg_time_t endtime)
+{
+	StringInfoData buf;
+
+	if (!IsManifestEnabled(manifest))
+		return;
+
+	/* Start the time range. */
+	AppendStringToManifest(manifest, "\"Time-Range\": { ");
+
+	/*
+	 * Convert start/end time to strings and append them to the manifest. Since
+	 * it's not clear what time zone to use and since time zone definitions can
+	 * change, possibly causing confusion, use GMT always.
+	 */
+	initStringInfo(&buf);
+
+	appendStringInfoString(&buf, "\"Start-Time\": \"");
+	enlargeStringInfo(&buf, 128);
+	buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z",
+						   pg_gmtime(&starttime));
+	appendStringInfoString(&buf, "\", \"End-Time\": \"");
+	enlargeStringInfo(&buf, 128);
+	buf.len += pg_strftime(&buf.data[buf.len], 128, "%Y-%m-%d %H:%M:%S %Z",
+						   pg_gmtime(&endtime));
+	appendStringInfoString(&buf, "\" },\n");
+
+	/* Add to the manifest. */
+	AppendStringToManifest(manifest, buf.data);
+
+	/* Avoid leaking memory. */
+	pfree(buf.data);
+}
+
 /*
  * Finalize the backup manifest, and send it to the client.
  */
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 35dd79babc..c7b3ba3e6e 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -225,6 +225,8 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 	bbsink_state state;
 	XLogRecPtr	endptr;
 	TimeLineID	endtli;
+	pg_time_t	starttime;
+	pg_time_t	stoptime;
 	backup_manifest_info manifest;
 	BackupState *backup_state;
 	StringInfo	tablespace_map;
@@ -243,7 +245,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 	backup_started_in_recovery = RecoveryInProgress();
 
 	InitializeBackupManifest(&manifest, opt->manifest,
-							 opt->manifest_checksum_type);
+							 opt->manifest_checksum_type, opt->label);
 
 	total_checksum_failures = 0;
 
@@ -380,6 +382,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
+		/* Record start/stop time for manifest */
+		starttime = backup_state->starttime;
+		stoptime = backup_state->stoptime;
+
 		/* Deallocate backup-related variables. */
 		pfree(tablespace_map->data);
 		pfree(tablespace_map);
@@ -629,6 +635,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 	AddWALInfoToBackupManifest(&manifest, state.startptr, state.starttli,
 							   endptr, endtli);
+	AddTimeInfoToBackupManifest(&manifest, starttime, stoptime);
 
 	SendBackupManifest(&manifest, sink);
 
diff --git a/src/bin/pg_verifybackup/parse_manifest.c b/src/bin/pg_verifybackup/parse_manifest.c
index bf0227c668..408af88e58 100644
--- a/src/bin/pg_verifybackup/parse_manifest.c
+++ b/src/bin/pg_verifybackup/parse_manifest.c
@@ -25,6 +25,7 @@ typedef enum
 	JM_EXPECT_TOPLEVEL_END,
 	JM_EXPECT_TOPLEVEL_FIELD,
 	JM_EXPECT_VERSION_VALUE,
+	JM_EXPECT_BACKUP_LABEL_VALUE,
 	JM_EXPECT_FILES_START,
 	JM_EXPECT_FILES_NEXT,
 	JM_EXPECT_THIS_FILE_FIELD,
@@ -33,6 +34,9 @@ typedef enum
 	JM_EXPECT_WAL_RANGES_NEXT,
 	JM_EXPECT_THIS_WAL_RANGE_FIELD,
 	JM_EXPECT_THIS_WAL_RANGE_VALUE,
+	JM_EXPECT_TIME_RANGE_START,
+	JM_EXPECT_THIS_TIME_RANGE_FIELD,
+	JM_EXPECT_THIS_TIME_RANGE_VALUE,
 	JM_EXPECT_MANIFEST_CHECKSUM_VALUE,
 	JM_EXPECT_EOF,
 } JsonManifestSemanticState;
@@ -188,6 +192,9 @@ json_manifest_object_start(void *state)
 			parse->start_lsn = NULL;
 			parse->end_lsn = NULL;
 			break;
+		case JM_EXPECT_TIME_RANGE_START:
+			parse->state = JM_EXPECT_THIS_TIME_RANGE_FIELD;
+			break;
 		default:
 			json_manifest_parse_failure(parse->context,
 										"unexpected object start");
@@ -223,6 +230,9 @@ json_manifest_object_end(void *state)
 			json_manifest_finalize_wal_range(parse);
 			parse->state = JM_EXPECT_WAL_RANGES_NEXT;
 			break;
+		case JM_EXPECT_THIS_TIME_RANGE_FIELD:
+			parse->state = JM_EXPECT_TOPLEVEL_FIELD;
+			break;
 		default:
 			json_manifest_parse_failure(parse->context,
 										"unexpected object end");
@@ -312,6 +322,13 @@ json_manifest_object_field_start(void *state, char *fname, bool isnull)
 				break;
 			}
 
+			/* Is this the backup label? */
+			if (strcmp(fname, "Backup-Label") == 0)
+			{
+				parse->state = JM_EXPECT_BACKUP_LABEL_VALUE;
+				break;
+			}
+
 			/* Is this the list of files? */
 			if (strcmp(fname, "Files") == 0)
 			{
@@ -326,6 +343,13 @@ json_manifest_object_field_start(void *state, char *fname, bool isnull)
 				break;
 			}
 
+			/* Is this the time range? */
+			if (strcmp(fname, "Time-Range") == 0)
+			{
+				parse->state = JM_EXPECT_TIME_RANGE_START;
+				break;
+			}
+
 			/* Is this the manifest checksum? */
 			if (strcmp(fname, "Manifest-Checksum") == 0)
 			{
@@ -372,6 +396,14 @@ json_manifest_object_field_start(void *state, char *fname, bool isnull)
 			parse->state = JM_EXPECT_THIS_WAL_RANGE_VALUE;
 			break;
 
+		case JM_EXPECT_THIS_TIME_RANGE_FIELD:
+			if (strcmp(fname, "Start-Time") != 0 &&
+				strcmp(fname, "End-Time") != 0)
+				json_manifest_parse_failure(parse->context,
+											"unexpected time range field");
+			parse->state = JM_EXPECT_THIS_TIME_RANGE_VALUE;
+			break;
+
 		default:
 			json_manifest_parse_failure(parse->context,
 										"unexpected object field");
@@ -410,6 +442,10 @@ json_manifest_scalar(void *state, char *token, JsonTokenType tokentype)
 			parse->state = JM_EXPECT_TOPLEVEL_FIELD;
 			break;
 
+		case JM_EXPECT_BACKUP_LABEL_VALUE:
+			parse->state = JM_EXPECT_TOPLEVEL_FIELD;
+			break;
+
 		case JM_EXPECT_THIS_FILE_VALUE:
 			switch (parse->file_field)
 			{
@@ -451,6 +487,10 @@ json_manifest_scalar(void *state, char *token, JsonTokenType tokentype)
 			parse->state = JM_EXPECT_THIS_WAL_RANGE_FIELD;
 			break;
 
+		case JM_EXPECT_THIS_TIME_RANGE_VALUE:
+			parse->state = JM_EXPECT_THIS_TIME_RANGE_FIELD;
+			break;
+
 		case JM_EXPECT_MANIFEST_CHECKSUM_VALUE:
 			parse->state = JM_EXPECT_TOPLEVEL_END;
 			parse->manifest_checksum = token;
diff --git a/src/include/backup/backup_manifest.h b/src/include/backup/backup_manifest.h
index bd7067ae42..4ab1291bba 100644
--- a/src/include/backup/backup_manifest.h
+++ b/src/include/backup/backup_manifest.h
@@ -37,7 +37,8 @@ typedef struct backup_manifest_info
 
 extern void InitializeBackupManifest(backup_manifest_info *manifest,
 									 backup_manifest_option want_manifest,
-									 pg_checksum_type manifest_checksum_type);
+									 pg_checksum_type manifest_checksum_type,
+									 const char *label);
 extern void AddFileToBackupManifest(backup_manifest_info *manifest,
 									Oid spcoid,
 									const char *pathname, size_t size,
@@ -47,6 +48,8 @@ extern void AddWALInfoToBackupManifest(backup_manifest_info *manifest,
 									   XLogRecPtr startptr,
 									   TimeLineID starttli, XLogRecPtr endptr,
 									   TimeLineID endtli);
+extern void AddTimeInfoToBackupManifest(backup_manifest_info *manifest,
+										pg_time_t starttime, pg_time_t endtime);
 
 extern void SendBackupManifest(backup_manifest_info *manifest, bbsink *sink);
 extern void FreeBackupManifest(backup_manifest_info *manifest);
-- 
2.34.1
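
Note on the manifest parser hunks above: the Time-Range handling can be read as a small state machine. What follows is a minimal standalone sketch of those transitions only. The names (TimeRangeState, tr_field_start, tr_object_start, tr_scalar, tr_object_end, TR_FAILED) are hypothetical stand-ins, not identifiers from the patch. The real parser runs inside the JSON semantic-action callbacks and reports errors through json_manifest_parse_failure().

```c
#include <string.h>

/* Hypothetical, simplified states mirroring the JM_EXPECT_* names. */
typedef enum
{
    TR_EXPECT_TOPLEVEL_FIELD,
    TR_EXPECT_TIME_RANGE_START,
    TR_EXPECT_THIS_TIME_RANGE_FIELD,
    TR_EXPECT_THIS_TIME_RANGE_VALUE,
    TR_FAILED                   /* stands in for a parse failure */
} TimeRangeState;

/* An object field name was seen. */
static TimeRangeState
tr_field_start(TimeRangeState state, const char *fname)
{
    switch (state)
    {
        case TR_EXPECT_TOPLEVEL_FIELD:
            if (strcmp(fname, "Time-Range") == 0)
                return TR_EXPECT_TIME_RANGE_START;
            return TR_FAILED;   /* other top-level fields omitted here */
        case TR_EXPECT_THIS_TIME_RANGE_FIELD:
            if (strcmp(fname, "Start-Time") != 0 &&
                strcmp(fname, "End-Time") != 0)
                return TR_FAILED;
            return TR_EXPECT_THIS_TIME_RANGE_VALUE;
        default:
            return TR_FAILED;
    }
}

/* A '{' was seen. */
static TimeRangeState
tr_object_start(TimeRangeState state)
{
    return (state == TR_EXPECT_TIME_RANGE_START)
        ? TR_EXPECT_THIS_TIME_RANGE_FIELD : TR_FAILED;
}

/* A scalar value was seen; the value itself is not stored in this sketch. */
static TimeRangeState
tr_scalar(TimeRangeState state)
{
    return (state == TR_EXPECT_THIS_TIME_RANGE_VALUE)
        ? TR_EXPECT_THIS_TIME_RANGE_FIELD : TR_FAILED;
}

/* A '}' was seen. */
static TimeRangeState
tr_object_end(TimeRangeState state)
{
    return (state == TR_EXPECT_THIS_TIME_RANGE_FIELD)
        ? TR_EXPECT_TOPLEVEL_FIELD : TR_FAILED;
}
```

As in the scalar hunk above, a time value only advances the state. The sketch also does not check that each field appears exactly once.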

recovery-in-pgcontrol-v7-0002-remove-backuplabel.patch (text/x-diff; charset=us-ascii)

From 6b732c6e085a4ef2ec717af943b44d02cb4b9849 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Sun, 19 Nov 2023 16:54:37 +0000
Subject: Add recovery to pg_control and remove backup_label.

Simplify and harden recovery by removing backup_label and storing recovery
information directly in pg_control. Instead of backup software copying
pg_control from PGDATA, it stores an updated version that is returned from
pg_backup_stop(). This is better for the following reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If pg_control
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use. We could
also modify pg_resetwal to complain if recovery info is present in pg_control.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

Since backup_label is now gone, the fields that used to be in backup_label are
now provided as columns returned from pg_backup_start() and pg_backup_stop().
The backup history file is still written to the archive. For pg_basebackup, the
label passed on the command line as --label is now stored in the manifest.

Control and catalog version bumps are required.
---
 doc/src/sgml/backup.sgml                     |  31 ++-
 doc/src/sgml/func.sgml                       |  41 ++-
 doc/src/sgml/ref/pg_rewind.sgml              |   6 +-
 src/backend/access/transam/xlog.c            |  67 ++---
 src/backend/access/transam/xlogbackup.c      |  39 ++-
 src/backend/access/transam/xlogfuncs.c       |  50 ++--
 src/backend/access/transam/xlogrecovery.c    | 249 ++++---------------
 src/backend/backup/basebackup.c              |  37 +--
 src/backend/catalog/system_functions.sql     |  10 +-
 src/backend/utils/misc/pg_controldata.c      |  22 +-
 src/bin/pg_basebackup/t/010_pg_basebackup.pl |  15 +-
 src/bin/pg_controldata/pg_controldata.c      |   8 +
 src/bin/pg_resetwal/pg_resetwal.c            |   4 +
 src/bin/pg_rewind/filemap.c                  |   7 +-
 src/bin/pg_rewind/pg_rewind.c                |  64 +----
 src/include/access/xlog.h                    |   2 -
 src/include/access/xlogbackup.h              |  12 +-
 src/include/access/xlogrecovery.h            |   3 +-
 src/include/catalog/pg_control.h             |  19 +-
 src/include/catalog/pg_proc.dat              |  16 +-
 20 files changed, 293 insertions(+), 409 deletions(-)
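
Note for reviewers: the commit message describes pg_backup_stop() returning a pg_control copy built in four steps: copy the in-memory control data, set the backup recovery fields, recompute the CRC, and zero-pad to the file size. Below is a rough standalone sketch of that sequence, not the patch's code. DemoControlData, DEMO_CONTROL_FILE_SIZE, demo_crc32c() and demo_build_backup_control() are hypothetical stand-ins. The real code uses ControlFileData, PG_CONTROL_FILE_SIZE and the INIT_CRC32C/COMP_CRC32C/FIN_CRC32C macros, and takes ControlFileLock while copying.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_CONTROL_FILE_SIZE 8192 /* assumed to mirror PG_CONTROL_FILE_SIZE */

/* Hypothetical stand-in for ControlFileData; the CRC is the last field. */
typedef struct DemoControlData
{
    uint64_t    backupCheckPoint;
    uint64_t    backupStartPoint;
    uint32_t    backupStartPointTLI;
    uint8_t     backupRecoveryRequired;
    uint8_t     backupFromStandby;
    uint8_t     backupEndRequired;
    uint8_t     unused;
    uint32_t    crc;
} DemoControlData;

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t
demo_crc32c(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint32_t    crc = 0xFFFFFFFFu;

    while (len--)
    {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/*
 * Fill "out" (DEMO_CONTROL_FILE_SIZE bytes) with a copy of "live" that has
 * the backup recovery fields set and a CRC matching the modified contents.
 */
static void
demo_build_backup_control(unsigned char *out, const DemoControlData *live,
                          uint64_t checkpoint, uint64_t startpoint,
                          uint32_t starttli, int from_standby)
{
    DemoControlData copy;

    memcpy(&copy, live, sizeof(copy));

    copy.backupRecoveryRequired = 1;
    copy.backupFromStandby = from_standby ? 1 : 0;
    copy.backupEndRequired = 1;
    copy.backupCheckPoint = checkpoint;
    copy.backupStartPoint = startpoint;
    copy.backupStartPointTLI = starttli;

    /* The CRC covers everything before the crc field itself. */
    copy.crc = demo_crc32c(&copy, offsetof(DemoControlData, crc));

    /* Zero-pad to the full on-disk size. */
    memset(out, 0, DEMO_CONTROL_FILE_SIZE);
    memcpy(out, &copy, sizeof(copy));
}
```

The CRC is recomputed only after all fields are set. A CRC carried over from the live control data would no longer match the copy, and a CRC mismatch would cause the file to be rejected on restore.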

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 8cb24d6ae5..584384875b 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -935,19 +935,20 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      ready to archive.
     </para>
     <para>
-     <function>pg_backup_stop</function> will return one row with three
-     values. The second of these fields should be written to a file named
-     <filename>backup_label</filename> in the root directory of the backup. The
-     third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <function>pg_backup_stop</function> returns the
+     <filename>pg_control</filename> file, which must be stored in the
+     <filename>global</filename> directory of the backup. It also returns the
+     <filename>tablespace_map</filename> file, which should be written in the
+     root directory of the backup unless the field is empty. These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
     <para>
      Once the WAL segment files active during the backup are archived, you are
-     done.  The file identified by <function>pg_backup_stop</function>'s first return
+     done.  The file identified by <function>pg_backup_stop</function>'s
+     <parameter>lsn</parameter> return
      value is the last segment that is required to form a complete set of
      backup files.  On a primary, if <varname>archive_mode</varname> is enabled and the
      <literal>wait_for_archive</literal> parameter is <literal>true</literal>,
@@ -1013,7 +1014,15 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You must exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>pg_control_file</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This file contains the information
+    required to safely recover.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
@@ -1062,11 +1071,11 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    The backup label
-    file includes the label string you gave to <function>pg_backup_start</function>,
+    The backup history file (which is archived like WAL) includes the label
+    string you gave to <function>pg_backup_start</function>,
     as well as the time at which <function>pg_backup_start</function> was run, and
     the name of the starting WAL file.  In case of confusion it is therefore
-    possible to look inside a backup file and determine exactly which
+    possible to look inside a backup history file and determine exactly which
     backup session the dump file came from.  The tablespace map file includes
     the symbolic link names as they exist in the directory
     <filename>pg_tblspc/</filename> and the full path of each symbolic link.
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 93f068edcf..5f27cce161 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -26865,7 +26865,10 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <parameter>label</parameter> <type>text</type>
           <optional>, <parameter>fast</parameter> <type>boolean</type>
           </optional> )
-        <returnvalue>pg_lsn</returnvalue>
+        <returnvalue>record</returnvalue>
+        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>start</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Prepares the server to begin an on-line backup.  The only required
@@ -26877,6 +26880,13 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         as possible.  This forces an immediate checkpoint which will cause a
         spike in I/O operations, slowing any concurrently executing queries.
        </para>
+       <para>
+        The result columns contain information about the start of the backup
+        and can be ignored: the <parameter>lsn</parameter> column holds the
+        starting write-ahead log location, the
+        <parameter>timeline_id</parameter> column holds the starting timeline,
+        and the <parameter>start</parameter> column holds the starting timestamp.
+       </para>
        <para>
         This function is restricted to superusers by default, but other users
         can be granted EXECUTE to run the function.
@@ -26892,13 +26902,15 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
           <optional><parameter>wait_for_archive</parameter> <type>boolean</type>
           </optional> )
         <returnvalue>record</returnvalue>
-        ( <parameter>lsn</parameter> <type>pg_lsn</type>,
-        <parameter>labelfile</parameter> <type>text</type>,
-        <parameter>spcmapfile</parameter> <type>text</type> )
+        ( <parameter>pg_control_file</parameter> <type>bytea</type>,
+        <parameter>tablespace_map_file</parameter> <type>text</type>,
+        <parameter>lsn</parameter> <type>pg_lsn</type>,
+        <parameter>timeline_id</parameter> <type>int8</type>,
+        <parameter>stop</parameter> <type>timestamptz</type> )
        </para>
        <para>
         Finishes performing an on-line backup.  The desired contents of the
-        backup label file and the tablespace map file are returned as part of
+        pg_control file and the tablespace map file are returned as part of
         the result of the function and must be written to files in the
         backup area.  These files must not be written to the live data directory
         (doing so will cause PostgreSQL to fail to restart in the event of a
@@ -26930,13 +26942,18 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         backup.
        </para>
        <para>
-        The result of the function is a single record.
-        The <parameter>lsn</parameter> column holds the backup's ending
-        write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
-        stored as part of the backup and are required as part of the restore
-        process.
+        The result of the function is a single record. The first column returns
+        the contents of the <filename>pg_control</filename> file and the
+        second column returns the contents of the
+        <filename>tablespace_map</filename> file.  These must be stored as part
+        of the backup and are required as part of the restore process. Note that
+        <filename>pg_control</filename> will be 8192 bytes to match the on-disk
+        size of <filename>pg_control</filename>. The remainder of the columns
+        contain information about the end of the backup and can be ignored: the
+        <parameter>lsn</parameter> column holds the ending write-ahead log
+        location, the <parameter>timeline_id</parameter> column holds the ending
+        timeline, and the <parameter>stop</parameter> column holds the ending
+        timestamp.
        </para>
        <para>
         This function is restricted to superusers by default, but other users
diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 8e0000d39f..404ce7c65e 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -400,7 +400,6 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
       <filename>pg_serial/</filename>, <filename>pg_snapshots/</filename>,
       <filename>pg_stat_tmp/</filename>, and <filename>pg_subtrans/</filename>
       are omitted from the data copied from the source cluster. The files
-      <filename>backup_label</filename>,
       <filename>tablespace_map</filename>,
       <filename>pg_internal.init</filename>,
       <filename>postmaster.opts</filename>, and
@@ -410,9 +409,8 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
     </step>
     <step>
      <para>
-      Create a <filename>backup_label</filename> file to begin WAL replay at
-      the checkpoint created at failover and configure the
-      <filename>pg_control</filename> file with a minimum consistency LSN
+      Update the <filename>pg_control</filename> file to begin WAL replay at
+      the checkpoint created at failover and set a minimum consistency LSN
       defined as the result of <literal>pg_current_wal_insert_lsn()</literal>
       when rewinding from a live source or the last checkpoint LSN when
       rewinding from a stopped source.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1159dff1a6..928cad0651 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -74,6 +74,7 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "port/pg_crc32c.h"
 #include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
@@ -5135,7 +5136,6 @@ StartupXLOG(void)
 	bool		wasShutdown;
 	bool		didCrash;
 	bool		haveTblspcMap;
-	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
 	TimeLineID	EndOfLogTLI;
 	TimeLineID	newTLI;
@@ -5259,13 +5259,14 @@ StartupXLOG(void)
 	/*
 	 * Prepare for WAL recovery if needed.
 	 *
-	 * InitWalRecovery analyzes the control file and the backup label file, if
-	 * any.  It updates the in-memory ControlFile buffer according to the
-	 * starting checkpoint, and sets InRecovery and ArchiveRecoveryRequested.
+	 * InitWalRecovery analyzes the control file and checks if backup recovery
+	 * has been requested.  It updates the in-memory ControlFile buffer
+	 * according to the starting checkpoint, and sets InRecovery and
+	 * ArchiveRecoveryRequested.
+	 *
 	 * It also applies the tablespace map file, if any.
 	 */
-	InitWalRecovery(ControlFile, &wasShutdown,
-					&haveBackupLabel, &haveTblspcMap);
+	InitWalRecovery(ControlFile, &wasShutdown, &haveTblspcMap);
 	checkPoint = ControlFile->checkPointCopy;
 
 	/* initialize shared memory variables from the checkpoint record */
@@ -5408,20 +5409,6 @@ StartupXLOG(void)
 		 */
 		UpdateControlFile();
 
-		/*
-		 * If there was a backup label file, it's done its job and the info
-		 * has now been propagated into pg_control.  We must get rid of the
-		 * label file so that if we crash during recovery, we'll pick up at
-		 * the latest recovery restartpoint instead of going all the way back
-		 * to the backup start point.  It seems prudent though to just rename
-		 * the file out of the way rather than delete it completely.
-		 */
-		if (haveBackupLabel)
-		{
-			unlink(BACKUP_LABEL_OLD);
-			durable_rename(BACKUP_LABEL_FILE, BACKUP_LABEL_OLD, FATAL);
-		}
-
 		/*
 		 * If there was a tablespace_map file, it's done its job and the
 		 * symlinks have been created.  We must get rid of the map file so
@@ -5571,10 +5558,8 @@ StartupXLOG(void)
 	 * (at which point we reset backupStartPoint to be Invalid), for
 	 * backup-from-replica (which can't inject records into the WAL stream),
 	 * that point is when we reach the minRecoveryPoint in pg_control (which
-	 * we purposefully copy last when backing up from a replica).  For
-	 * pg_rewind (which creates a backup_label with a method of "pg_rewind")
-	 * or snapshot-style backups (which don't), backupEndRequired will be set
-	 * to false.
+	 * we purposefully copy last when backing up).  For pg_rewind or
+	 * snapshot-style backups, backupEndRequired will be set to false.
 	 *
 	 * Note: it is indeed okay to look at the local variable
 	 * LocalMinRecoveryPoint here, even though ControlFile->minRecoveryPoint
@@ -5951,8 +5936,11 @@ ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli)
 		ControlFile->minRecoveryPointTLI = tli;
 	}
 
+	ControlFile->backupCheckPoint = InvalidXLogRecPtr;
 	ControlFile->backupStartPoint = InvalidXLogRecPtr;
+	ControlFile->backupStartPointTLI = 0;
 	ControlFile->backupEndPoint = InvalidXLogRecPtr;
+	ControlFile->backupFromStandby = false;
 	ControlFile->backupEndRequired = false;
 	UpdateControlFile();
 
@@ -8744,11 +8732,33 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	ControlFileData *controlFileCopy = (ControlFileData *) state->controlFile;
 
 	Assert(state != NULL);
 
 	backup_stopped_in_recovery = RecoveryInProgress();
 
+	/*
+	 * Create a copy of control data and update it with fields required for
+	 * recovery. Also recalculate the CRC.
+	 */
+	memset(controlFileCopy, 0, PG_CONTROL_FILE_SIZE);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFileCopy, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlFileCopy->backupRecoveryRequired = true;
+	controlFileCopy->backupFromStandby = backup_stopped_in_recovery;
+	controlFileCopy->backupEndRequired = true;
+	controlFileCopy->backupCheckPoint = state->checkpointloc;
+	controlFileCopy->backupStartPoint = state->startpoint;
+	controlFileCopy->backupStartPointTLI = state->starttli;
+
+	INIT_CRC32C(controlFileCopy->crc);
+	COMP_CRC32C(controlFileCopy->crc, controlFileCopy, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlFileCopy->crc);
+
 	/*
 	 * During recovery, we don't need to check WAL level. Because, if WAL
 	 * level is not sufficient, it's impossible to get here during recovery.
@@ -8850,11 +8860,8 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							 "Enable full_page_writes and run CHECKPOINT on the primary, "
 							 "and then try an online backup again.")));
 
-
-		LWLockAcquire(ControlFileLock, LW_SHARED);
-		state->stoppoint = ControlFile->minRecoveryPoint;
-		state->stoptli = ControlFile->minRecoveryPointTLI;
-		LWLockRelease(ControlFileLock);
+		state->stoppoint = controlFileCopy->minRecoveryPoint;
+		state->stoptli = controlFileCopy->minRecoveryPointTLI;
 	}
 	else
 	{
@@ -8896,7 +8903,7 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 							histfilepath)));
 
 		/* Build and save the contents of the backup history file */
-		history_file = build_backup_content(state, true);
+		history_file = build_backup_history_content(state);
 		fprintf(fp, "%s", history_file);
 		pfree(history_file);
 
diff --git a/src/backend/access/transam/xlogbackup.c b/src/backend/access/transam/xlogbackup.c
index 21d68133ae..22c95f3c4c 100644
--- a/src/backend/access/transam/xlogbackup.c
+++ b/src/backend/access/transam/xlogbackup.c
@@ -18,19 +18,19 @@
 #include "access/xlogbackup.h"
 
 /*
- * Build contents for backup_label or backup history file.
- *
- * When ishistoryfile is true, it creates the contents for a backup history
- * file, otherwise it creates contents for a backup_label file.
+ * Build contents for backup history file.
  *
  * Returns the result generated as a palloc'd string.
  */
 char *
-build_backup_content(BackupState *state, bool ishistoryfile)
+build_backup_history_content(BackupState *state)
 {
 	char		startstrbuf[128];
+	char		stopstrfbuf[128];
 	char		startxlogfile[MAXFNAMELEN]; /* backup start WAL file */
+	char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
 	XLogSegNo	startsegno;
+	XLogSegNo	stopsegno;
 	StringInfo	result = makeStringInfo();
 	char	   *data;
 
@@ -45,16 +45,10 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "START WAL LOCATION: %X/%X (file %s)\n",
 					 LSN_FORMAT_ARGS(state->startpoint), startxlogfile);
 
-	if (ishistoryfile)
-	{
-		char		stopxlogfile[MAXFNAMELEN];	/* backup stop WAL file */
-		XLogSegNo	stopsegno;
-
-		XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
-		XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
-		appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
-						 LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
-	}
+	XLByteToSeg(state->stoppoint, stopsegno, wal_segment_size);
+	XLogFileName(stopxlogfile, state->stoptli, stopsegno, wal_segment_size);
+	appendStringInfo(result, "STOP WAL LOCATION: %X/%X (file %s)\n",
+					 LSN_FORMAT_ARGS(state->stoppoint), stopxlogfile);
 
 	appendStringInfo(result, "CHECKPOINT LOCATION: %X/%X\n",
 					 LSN_FORMAT_ARGS(state->checkpointloc));
@@ -65,17 +59,12 @@ build_backup_content(BackupState *state, bool ishistoryfile)
 	appendStringInfo(result, "LABEL: %s\n", state->name);
 	appendStringInfo(result, "START TIMELINE: %u\n", state->starttli);
 
-	if (ishistoryfile)
-	{
-		char		stopstrfbuf[128];
-
-		/* Use the log timezone here, not the session timezone */
-		pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
-					pg_localtime(&state->stoptime, log_timezone));
+	/* Use the log timezone here, not the session timezone */
+	pg_strftime(stopstrfbuf, sizeof(stopstrfbuf), "%Y-%m-%d %H:%M:%S %Z",
+				pg_localtime(&state->stoptime, log_timezone));
 
-		appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
-		appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
-	}
+	appendStringInfo(result, "STOP TIME: %s\n", stopstrfbuf);
+	appendStringInfo(result, "STOP TIMELINE: %u\n", state->stoptli);
 
 	data = result->data;
 	pfree(result);
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 45a70668b1..d81d36705f 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -53,7 +53,7 @@ static MemoryContext backupcontext = NULL;
  * pg_backup_start: set up for taking an on-line backup dump
  *
  * Essentially what this does is to create the contents required for the
- * backup_label file and the tablespace map.
+ * tablespace map.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -61,6 +61,10 @@ static MemoryContext backupcontext = NULL;
 Datum
 pg_backup_start(PG_FUNCTION_ARGS)
 {
+#define PG_BACKUP_START_V2_COLS 3
+	TupleDesc	tupdesc;
+	Datum		values[PG_BACKUP_START_V2_COLS] = {0};
+	bool		nulls[PG_BACKUP_START_V2_COLS] = {0};
 	text	   *backupid = PG_GETARG_TEXT_PP(0);
 	bool		fast = PG_GETARG_BOOL(1);
 	char	   *backupidstr;
@@ -69,6 +73,10 @@ pg_backup_start(PG_FUNCTION_ARGS)
 
 	backupidstr = text_to_cstring(backupid);
 
+	/* Initialize attributes information in the tuple descriptor */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
 	if (status == SESSION_BACKUP_RUNNING)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
@@ -102,7 +110,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 	register_persistent_abort_backup_handler();
 	do_pg_backup_start(backupidstr, fast, NULL, backup_state, tablespace_map);
 
-	PG_RETURN_LSN(backup_state->startpoint);
+	values[0] = LSNGetDatum(backup_state->startpoint);
+	values[1] = Int64GetDatum(backup_state->starttli);
+	values[2] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->starttime));
+
+	/* Returns the record as Datum */
+	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
 }
 
 
@@ -113,14 +126,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
  * allows the user to choose if they want to wait for the WAL to be archived
  * or if we should just return as soon as the WAL record is written.
  *
- * This function stops an in-progress backup, creates backup_label contents and
- * it returns the backup stop LSN, backup_label and tablespace_map contents.
+ * This function stops an in-progress backup and returns the backup stop LSN,
+ * pg_control and tablespace_map contents.
  *
- * The backup_label contains the user-supplied label string (typically this
- * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * The pg_control file contains the recovery information for the backup.  It is
+ * the caller's responsibility to write the pg_control and tablespace_map files
+ * in the data folder that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -128,12 +139,12 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 5
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
-	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -152,15 +163,16 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Stop the backup */
 	do_pg_backup_stop(backup_state, waitforarchive);
 
-	/* Build the contents of backup_label */
-	backup_label = build_backup_content(backup_state, false);
-
-	values[0] = LSNGetDatum(backup_state->stoppoint);
-	values[1] = CStringGetTextDatum(backup_label);
-	values[2] = CStringGetTextDatum(tablespace_map->data);
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_FILE_SIZE);
 
-	/* Deallocate backup-related variables */
-	pfree(backup_label);
+	values[0] = PointerGetDatum(pg_control_bytea);
+	values[1] = CStringGetTextDatum(tablespace_map->data);
+	values[2] = LSNGetDatum(backup_state->stoppoint);
+	values[3] = Int64GetDatum(backup_state->stoptli);
+	values[4] = TimestampTzGetDatum(time_t_to_timestamptz(backup_state->stoptime));
 
 	/* Clean up the session-level state and its memory context */
 	backup_state = NULL;
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index c61566666a..d05ff41f60 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -6,7 +6,7 @@
  * This source file contains functions controlling WAL recovery.
  * InitWalRecovery() initializes the system for crash or archive recovery,
  * or standby mode, depending on configuration options and the state of
- * the control file and possible backup label file.  PerformWalRecovery()
+ * the control file and possible backup recovery.  PerformWalRecovery()
  * performs the actual WAL replay, calling the rmgr-specific redo routines.
  * FinishWalRecovery() performs end-of-recovery checks and cleanup actions,
  * and prepares information needed to initialize the WAL for writes.  In
@@ -152,11 +152,12 @@ static bool recovery_signal_file_found = false;
 
 /*
  * CheckPointLoc is the position of the checkpoint record that determines
- * where to start the replay.  It comes from the backup label file or the
- * control file.
+ * where to start the replay.  It comes from the control file, either from the
+ * default location or from a backup recovery field.
  *
- * RedoStartLSN is the checkpoint's REDO location, also from the backup label
- * file or the control file.  In standby mode, XLOG streaming usually starts
+ * RedoStartLSN is the checkpoint's REDO location, also from the default
+ * control file location or from a backup recovery field.  In standby mode,
+ * XLOG streaming usually starts
  * from the position where an invalid record was found.  But if we fail to
  * read even the initial checkpoint record, we use the REDO location instead
  * of the checkpoint location as the start position of XLOG streaming.
@@ -388,9 +389,6 @@ static void ApplyWalRecord(XLogReaderState *xlogreader, XLogRecord *record, Time
 static void EnableStandbyMode(void);
 static void readRecoverySignalFile(void);
 static void validateRecoveryParameters(void);
-static bool read_backup_label(XLogRecPtr *checkPointLoc,
-							  TimeLineID *backupLabelTLI,
-							  bool *backupEndRequired, bool *backupFromStandby);
 static bool read_tablespace_map(List **tablespaces);
 
 static void xlogrecovery_redo(XLogReaderState *record, TimeLineID replayTLI);
@@ -492,8 +490,8 @@ EnableStandbyMode(void)
  * Prepare the system for WAL recovery, if needed.
  *
  * This is called by StartupXLOG() which coordinates the server startup
- * sequence.  This function analyzes the control file and the backup label
- * file, if any, and figures out whether we need to perform crash recovery or
+ * sequence.  This function analyzes the control file and backup recovery
+ * info, if any, and figures out whether we need to perform crash recovery or
  * archive recovery, and how far we need to replay the WAL to reach a
  * consistent state.
  *
@@ -510,7 +508,7 @@ EnableStandbyMode(void)
  */
 void
 InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
-				bool *haveBackupLabel_ptr, bool *haveTblspcMap_ptr)
+				bool *haveTblspcMap_ptr)
 {
 	XLogPageReadPrivate *private;
 	struct stat st;
@@ -518,7 +516,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	XLogRecord *record;
 	DBState		dbstate_at_startup;
 	bool		haveTblspcMap = false;
-	bool		haveBackupLabel = false;
+	bool		backupRecoveryRequired = false;
 	CheckPoint	checkPoint;
 	bool		backupFromStandby = false;
 
@@ -549,7 +547,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 	/*
 	 * Set the WAL reading processor now, as it will be needed when reading
-	 * the checkpoint record required (backup_label or not).
+	 * the checkpoint record required (backup recovery required or not).
 	 */
 	private = palloc0(sizeof(XLogPageReadPrivate));
 	xlogreader =
@@ -585,18 +583,33 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	primary_image_masked = (char *) palloc(BLCKSZ);
 
 	/*
-	 * Read the backup_label file.  We want to run this part of the recovery
-	 * process after checking for signal files and after performing validation
-	 * of the recovery parameters.
+	 * Load recovery settings from pg_control.  We want to run this part of the
+	 * recovery process after checking for signal files and after performing
+	 * validation of the recovery parameters.
 	 */
-	if (read_backup_label(&CheckPointLoc, &CheckPointTLI, &backupEndRequired,
-						  &backupFromStandby))
+	if (ControlFile->backupRecoveryRequired)
 	{
 		List	   *tablespaces = NIL;
 
+		/* Initialize recovery from fields stored in pg_control */
+		CheckPointLoc = ControlFile->backupCheckPoint;
+		CheckPointTLI = ControlFile->backupStartPointTLI;
+		RedoStartLSN = ControlFile->backupStartPoint;
+		RedoStartTLI = ControlFile->backupStartPointTLI;
+		backupEndRequired = ControlFile->backupEndRequired;
+		backupFromStandby = ControlFile->backupFromStandby;
+
+		/*
+		 * Clear backupRecoveryRequired in ControlFile so we do not initialize
+		 * recovery settings again, but also set a local variable so later logic
+		 * knows that backup recovery was initialized.
+		 */
+		ControlFile->backupRecoveryRequired = false;
+		backupRecoveryRequired = true;
+
 		/*
-		 * Archive recovery was requested, and thanks to the backup label
-		 * file, we know how far we need to replay to reach consistency. Enter
+		 * Archive recovery was requested, and thanks to the recovery
+		 * info, we know how far we need to replay to reach consistency. Enter
 		 * archive recovery directly.
 		 */
 		InArchiveRecovery = true;
@@ -604,8 +617,9 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			EnableStandbyMode();
 
 		/*
-		 * When a backup_label file is present, we want to roll forward from
-		 * the checkpoint it identifies, rather than using pg_control.
+		 * When backup recovery is requested, we want to roll forward from
+		 * the backup checkpoint recorded in pg_control, rather than using the
+		 * default checkpoint.
 		 */
 		record = ReadCheckpointRecord(xlogprefetcher, CheckPointLoc,
 									  CheckPointTLI);
@@ -620,9 +634,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 
 			/*
 			 * Make sure that REDO location exists. This may not be the case
-			 * if there was a crash during an online backup, which left a
-			 * backup_label around that references a WAL segment that's
-			 * already been archived.
+			 * if recovery.signal is missing and the WAL has already been
+			 * archived.
 			 */
 			if (checkPoint.redo < CheckPointLoc)
 			{
@@ -631,20 +644,16 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 								checkPoint.ThisTimeLineID))
 					ereport(FATAL,
 							(errmsg("could not find redo location referenced by checkpoint record"),
-							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-									 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-									 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-									 DataDir, DataDir, DataDir, DataDir)));
+							 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+									 DataDir, DataDir)));
 			}
 		}
 		else
 		{
 			ereport(FATAL,
 					(errmsg("could not locate required checkpoint record"),
-					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.\n"
-							 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
-							 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
-							 DataDir, DataDir, DataDir, DataDir)));
+					 errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" or \"%s/standby.signal\" and add required recovery options.",
+							 DataDir, DataDir)));
 			wasShutdown = false;	/* keep compiler quiet */
 		}
 
@@ -679,37 +688,32 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			/* tell the caller to delete it later */
 			haveTblspcMap = true;
 		}
-
-		/* tell the caller to delete it later */
-		haveBackupLabel = true;
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
-
 		/*
-		 * If tablespace_map file is present without backup_label file, there
-		 * is no use of such file.  There is no harm in retaining it, but it
-		 * is better to get rid of the map file so that we don't have any
+		 * If tablespace_map file is present without backup recovery requested,
+		 * there is no use of such file.  There is no harm in retaining it, but
+		 * it is better to get rid of the map file so that we don't have any
 		 * redundant file in data directory and it will avoid any sort of
 		 * confusion.  It seems prudent though to just rename the file out of
 		 * the way rather than delete it completely, also we ignore any error
 		 * that occurs in rename operation as even if map file is present
-		 * without backup_label file, it is harmless.
+		 * without backup recovery requested, it is harmless.
 		 */
 		if (stat(TABLESPACE_MAP, &st) == 0)
 		{
 			unlink(TABLESPACE_MAP_OLD);
 			if (durable_rename(TABLESPACE_MAP, TABLESPACE_MAP_OLD, DEBUG1) == 0)
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("File \"%s\" was renamed to \"%s\".",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 			else
 				ereport(LOG,
-						(errmsg("ignoring file \"%s\" because no file \"%s\" exists",
-								TABLESPACE_MAP, BACKUP_LABEL_FILE),
+						(errmsg("ignoring file \"%s\" because backup recovery was not requested",
+								TABLESPACE_MAP),
 						 errdetail("Could not rename file \"%s\" to \"%s\": %m.",
 								   TABLESPACE_MAP, TABLESPACE_MAP_OLD)));
 		}
@@ -943,7 +947,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		 * Any other state indicates that the backup somehow became corrupted
 		 * and we can't sensibly continue with recovery.
 		 */
-		if (haveBackupLabel)
+		if (backupRecoveryRequired)
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
@@ -953,7 +957,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 				if (dbstate_at_startup != DB_IN_ARCHIVE_RECOVERY &&
 					dbstate_at_startup != DB_SHUTDOWNED_IN_RECOVERY)
 					ereport(FATAL,
-							(errmsg("backup_label contains data inconsistent with control file"),
+							(errmsg("pg_control contains inconsistent data for standby backup"),
 							 errhint("This means that the backup is corrupted and you will "
 									 "have to use another backup for recovery.")));
 				ControlFile->backupEndPoint = ControlFile->minRecoveryPoint;
@@ -983,7 +987,6 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	missingContrecPtr = InvalidXLogRecPtr;
 
 	*wasShutdown_ptr = wasShutdown;
-	*haveBackupLabel_ptr = haveBackupLabel;
 	*haveTblspcMap_ptr = haveTblspcMap;
 }
 
@@ -1156,154 +1159,6 @@ validateRecoveryParameters(void)
 	}
 }
 
-/*
- * read_backup_label: check to see if a backup_label file is present
- *
- * If we see a backup_label during recovery, we assume that we are recovering
- * from a backup dump file, and we therefore roll forward from the checkpoint
- * identified by the label file, NOT what pg_control says.  This avoids the
- * problem that pg_control might have been archived one or more checkpoints
- * later than the start of the dump, and so if we rely on it as the start
- * point, we will fail to restore a consistent database state.
- *
- * Returns true if a backup_label was found (and fills the checkpoint
- * location and TLI into *checkPointLoc and *backupLabelTLI, respectively);
- * returns false if not. If this backup_label came from a streamed backup,
- * *backupEndRequired is set to true. If this backup_label was created during
- * recovery, *backupFromStandby is set to true.
- *
- * Also sets the global variables RedoStartLSN and RedoStartTLI with the LSN
- * and TLI read from the backup file.
- */
-static bool
-read_backup_label(XLogRecPtr *checkPointLoc, TimeLineID *backupLabelTLI,
-				  bool *backupEndRequired, bool *backupFromStandby)
-{
-	char		startxlogfilename[MAXFNAMELEN];
-	TimeLineID	tli_from_walseg,
-				tli_from_file;
-	FILE	   *lfp;
-	char		ch;
-	char		backuptype[20];
-	char		backupfrom[20];
-	char		backuplabel[MAXPGPATH];
-	char		backuptime[128];
-	uint32		hi,
-				lo;
-
-	/* suppress possible uninitialized-variable warnings */
-	*checkPointLoc = InvalidXLogRecPtr;
-	*backupLabelTLI = 0;
-	*backupEndRequired = false;
-	*backupFromStandby = false;
-
-	/*
-	 * See if label file is present
-	 */
-	lfp = AllocateFile(BACKUP_LABEL_FILE, "r");
-	if (!lfp)
-	{
-		if (errno != ENOENT)
-			ereport(FATAL,
-					(errcode_for_file_access(),
-					 errmsg("could not read file \"%s\": %m",
-							BACKUP_LABEL_FILE)));
-		return false;			/* it's not there, all is fine */
-	}
-
-	/*
-	 * Read and parse the START WAL LOCATION and CHECKPOINT lines (this code
-	 * is pretty crude, but we are not expecting any variability in the file
-	 * format).
-	 */
-	if (fscanf(lfp, "START WAL LOCATION: %X/%X (file %08X%16s)%c",
-			   &hi, &lo, &tli_from_walseg, startxlogfilename, &ch) != 5 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	RedoStartLSN = ((uint64) hi) << 32 | lo;
-	RedoStartTLI = tli_from_walseg;
-	if (fscanf(lfp, "CHECKPOINT LOCATION: %X/%X%c",
-			   &hi, &lo, &ch) != 3 || ch != '\n')
-		ereport(FATAL,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE)));
-	*checkPointLoc = ((uint64) hi) << 32 | lo;
-	*backupLabelTLI = tli_from_walseg;
-
-	/*
-	 * BACKUP METHOD lets us know if this was a typical backup ("streamed",
-	 * which could mean either pg_basebackup or the pg_backup_start/stop
-	 * method was used) or if this label came from somewhere else (the only
-	 * other option today being from pg_rewind).  If this was a streamed
-	 * backup then we know that we need to play through until we get to the
-	 * end of the WAL which was generated during the backup (at which point we
-	 * will have reached consistency and backupEndRequired will be reset to be
-	 * false).
-	 */
-	if (fscanf(lfp, "BACKUP METHOD: %19s\n", backuptype) == 1)
-	{
-		if (strcmp(backuptype, "streamed") == 0)
-			*backupEndRequired = true;
-	}
-
-	/*
-	 * BACKUP FROM lets us know if this was from a primary or a standby.  If
-	 * it was from a standby, we'll double-check that the control file state
-	 * matches that of a standby.
-	 */
-	if (fscanf(lfp, "BACKUP FROM: %19s\n", backupfrom) == 1)
-	{
-		if (strcmp(backupfrom, "standby") == 0)
-			*backupFromStandby = true;
-	}
-
-	/*
-	 * Parse START TIME and LABEL. Those are not mandatory fields for recovery
-	 * but checking for their presence is useful for debugging and the next
-	 * sanity checks. Cope also with the fact that the result buffers have a
-	 * pre-allocated size, hence if the backup_label file has been generated
-	 * with strings longer than the maximum assumed here an incorrect parsing
-	 * happens. That's fine as only minor consistency checks are done
-	 * afterwards.
-	 */
-	if (fscanf(lfp, "START TIME: %127[^\n]\n", backuptime) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup time %s in file \"%s\"",
-								 backuptime, BACKUP_LABEL_FILE)));
-
-	if (fscanf(lfp, "LABEL: %1023[^\n]\n", backuplabel) == 1)
-		ereport(DEBUG1,
-				(errmsg_internal("backup label %s in file \"%s\"",
-								 backuplabel, BACKUP_LABEL_FILE)));
-
-	/*
-	 * START TIMELINE is new as of 11. Its parsing is not mandatory, still use
-	 * it as a sanity check if present.
-	 */
-	if (fscanf(lfp, "START TIMELINE: %u\n", &tli_from_file) == 1)
-	{
-		if (tli_from_walseg != tli_from_file)
-			ereport(FATAL,
-					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-					 errmsg("invalid data in file \"%s\"", BACKUP_LABEL_FILE),
-					 errdetail("Timeline ID parsed is %u, but expected %u.",
-							   tli_from_file, tli_from_walseg)));
-
-		ereport(DEBUG1,
-				(errmsg_internal("backup timeline %u in file \"%s\"",
-								 tli_from_file, BACKUP_LABEL_FILE)));
-	}
-
-	if (ferror(lfp) || FreeFile(lfp))
-		ereport(FATAL,
-				(errcode_for_file_access(),
-				 errmsg("could not read file \"%s\": %m",
-						BACKUP_LABEL_FILE)));
-
-	return true;
-}
-
 /*
  * read_tablespace_map: check to see if a tablespace_map file is present
  *
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index c7b3ba3e6e..f6cd7d954c 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -22,6 +22,7 @@
 #include "backup/basebackup.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
 #include "common/file_perm.h"
@@ -192,10 +193,9 @@ static const struct exclude_list_item excludeFiles[] =
 	{RELCACHE_INIT_FILENAME, true},
 
 	/*
-	 * backup_label and tablespace_map should not exist in a running cluster
-	 * capable of doing an online backup, but exclude them just in case.
+	 * tablespace_map should not exist in a running cluster capable of doing
+	 * an online backup, but exclude it just in case.
 	 */
-	{BACKUP_LABEL_FILE, false},
 	{TABLESPACE_MAP, false},
 
 	/*
@@ -310,19 +310,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
-				char	   *backup_label;
 
 				bbsink_begin_archive(sink, "base.tar");
 
-				/* In the main tar, include the backup_label first... */
-				backup_label = build_backup_content(backup_state, false);
-				sendFileWithContent(sink, BACKUP_LABEL_FILE,
-									backup_label, -1, &manifest);
-				pfree(backup_label);
-
-				/* Then the tablespace_map file, if required... */
+				/* Send the tablespace_map file, if required... */
 				if (opt->sendtblspcmapfile)
 				{
 					sendFileWithContent(sink, TABLESPACE_MAP,
@@ -334,15 +326,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 				sendDir(sink, ".", 1, false, state.tablespaces,
 						sendtblspclinks, &manifest, InvalidOid);
 
-				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest);
+				/* End the backup before sending pg_control */
+				basebackup_progress_wait_wal_archive(&state);
+				do_pg_backup_stop(backup_state, !opt->nowait);
+
+				/* Send copy of pg_control containing recovery info */
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *) backup_state->controlFile,
+									PG_CONTROL_FILE_SIZE, &manifest);
 			}
 			else
 			{
@@ -376,9 +367,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink)
 			}
 		}
 
-		basebackup_progress_wait_wal_archive(&state);
-		do_pg_backup_stop(backup_state, !opt->nowait);
-
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
@@ -1051,7 +1039,6 @@ sendFileWithContent(bbsink *sink, const char *filename, const char *content,
 	/*
 	 * Construct a stat struct for the file we're injecting in the tar.
 	 */
-
 	/* Windows doesn't have the concept of uid and gid */
 #ifdef WIN32
 	statbuf.st_uid = 0;
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 4206752881..1f37719cf2 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -384,13 +384,15 @@ BEGIN ATOMIC
 END;
 
 CREATE OR REPLACE FUNCTION
-  pg_backup_start(label text, fast boolean DEFAULT false)
-  RETURNS pg_lsn STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
+  pg_backup_start(label text, fast boolean DEFAULT false, OUT lsn pg_lsn,
+        OUT timeline_id int8, OUT start timestamptz)
+  RETURNS record STRICT VOLATILE LANGUAGE internal AS 'pg_backup_start'
   PARALLEL RESTRICTED;
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
-        wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        wait_for_archive boolean DEFAULT true, OUT pg_control_file bytea,
+        OUT tablespace_map_file text, OUT lsn pg_lsn, OUT timeline_id int8,
+        OUT stop timestamptz)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index a1003a464d..44c4e987ce 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -163,8 +163,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[9];
+	bool		nulls[9];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -187,15 +187,27 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[1] = Int32GetDatum(ControlFile->minRecoveryPointTLI);
 	nulls[1] = false;
 
-	values[2] = LSNGetDatum(ControlFile->backupStartPoint);
+	values[2] = LSNGetDatum(ControlFile->backupCheckPoint);
 	nulls[2] = false;
 
-	values[3] = LSNGetDatum(ControlFile->backupEndPoint);
+	values[3] = LSNGetDatum(ControlFile->backupStartPoint);
 	nulls[3] = false;
 
-	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
+	values[4] = Int32GetDatum(ControlFile->backupStartPointTLI);
 	nulls[4] = false;
 
+	values[5] = LSNGetDatum(ControlFile->backupEndPoint);
+	nulls[5] = false;
+
+	values[6] = BoolGetDatum(ControlFile->backupRecoveryRequired);
+	nulls[6] = false;
+
+	values[7] = BoolGetDatum(ControlFile->backupFromStandby);
+	nulls[7] = false;
+
+	values[8] = BoolGetDatum(ControlFile->backupEndRequired);
+	nulls[8] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_basebackup/t/010_pg_basebackup.pl b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
index b9f5e1266b..c655cb0335 100644
--- a/src/bin/pg_basebackup/t/010_pg_basebackup.pl
+++ b/src/bin/pg_basebackup/t/010_pg_basebackup.pl
@@ -171,8 +171,8 @@ SKIP:
 
 # Write some files to test that they are not copied.
 foreach my $filename (
-	qw(backup_label tablespace_map postgresql.auto.conf.tmp
-	current_logfiles.tmp global/pg_internal.init.123))
+	qw(tablespace_map postgresql.auto.conf.tmp current_logfiles.tmp
+	   global/pg_internal.init.123))
 {
 	open my $file, '>>', "$pgdata/$filename";
 	print $file "DONOTCOPY";
@@ -261,14 +261,13 @@ foreach my $filename (@tempRelationFiles)
 		"base/$postgresOid/$filename not copied");
 }
 
-# Make sure existing backup_label was ignored.
-isnt(slurp_file("$tempdir/backup/backup_label"),
-	'DONOTCOPY', 'existing backup_label not copied');
+# Make sure existing tablespace_map was ignored.
+ok(!-f "$tempdir/backup/tablespace_map", 'tablespace_map not in backup');
 rmtree("$tempdir/backup");
 
-# Now delete the bogus backup_label file since it will interfere with startup
-unlink("$pgdata/backup_label")
-  or BAIL_OUT("unable to unlink $pgdata/backup_label");
+# Now delete the bogus tablespace_map file since it will interfere with startup
+unlink("$pgdata/tablespace_map")
+  or BAIL_OUT("unable to unlink $pgdata/tablespace_map");
 
 $node->command_ok(
 	[
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93e0837947..cc515b622f 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -277,10 +277,18 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint));
 	printf(_("Min recovery ending loc's timeline:   %u\n"),
 		   ControlFile->minRecoveryPointTLI);
+	printf(_("Backup checkpoint location:           %X/%X\n"),
+		   LSN_FORMAT_ARGS(ControlFile->backupCheckPoint));
 	printf(_("Backup start location:                %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupStartPoint));
+	printf(_("Backup start location's timeline:     %u\n"),
+		   ControlFile->backupStartPointTLI);
 	printf(_("Backup end location:                  %X/%X\n"),
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
+	printf(_("Backup recovery required:             %s\n"),
+		   ControlFile->backupRecoveryRequired ? _("yes") : _("no"));
+	printf(_("Backup from standby:                  %s\n"),
+		   ControlFile->backupFromStandby ? _("yes") : _("no"));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index 3ae3fc06df..255101ff3a 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -870,8 +870,12 @@ RewriteControlFile(void)
 	ControlFile.checkPoint = ControlFile.checkPointCopy.redo;
 	ControlFile.minRecoveryPoint = 0;
 	ControlFile.minRecoveryPointTLI = 0;
+	ControlFile.backupCheckPoint = 0;
 	ControlFile.backupStartPoint = 0;
+	ControlFile.backupStartPointTLI = 0;
 	ControlFile.backupEndPoint = 0;
+	ControlFile.backupRecoveryRequired = false;
+	ControlFile.backupFromStandby = false;
 	ControlFile.backupEndRequired = false;
 
 	/*
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index ecadd69dc5..213f4e71b8 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -139,11 +139,10 @@ static const struct exclude_list_item excludeFiles[] =
 	{"pg_internal.init", true}, /* defined as RELCACHE_INIT_FILENAME */
 
 	/*
-	 * If there is a backup_label or tablespace_map file, it indicates that a
-	 * recovery failed and this cluster probably can't be rewound, but exclude
-	 * them anyway if they are found.
+	 * If there is a tablespace_map file, it indicates that a recovery failed
+	 * and this cluster probably can't be rewound, but exclude it anyway if it
+	 * is found.
 	 */
-	{"backup_label", false},	/* defined as BACKUP_LABEL_FILE */
 	{"tablespace_map", false},	/* defined as TABLESPACE_MAP */
 
 	/*
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index bfd44a284e..6b23ed4a1c 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -39,9 +39,6 @@ static void perform_rewind(filemap_t *filemap, rewind_source *source,
 						   TimeLineID chkpttli,
 						   XLogRecPtr chkptredo);
 
-static void createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli,
-							  XLogRecPtr checkpointloc);
-
 static void digestControlFile(ControlFileData *ControlFile,
 							  const char *content, size_t size);
 static void getRestoreCommand(const char *argv0);
@@ -651,10 +648,10 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	}
 
 	if (showprogress)
-		pg_log_info("creating backup label and updating control file");
+		pg_log_info("updating control file");
 
 	/*
-	 * Create a backup label file, to tell the target where to begin the WAL
+	 * Get recovery fields to tell the target where to begin the WAL
 	 * replay. Normally, from the last common checkpoint between the source
 	 * and the target. But if the source is a standby server, it's possible
 	 * that the last common checkpoint is *after* the standby's restartpoint.
@@ -672,7 +669,6 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 		chkpttli = ControlFile_source.checkPointCopy.ThisTimeLineID;
 		chkptrec = ControlFile_source.checkPoint;
 	}
-	createBackupLabel(chkptredo, chkpttli, chkptrec);
 
 	/*
 	 * Update control file of target, to tell the target how far it must
@@ -722,6 +718,12 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupRecoveryRequired = true;
+	ControlFile_new.backupFromStandby = true;
+	ControlFile_new.backupEndRequired = false;
+	ControlFile_new.backupCheckPoint = chkptrec;
+	ControlFile_new.backupStartPoint = chkptredo;
+	ControlFile_new.backupStartPointTLI = chkpttli;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
@@ -729,7 +731,10 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 static void
 sanityChecks(void)
 {
-	/* TODO Check that there's no backup_label in either cluster */
+	/*
+	 * TODO Check that neither cluster has backupRecoveryRequired set in
+	 * pg_control.
+	 */
 
 	/* Check system_identifier match */
 	if (ControlFile_target.system_identifier != ControlFile_source.system_identifier)
@@ -951,51 +956,6 @@ findCommonAncestorTimeline(TimeLineHistoryEntry *a_history, int a_nentries,
 	}
 }
 
-
-/*
- * Create a backup_label file that forces recovery to begin at the last common
- * checkpoint.
- */
-static void
-createBackupLabel(XLogRecPtr startpoint, TimeLineID starttli, XLogRecPtr checkpointloc)
-{
-	XLogSegNo	startsegno;
-	time_t		stamp_time;
-	char		strfbuf[128];
-	char		xlogfilename[MAXFNAMELEN];
-	struct tm  *tmp;
-	char		buf[1000];
-	int			len;
-
-	XLByteToSeg(startpoint, startsegno, WalSegSz);
-	XLogFileName(xlogfilename, starttli, startsegno, WalSegSz);
-
-	/*
-	 * Construct backup label file
-	 */
-	stamp_time = time(NULL);
-	tmp = localtime(&stamp_time);
-	strftime(strfbuf, sizeof(strfbuf), "%Y-%m-%d %H:%M:%S %Z", tmp);
-
-	len = snprintf(buf, sizeof(buf),
-				   "START WAL LOCATION: %X/%X (file %s)\n"
-				   "CHECKPOINT LOCATION: %X/%X\n"
-				   "BACKUP METHOD: pg_rewind\n"
-				   "BACKUP FROM: standby\n"
-				   "START TIME: %s\n",
-	/* omit LABEL: line */
-				   LSN_FORMAT_ARGS(startpoint), xlogfilename,
-				   LSN_FORMAT_ARGS(checkpointloc),
-				   strfbuf);
-	if (len >= sizeof(buf))
-		pg_fatal("backup label buffer too small");	/* shouldn't happen */
-
-	/* TODO: move old file out of the way, if any. */
-	open_target_file("backup_label", true); /* BACKUP_LABEL_FILE */
-	write_target_range(buf, 0, len);
-	close_target_file();
-}
-
 /*
  * Check CRC of control file
  */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..3aac6839a7 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -293,8 +293,6 @@ extern SessionBackupState get_backup_status(void);
 /* File path names (all relative to $PGDATA) */
 #define RECOVERY_SIGNAL_FILE	"recovery.signal"
 #define STANDBY_SIGNAL_FILE		"standby.signal"
-#define BACKUP_LABEL_FILE		"backup_label"
-#define BACKUP_LABEL_OLD		"backup_label.old"
 
 #define TABLESPACE_MAP			"tablespace_map"
 #define TABLESPACE_MAP_OLD		"tablespace_map.old"
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index 1611358137..9be80a4f7d 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -15,6 +15,7 @@
 #define XLOG_BACKUP_H
 
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "pgtime.h"
 
 /* Structure to hold backup state. */
@@ -33,9 +34,16 @@ typedef struct BackupState
 	XLogRecPtr	stoppoint;		/* backup stop WAL location */
 	TimeLineID	stoptli;		/* backup stop TLI */
 	pg_time_t	stoptime;		/* backup stop time */
+
+	/*
+	 * After pg_backup_stop() returns this field will contain a copy of
+	 * pg_control that should be stored with the backup. Fields have been
+	 * updated for recovery and the CRC has been recalculated. Bytes after
+	 * sizeof(ControlFileData) are zeroed.
+	 */
+	uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 } BackupState;
 
-extern char *build_backup_content(BackupState *state,
-								  bool ishistoryfile);
+extern char *build_backup_history_content(BackupState *state);
 
 #endif							/* XLOG_BACKUP_H */
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index ee0bc74278..981266f734 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -80,8 +80,7 @@ extern Size XLogRecoveryShmemSize(void);
 extern void XLogRecoveryShmemInit(void);
 
 extern void InitWalRecovery(ControlFileData *ControlFile,
-							bool *wasShutdown_ptr, bool *haveBackupLabel_ptr,
-							bool *haveTblspcMap_ptr);
+							bool *wasShutdown_ptr, bool *haveTblspcMap_ptr);
 extern void PerformWalRecovery(void);
 
 /*
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 2ae72e3b26..0ea2e73368 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -146,6 +146,9 @@ typedef struct ControlFileData
 	 * to disk, we mustn't start up until we reach X again. Zero when not
 	 * doing archive recovery.
 	 *
+	 * backupCheckPoint is the backup start checkpoint and is set to zero after
+	 * recovery is initialized.
+	 *
 	 * backupStartPoint is the redo pointer of the backup start checkpoint, if
 	 * we are recovering from an online backup and haven't reached the end of
 	 * backup yet. It is reset to zero when the end of backup is reached, and
@@ -160,14 +163,28 @@ typedef struct ControlFileData
 	 * pg_control which was backed up last. It is reset to zero when the end
 	 * of backup is reached, and we mustn't start up before that.
 	 *
+	 * backupRecoveryRequired indicates that the pg_control file was provided
+	 * by a backup or pg_rewind and recovery settings need to be copied to the
+	 * appropriate fields. It will be set to false when the settings have been
+	 * copied.
+	 *
+	 * backupFromStandby indicates that the backup was taken on a standby. It is
+	 * required to initialize recovery and is set to false afterwards.
+	 *
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
-	 * start up.
+	 * start up. Currently backupEndRequired should only be false if recovery
+	 * settings were configured by pg_rewind, which does not require an end
+	 * point.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
+	XLogRecPtr	backupCheckPoint;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
+	TimeLineID	backupStartPointTLI;
+	bool		backupRecoveryRequired;
+	bool		backupFromStandby;
 	bool		backupEndRequired;
 
 	/*
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index fb58dee3bc..16346f0540 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6427,13 +6427,17 @@
   prosrc => 'pg_terminate_backend' },
 { oid => '2172', descr => 'prepare for taking an online backup',
   proname => 'pg_backup_start', provolatile => 'v', proparallel => 'r',
-  prorettype => 'pg_lsn', proargtypes => 'text bool',
+  prorettype => 'record', proargtypes => 'text bool',
+  proallargtypes => '{text,bool,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,i,o,o,o}',
+  proargnames => '{label,fast,lsn,timeline_id,start}',
   prosrc => 'pg_backup_start' },
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,bytea,text,pg_lsn,int8,timestamptz}',
+  proargmodes => '{i,o,o,o,o,o}',
+  proargnames => '{wait_for_archive,pg_control_file,tablespace_map_file,lsn,timeline_id,stop}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
@@ -11917,9 +11921,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,int4,pg_lsn,bool,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_checkpoint_lsn,backup_start_lsn,backup_start_tli,backup_end_lsn,backup_recovery_required,backup_from_standby,end_of_backup_record_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
-- 
2.34.1

#16

david@pgmasters.net

about 2 years ago

In reply to: Michael Paquier (#15)

Re: Add recovery to pg_control and remove backup_label

On 11/19/23 21:15, Michael Paquier wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5. Now added back in CC with the two latest patches
you've proposed attached.)

Ugh, I must have hit reply instead of reply all. It's a rookie error and
you hate to see it.

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create, show in the fields
returned by the SQL functions, or stream control files with a size of
512B; just stick to 8kB. If this is worth changing, it should be
applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery. I've suggested to not do that to keep a trace of what
was happening during recovery. The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good. Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups. The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.

This looks right to me.

On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to
pg_verifybackup is not really needed because it is used nowhere in the
validation process, so keeping the logic simpler would be the way to
go IMO. This is contrary to the WAL range for example, where start
and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?

I'm sure Robert will comment on this when he gets the time, but for now I
have backed off on passing the new info to pg_verifybackup and added
start/stop time.

FWIW, I'm OK with the bits for the backup manifest as presented. So
if there are no remarks and/or no objections, I'd like to apply it but
let's give some room to others to comment on that as there's been a gap
in the emails exchanged on pgsql-hackers. I hope that the summary
I've posted above covers everything. So let's see about doing
something around the middle of next week. With Thanksgiving in the
US, a lot of folks will not have the time to monitor what's happening
on this thread.

Timing sounds good to me.

+      The end time for the backup. This is when the backup was stopped in
+      <productname>PostgreSQL</productname> and represents the earliest time
+      that can be used for time-based Point-In-Time Recovery.
This one is actually a very good point. We'd lost this capacity with
the backup_label file gone without the end timestamps in the control
file.

Yeah, the end time is very important for recovery scenarios. We
definitely need that recorded somewhere.

I've noticed on the other thread the remark about being less
aggressive with the fields related to recovery in the control file, so
I assume that this patch should leave the fields be after the end of
recovery from the start and only rely on backupRecoveryRequired to
decide if the recovery should use the fields or not:
/messages/by-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@=pgmasters.net
+	ControlFile->backupCheckPoint = InvalidXLogRecPtr;
 	ControlFile->backupStartPoint = InvalidXLogRecPtr;
+	ControlFile->backupStartPointTLI = 0;
 	ControlFile->backupEndPoint = InvalidXLogRecPtr;
+	ControlFile->backupFromStandby = false;
 	ControlFile->backupEndRequired = false;
Still, I get the temptation to be consistent with the current style
on HEAD and reset everything as well.

I'd rather reset everything for now (as we do now) and think about
keeping these values as a separate patch. It may be that we don't want
to keep all of them, or we need a separate flag to say recovery was
completed. We are accumulating a lot of booleans here, maybe we need a
state var (recoveryRequired, recoveryInProgress, recoveryComplete) and
then define which other vars are valid in each state.
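
The state variable idea could look roughly like the sketch below. It is
only an illustration: the enum, its member names, and both helper
functions are hypothetical and are not part of the patch or of
ControlFileData. The extra NONE state is an assumption, added so that a
cluster that never came from a backup also has a value.

```c
#include <stdbool.h>

/* Hypothetical names: none of these exist in PostgreSQL's ControlFileData. */
typedef enum BackupRecoveryState
{
    BACKUP_RECOVERY_NONE = 0,       /* control file did not come from a backup */
    BACKUP_RECOVERY_REQUIRED,       /* restored from backup, recovery not started */
    BACKUP_RECOVERY_IN_PROGRESS,    /* recovery from the backup is running */
    BACKUP_RECOVERY_COMPLETE        /* recovery finished; fields kept as a trace */
} BackupRecoveryState;

/* Should startup consult the backup start/end fields to drive recovery? */
static bool
backup_fields_drive_recovery(BackupRecoveryState state)
{
    return state == BACKUP_RECOVERY_REQUIRED ||
           state == BACKUP_RECOVERY_IN_PROGRESS;
}

/* Are the backup fields meaningful at all, even if only informational? */
static bool
backup_fields_are_valid(BackupRecoveryState state)
{
    return state != BACKUP_RECOVERY_NONE;
}
```

With a single state value, keeping the fields after recovery and deciding
whether they still matter become two separate questions, which is the
distinction the booleans currently blur.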

Regards,
-David

#17

robertmhaas@gmail.com

about 2 years ago

In reply to: Michael Paquier (#15)

Re: Add recovery to pg_control and remove backup_label

On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier <michael@paquier.xyz> wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5. Now added back in CC with the two latest patches
you've proposed attached.)

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create, show in the fields
returned by the SQL functions, or stream control files with a size of
512B; just stick to 8kB. If this is worth changing, it should be
applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery. I've suggested to not do that to keep a trace of what
was happening during recovery. The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good. Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups. The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits. The last time we changed the API, we changed
pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
wonder if that's OK. Even if it is, do we really want to change this
API around again after such a short time?

That said, I don't have an intrinsic problem with moving this
information from the backup_label to the backup_manifest file since it
is purely informational. I do think there should perhaps be some
additions to the test cases, though.

I am concerned about the interaction of this proposal with incremental
backup. When you take an incremental backup, you get something that
looks a lot like a usable data directory but isn't. To prevent that
from causing avoidable disasters, the present version of the patch
adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
backup_label. pg_combinebackup knows to look for those fields, and the
server knows that if they are present it should refuse to start. With
this change, though, I suppose those fields would end up in
pg_control. But that does not feel entirely great, because we have a
goal of keeping the amount of real data in pg_control below 512 bytes,
the traditional sector size, and this adds another 12 bytes of stuff
to that file that currently doesn't need to be there. I feel like
that's kind of a problem.
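
As a rough illustration of the size concern: assuming the two
incremental fields would be stored as an 8-byte LSN and a 4-byte
timeline ID (the widths of XLogRecPtr and TimeLineID), the 12 bytes and
the 512-byte budget can be expressed as below. The macro, the enum
constant, and the function are made up for this sketch and ignore any
alignment padding the compiler might add.

```c
#include <stddef.h>
#include <stdint.h>

#define CONTROL_DATA_BUDGET 512     /* traditional sector size cited above */

/* Payload added by an 8-byte LSN plus a 4-byte TLI (assumed field widths). */
enum { INCREMENTAL_PAYLOAD = sizeof(uint64_t) + sizeof(uint32_t) };

/* Would control data of current_size bytes still fit after growing by added? */
static int
fits_in_budget(size_t current_size, size_t added)
{
    return current_size + added <= CONTROL_DATA_BUDGET;
}
```

A check of this shape only shows how much headroom remains; it does not
settle whether spending it on these fields is the right trade.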

But my main point here is ... if we have a few more senior hackers
weigh in and vote in favor of this change, well then that's one thing.
But IMHO a discussion that's mostly between 2 people is not nearly a
strong enough consensus to justify this amount of disruption.

--
Robert Haas
EDB: http://www.enterprisedb.com

#18

david@pgmasters.net

about 2 years ago

In reply to: Robert Haas (#17)

Re: Add recovery to pg_control and remove backup_label

On 11/20/23 12:11, Robert Haas wrote:

On Sun, Nov 19, 2023 at 8:16 PM Michael Paquier <michael@paquier.xyz> wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5. Now added back in CC with the two latest patches
you've proposed attached.)

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create, show in the fields
returned by the SQL functions, or stream control files with a size of
512B; just stick to 8kB. If this is worth changing, it should be
applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery. I've suggested to not do that to keep a trace of what
was happening during recovery. The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good. Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups. The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits.

From my perspective it's not that big a change for backup software but
it does bring a lot of benefits, including fixing an outstanding bug in
Postgres, i.e. reading pg_control without getting a torn copy.

The last time we changed the API, we changed
pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
wonder if that's OK. Even if it is, do we really want to change this
API around again after such a short time?

This is a good point. We could just rename again, but not sure what
names to go for this time. OTOH if the backup software is selecting
fields then they will get an error because the names have changed. If
the software is grabbing fields by position then they'll get a
valid-looking result (even if querying by position is a terrible idea).

Another thing we could do is explicitly error if we see backup_label in
PGDATA during recovery. That's just a few lines of code so would not be
a big deal to maintain. This error would only be visible on restore, so
it presumes that backup software is being tested.
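
For a sense of scale, the check could be as small as the standalone
sketch below. It is not taken from the patch: the function name is
hypothetical, and a real server would raise the error through its own
reporting machinery during startup rather than return a flag.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Return true if a file named backup_label exists directly under pgdata.
 * With backup_label removed from the backup protocol, finding one would
 * suggest the backup was taken by a tool that has not been updated.
 */
static bool
stale_backup_label_exists(const char *pgdata)
{
    char        path[4096];
    struct stat st;
    int         len;

    len = snprintf(path, sizeof(path), "%s/backup_label", pgdata);
    if (len < 0 || (size_t) len >= sizeof(path))
        return false;           /* path too long to check; treated as absent */

    return stat(path, &st) == 0;
}
```

A caller would run this once before recovery begins and refuse to
continue when it returns true.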

Maybe just a rename to something like pg_backup_begin/end would be the
way to go.

That said, I don't have an intrinsic problem with moving this
information from the backup_label to the backup_manifest file since it
is purely informational. I do think there should perhaps be some
additions to the test cases, though.

A little hard to add to the tests, I think, since they are purely
informational, i.e. not pushed up by the parser. Maybe we could just
grep for the fields?

I am concerned about the interaction of this proposal with incremental
backup. When you take an incremental backup, you get something that
looks a lot like a usable data directory but isn't. To prevent that
from causing avoidable disasters, the present version of the patch
adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
backup_label. pg_combinebackup knows to look for those fields, and the
server knows that if they are present it should refuse to start. With
this change, though, I suppose those fields would end up in
pg_control. But that does not feel entirely great, because we have a
goal of keeping the amount of real data in pg_control below 512 bytes,
the traditional sector size, and this adds another 12 bytes of stuff
to that file that currently doesn't need to be there. I feel like
that's kind of a problem.

I think these fields would be handled the same as the rest of the fields
in backup_label: returned from pg_backup_stop() and also stored in
backup_manifest. Third-party software can do as they like with them and
pg_combinebackup can just read from backup_manifest.

As for the pg_control file -- it might be best to give it a different
name for backups that are not essentially copies of PGDATA. On the other
hand, pgBackRest has included pg_control in incremental backups since
day one and we've never had a user mistakenly do a manual restore of one
and cause a problem (though manual restores are not the norm). Still,
probably can't hurt to be a bit careful.

But my main point here is ... if we have a few more senior hackers
weigh in and vote in favor of this change, well then that's one thing.
But IMHO a discussion that's mostly between 2 people is not nearly a
strong enough consensus to justify this amount of disruption.

We absolutely need more people to look at this and sign off. I'm glad
they have not so far because it has allowed time to whack the patch
around and get it into better shape.

Regards,
-David

#19

robertmhaas@gmail.com

about 2 years ago

In reply to: David Steele (#18)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 20, 2023 at 12:54 PM David Steele <david@pgmasters.net> wrote:

Another thing we could do is explicitly error if we see backup_label in
PGDATA during recovery. That's just a few lines of code so would not be
a big deal to maintain. This error would only be visible on restore, so
it presumes that backup software is being tested.

I think that if we do decide to adopt this proposal, that would be a
smart precaution.

A little hard to add to the tests, I think, since they are purely
informational, i.e. not pushed up by the parser. Maybe we could just
grep for the fields?

Hmm. Or should they be pushed up by the parser?

I think these fields would be handled the same as the rest of the fields
in backup_label: returned from pg_backup_stop() and also stored in
backup_manifest. Third-party software can do as they like with them and
pg_combinebackup can just read from backup_manifest.

I think that would be a bad plan, because this is critical
information, and a backup manifest is not a thing that you're required
to have. It's not a natural fit at all. We don't want to create a
situation where if you nuke the backup_manifest then the server
forgets that what it has is an incremental backup rather than a usable
data directory.

We absolutely need more people to look at this and sign off. I'm glad
they have not so far because it has allowed time to whack the patch
around and get it into better shape.

Cool.

--
Robert Haas
EDB: http://www.enterprisedb.com

#20

david@pgmasters.net

about 2 years ago

In reply to: Robert Haas (#19)

Re: Add recovery to pg_control and remove backup_label

On 11/20/23 14:44, Robert Haas wrote:

On Mon, Nov 20, 2023 at 12:54 PM David Steele <david@pgmasters.net> wrote:

Another thing we could do is explicitly error if we see backup_label in
PGDATA during recovery. That's just a few lines of code so would not be
a big deal to maintain. This error would only be visible on restore, so
it presumes that backup software is being tested.

I think that if we do decide to adopt this proposal, that would be a
smart precaution.

I'd be OK with it -- what do you think, Michael? Would this be enough
that we would not need to rename the functions, or should we just go
with the rename?

A little hard to add to the tests, I think, since they are purely
informational, i.e. not pushed up by the parser. Maybe we could just
grep for the fields?

Hmm. Or should they be pushed up by the parser?

We could do that. I started on that road, but it's a lot of code for
fields that aren't used. I think it would be better if the parser also
loaded a data structure that represented the manifest. Seems to me
there's a lot of duplicated code between pg_verifybackup and
pg_combinebackup the way it is now.

I think these fields would be handled the same as the rest of the fields
in backup_label: returned from pg_backup_stop() and also stored in
backup_manifest. Third-party software can do as they like with them and
pg_combinebackup can just read from backup_manifest.

I think that would be a bad plan, because this is critical
information, and a backup manifest is not a thing that you're required
to have. It's not a natural fit at all. We don't want to create a
situation where if you nuke the backup_manifest then the server
forgets that what it has is an incremental backup rather than a usable
data directory.

I can't see why a backup would continue to be valid without a manifest
-- that's not very standard for backup software. If you have the
critical info in backup_label, you can't afford to lose that, so why
should backup_manifest be any different?

Regards,
-David

#21

robertmhaas@gmail.com

about 2 years ago

In reply to: David Steele (#20)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 20, 2023 at 2:41 PM David Steele <david@pgmasters.net> wrote:

I can't see why a backup would continue to be valid without a manifest
-- that's not very standard for backup software. If you have the
critical info in backup_label, you can't afford to lose that, so why
should backup_manifest be any different?

I mean, you can run pg_basebackup --no-manifest.

--
Robert Haas
EDB: http://www.enterprisedb.com

#22

david@pgmasters.net

about 2 years ago

In reply to: Robert Haas (#21)

Re: Add recovery to pg_control and remove backup_label

On 11/20/23 15:47, Robert Haas wrote:

On Mon, Nov 20, 2023 at 2:41 PM David Steele <david@pgmasters.net> wrote:

I can't see why a backup would continue to be valid without a manifest
-- that's not very standard for backup software. If you have the
critical info in backup_label, you can't afford to lose that, so why
should backup_manifest be any different?

I mean, you can run pg_basebackup --no-manifest.

Maybe this would be a good thing to disable for page incremental. With
all the work being done by pg_combinebackup, it seems like it would be a
good idea to be able to verify the final result?

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?

Regards,
-David

#23

andres@anarazel.de

about 2 years ago

In reply to: Robert Haas (#17)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-20 11:11:13 -0500, Robert Haas wrote:

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits.

+1. The amount of whacking around in this area has been substantial, and it's
hard for operators to keep up. And realistically, with data sizes today, the
pressure to do basebackups with disk snapshots etc is not going to shrink.

Leaving that concern aside, I am still on the fence about this proposal. I
think it does decrease the chance of getting things wrong in the
streaming-basebackup case. But for external backups, it seems almost
universally worse (with the exception of the torn pg_control issue, that we
also can address otherwise):

It doesn't reduce the risk of getting things wrong, you can still omit placing
a file into the data directory and get silent corruption as a consequence. In
addition, it's harder to see when looking at a base backup whether the process
was right or not, because now the good and bad state look the same if you just
look on the filesystem level!

Then there's the issue of making ad-hoc debugging harder by not having a
human readable file with information anymore, including when looking at the
history, via backup_label.old.

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

Greetings,

Andres Freund

#24

andres@anarazel.de

about 2 years ago

In reply to: David Steele (#22)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-20 15:56:19 -0400, David Steele wrote:

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?

It's not free to create the manifest, particularly if checksums are enabled.

Also, for external backups, there's no manifest...

- Andres

#25

David G. Johnston

david.g.johnston@gmail.com

about 2 years ago

In reply to: Andres Freund (#23)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 20, 2023 at 1:37 PM Andres Freund <andres@anarazel.de> wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we
set
when creating a streaming base backup

I thought this was DOA since we don't want to ever leave the cluster in a
state where a crash requires intervention to restart. But I agree that it
is not possible to fool-proof against a naive backup that copies over the
pg_control file as-is if breaking the crashed cluster option is not in play.

I agree that this works if the pg_control generated by stop backup produces
the line and we retain the label file as a separate and now mandatory
component to using the backup.

Or is the idea to make v17 error if it sees a backup label unless
pg_control has the feature flag field? Which doesn't exist normally, does
in the basebackup version, and is removed once the backup is restored?

David J.

#26

michael@paquier.xyz

about 2 years ago

In reply to: Robert Haas (#17)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 20, 2023 at 11:11:13AM -0500, Robert Haas wrote:

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits. The last time we changed the API, we changed
pg_stop_backup to pg_backup_stop, but this doesn't do that, and I
wonder if that's OK. Even if it is, do we really want to change this
API around again after such a short time?

Agreed.

That said, I don't have an intrinsic problem with moving this
information from the backup_label to the backup_manifest file since it
is purely informational. I do think there should perhaps be some
additions to the test cases, though.

Yep, cool. Even if we decide to not go with what's discussed in this
patch, I think that's useful for some users at the end to get more
redundancy, as well. And that's in a format easier to parse.

I am concerned about the interaction of this proposal with incremental
backup. When you take an incremental backup, you get something that
looks a lot like a usable data directory but isn't. To prevent that
from causing avoidable disasters, the present version of the patch
adds INCREMENTAL FROM LSN and INCREMENTAL FROM TLI fields to the
backup_label. pg_combinebackup knows to look for those fields, and the
server knows that if they are present it should refuse to start. With
this change, though, I suppose those fields would end up in
pg_control. But that does not feel entirely great, because we have a
goal of keeping the amount of real data in pg_control below 512 bytes,
the traditional sector size, and this adds another 12 bytes of stuff
to that file that currently doesn't need to be there. I feel like
that's kind of a problem.

I don't recall one time where the addition of new fields to the
control file was easy to discuss because of its 512B hard limit.
Anyway, putting the addition aside for a second, and I've not looked
at the incremental backup patch, does the removal of the backup_label
make the combine logic more complicated, or is that just moving a chunk
of code to do a control file lookup instead of backup_label parsing?
Making the information less readable is definitely an issue for me. A
different alternative that I've mentioned upthread is to keep an
equivalent of the backup_label and rename it to something like
backup.debug or similar, with a name good enough to tell people that
we don't care about it being removed.
--
Michael

#27

andres@anarazel.de

about 2 years ago

In reply to: David G. Johnston (#25)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-20 14:18:15 -0700, David G. Johnston wrote:

On Mon, Nov 20, 2023 at 1:37 PM Andres Freund <andres@anarazel.de> wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we
set
when creating a streaming base backup

I thought this was DOA since we don't want to ever leave the cluster in a
state where a crash requires intervention to restart.

I was trying to suggest that we'd set the field in-memory, when streaming out
a pg_basebackup style backup (by just replacing pg_control with an otherwise
identical file that has the flag set). So it'd not have any effect on the
primary.

Greetings,

Andres Freund

#28

michael@paquier.xyz

about 2 years ago

In reply to: Andres Freund (#23)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no? I
don't quite see why it is different than what's proposed on this
thread, except that you don't need to write one file to the data
folder to store the backup label fields, but two, meaning that there's
a risk for more mistakes because a clean backup process would require
more steps.

With the current position of the fields in ControlFileData, there are
three free bytes after backupEndRequired, so it is possible to add
that for free. Now, would you actually need an extra field knowing
that backupStartPoint is around?
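
The "free bytes" come from alignment padding. A small sketch, assuming
the flag is followed by an int member as in the excerpt below and a
typical ABI with a 1-byte bool and 4-byte int alignment; the struct here
is a hypothetical cut-down stand-in, not the real ControlFileData:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical excerpt; only the bool-then-int ordering mirrors the text. */
typedef struct ControlTail
{
    uint64_t    backupEndPoint;
    bool        backupEndRequired;
    /* compiler-inserted padding sits here on typical ABIs */
    int         wal_level;
} ControlTail;

/* Number of unused bytes between backupEndRequired and the next member. */
static size_t
free_bytes_after_flag(void)
{
    return offsetof(ControlTail, wal_level)
        - (offsetof(ControlTail, backupEndRequired) + sizeof(bool));
}
```

A new one-byte field placed right after the flag would occupy one of
those padding bytes, so the overall struct size would not change.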
--
Michael

#29

andres@anarazel.de

about 2 years ago

In reply to: Michael Paquier (#28)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.
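
One way to read this cross-check, following the wording of the original
suggestion ("error out if backup_label does not exist"), is sketched
below. The enum, the function, and the parameter names are hypothetical;
the sketch only encodes the decision table, not any server code.

```c
#include <stdbool.h>

typedef enum StartupAction
{
    STARTUP_CRASH_RECOVERY,     /* no backup_label, no flag: behave as today */
    STARTUP_BACKUP_RECOVERY,    /* backup_label found: recover from backup */
    STARTUP_REFUSE              /* flag set but backup_label is missing */
} StartupAction;

/*
 * label_required_flag: the proposed pg_control field, set only in the
 * copy of pg_control streamed out by a pg_basebackup style backup.
 */
static StartupAction
decide_startup(bool label_required_flag, bool backup_label_present)
{
    if (backup_label_present)
        return STARTUP_BACKUP_RECOVERY;
    if (label_required_flag)
        return STARTUP_REFUSE;
    return STARTUP_CRASH_RECOVERY;
}
```

Because the flag is never set in the primary's own pg_control, a crashed
primary always takes the first or last branch and restarts without
intervention.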

Greetings,

Andres Freund

#30

michael@paquier.xyz

about 2 years ago

In reply to: Andres Freund (#29)

Re: Add recovery to pg_control and remove backup_label

On Mon, Nov 20, 2023 at 03:58:55PM -0800, Andres Freund wrote:

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.

I don't see why not. It does not increase the number of steps when
doing a backup, and backupStartPoint alone would not be able to offer
this much protection.
--
Michael

#31

david@pgmasters.net

about 2 years ago

In reply to: Andres Freund (#29)

Re: Add recovery to pg_control and remove backup_label

On 11/20/23 19:58, Andres Freund wrote:

On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.

I'm not sure why you think the patch under discussion doesn't do
anything for external backups. It provides the same benefits to both
pg_basebackup and external backups, i.e. they both receive the updated
version of pg_control.

I really dislike the idea of pg_basebackup having a special mechanism
for making recovery safer that is not generally available to external
backup software. It might be easy enough for some (e.g. pgBackRest) to
manipulate pg_control but would be out of reach for most.

Regards,
-David

#32

andres@anarazel.de

about 2 years ago

In reply to: David Steele (#31)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-21 07:42:42 -0400, David Steele wrote:

On 11/20/23 19:58, Andres Freund wrote:

On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is present. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.
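
The cross-check proposed here can be pictured as a small decision table. The sketch below is only an illustration of the idea as worded in the proposal ("error out if backup_label does not exist"); the names `require_backup_label`, `decide_startup`, and `Startup` are hypothetical and do not correspond to any real pg_control field or PostgreSQL function.

```python
from enum import Enum


class Startup(Enum):
    BACKUP_RECOVERY = "recover using backup_label"
    NORMAL = "normal startup or crash recovery"
    ERROR = "refuse to start: backup_label is missing"


def decide_startup(require_backup_label: bool, backup_label_exists: bool) -> Startup:
    """Hypothetical startup decision for the proposed pg_control flag.

    require_backup_label stands in for the proposed flag, which only the
    streaming base backup path would set.
    """
    if backup_label_exists:
        # backup_label is still honored whether or not the flag is set.
        return Startup.BACKUP_RECOVERY
    if require_backup_label:
        # Flag set but the file is gone: someone removed backup_label.
        return Startup.ERROR
    # No flag and no file: indistinguishable from an ordinary startup, which
    # is why an external backup gains nothing from this flag.
    return Startup.NORMAL


for flag in (False, True):
    for label in (False, True):
        print(f"flag={flag!s:5} backup_label={label!s:5} -> {decide_startup(flag, label).value}")
```

Only the flag-set, file-missing combination produces an error. The flag-unset, file-missing row is the external-backup case that stays silent.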

I'm not sure why you think the patch under discussion doesn't do anything
for external backups. It provides the same benefits to both pg_basebackup
and external backups, i.e. they both receive the updated version of
pg_control.

Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the removal of backup_label, just like they don't get one today if
they don't put backup_label in the data directory. Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?

Greetings,

Andres Freund

#33

david@pgmasters.net

about 2 years ago

In reply to: Andres Freund (#32)

Re: Add recovery to pg_control and remove backup_label

On 11/21/23 12:41, Andres Freund wrote:

On 2023-11-21 07:42:42 -0400, David Steele wrote:

On 11/20/23 19:58, Andres Freund wrote:

On 2023-11-21 08:52:08 +0900, Michael Paquier wrote:

On Mon, Nov 20, 2023 at 12:37:46PM -0800, Andres Freund wrote:

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

That would mean that one still needs to take an extra step to update a
control file with this byte set, which is something you had a concern
with in terms of compatibility when it comes to external backup
solutions because more steps are necessary to take a backup, no?

I was thinking we'd just set it in the pg_basebackup style path, and we'd
error out if it's set and backup_label is absent. But we'd still use
backup_label without the pg_control flag set.

So it'd just provide a cross-check that backup_label was not removed for
pg_basebackup style backup, but wouldn't do anything for external backups. But
imo the proposal to just use pg_control doesn't actually do anything for
external backups either - which is why I think my proposal would achieve as
much, for a much lower price.

I'm not sure why you think the patch under discussion doesn't do anything
for external backups. It provides the same benefits to both pg_basebackup
and external backups, i.e. they both receive the updated version of
pg_control.

Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the removal of backup_label, just like they don't get one today if
they don't put backup_label in the data directory. Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?

I think reliable backup software does the right thing with backup_label,
but if the user starts getting errors on recovery they then decide to
remove backup_label. I know we can't do much about bad backup software,
but we can at least make this a bit more resistant to user error after
the fact.

It doesn't help that one of our hints suggests removing backup_label. In
highly automated systems, the user might not even know they just
restored from a backup. They are only in the loop because the restore
failed and they are trying to figure out what is going wrong. When they
remove backup_label the cluster comes up just fine. Victory!

This is the scenario I've seen most often -- not the backup/restore
process getting it wrong but the user removing backup_label on their own
initiative. And because it yields such a positive result, at least
initially, they remember in the future that the thing to do is to remove
backup_label whenever they see the error.

If they only have pg_control, then their only choice is to get it right
or run pg_resetwal. Most users have no knowledge of pg_resetwal so it
will take them longer to get there. Also, I think that tool makes it
pretty clear that corruption will result and the only thing to do is a
logical dump and restore after using it.

There are plenty of ways a user can mess things up. What I'd like to
prevent is the appearance of everything being OK when in fact they have
corrupted their cluster. That's the situation we have now with
backup_label. Is this new solution perfect? No, but I do think it checks
several boxes, and is a worthwhile improvement.

Regards,
-David

#34

david@pgmasters.net

about 2 years ago

In reply to: Andres Freund (#24)

Re: Add recovery to pg_control and remove backup_label

On 11/20/23 16:41, Andres Freund wrote:

On 2023-11-20 15:56:19 -0400, David Steele wrote:

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?

It's not free to create the manifest, particularly if checksums are enabled.

It's virtually free, even with the basic CRCs. Anyway, would you really
want a backup without a manifest? How would you know something is
missing? In particular, for page incremental how do you know something
is new (but not WAL logged) if there is no manifest? Is the plan to just
recopy anything not WAL logged with each incremental?

Also, for external backups, there's no manifest...

There certainly is a manifest for many external backup solutions. Not
having a manifest is just running with scissors, backup-wise.

Regards,
-David

#35

andres@anarazel.de

about 2 years ago

In reply to: David Steele (#34)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-21 13:41:15 -0400, David Steele wrote:

On 11/20/23 16:41, Andres Freund wrote:

On 2023-11-20 15:56:19 -0400, David Steele wrote:

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?

It's not free to create the manifest, particularly if checksums are enabled.

It's virtually free, even with the basic CRCs.

Huh?

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar > /dev/null

4,423.81 msec task-clock # 0.626 CPUs utilized
433,475 context-switches # 97.987 K/sec
5 cpu-migrations # 1.130 /sec
599 page-faults # 135.404 /sec
12,208,261,153 cycles # 2.760 GHz
6,805,401,520 instructions # 0.56 insn per cycle
1,273,896,027 branches # 287.964 M/sec
14,233,126 branch-misses # 1.12% of all branches

7.068946385 seconds time elapsed

1.106072000 seconds user
3.403793000 seconds sys

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --manifest-checksums=CRC32C > /dev/null

4,324.64 msec task-clock # 0.640 CPUs utilized
433,306 context-switches # 100.195 K/sec
3 cpu-migrations # 0.694 /sec
598 page-faults # 138.277 /sec
11,952,475,908 cycles # 2.764 GHz
6,816,888,845 instructions # 0.57 insn per cycle
1,275,949,455 branches # 295.042 M/sec
13,721,376 branch-misses # 1.08% of all branches

6.760321433 seconds time elapsed

1.113256000 seconds user
3.302907000 seconds sys

perf stat src/bin/pg_basebackup/pg_basebackup -h /tmp/ -p 5440 -D - -cfast -Xnone --format=tar --no-manifest > /dev/null

3,925.38 msec task-clock # 0.823 CPUs utilized
257,467 context-switches # 65.590 K/sec
4 cpu-migrations # 1.019 /sec
552 page-faults # 140.624 /sec
11,577,054,842 cycles # 2.949 GHz
5,933,731,797 instructions # 0.51 insn per cycle
1,108,784,719 branches # 282.466 M/sec
11,867,511 branch-misses # 1.07% of all branches

4.770347012 seconds time elapsed

1.002521000 seconds user
2.991769000 seconds sys

I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".
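
To put those elapsed times in relative terms, here is a small arithmetic sketch. It uses only the "seconds time elapsed" figures from the three runs above; the labels and the helper name `overhead_pct` are just for illustration.

```python
# Wall-clock "seconds time elapsed" copied from the three perf stat runs.
elapsed = {
    "run 1 (manifest, no checksum option given)": 7.068946385,
    "run 2 (--manifest-checksums=CRC32C)": 6.760321433,
    "run 3 (--no-manifest)": 4.770347012,
}

baseline = elapsed["run 3 (--no-manifest)"]


def overhead_pct(seconds: float, base: float) -> float:
    """Extra elapsed time relative to the no-manifest run, in percent."""
    return (seconds - base) / base * 100.0


for name, seconds in elapsed.items():
    print(f"{name}: {seconds:.2f} s, +{overhead_pct(seconds, baseline):.1f}% vs --no-manifest")
```

On these single runs the manifest adds roughly 42% to 48% of elapsed time over `--no-manifest`. These are one-off measurements on one machine, so the exact percentages should not be read as general.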

And this is actually *underselling* the cost - we waste a lot of cycles due to
bad buffering decisions. Once we fix that, the cost differential increases
further.

Anyway, would you really want a backup without a manifest? How would you
know something is missing? In particular, for page incremental how do you
know something is new (but not WAL logged) if there is no manifest? Is the
plan to just recopy anything not WAL logged with each incremental?

Shrug. If you just want to create a new standby by copying the primary, I
don't think creating and then validating the manifest buys you much. Long term
backups are a different story, particularly if data files are stored
individually, rather than in a single checksummed file.

Also, for external backups, there's no manifest...

There certainly is a manifest for many external backup solutions. Not having
a manifest is just running with scissors, backup-wise.

You mean that you have an external solution gin up a backup manifest? I fail
to see how that's relevant here?

Greetings,

Andres Freund

#36

david@pgmasters.net

about 2 years ago

In reply to: Andres Freund (#23)

Re: Add recovery to pg_control and remove backup_label

On 11/20/23 16:37, Andres Freund wrote:

On 2023-11-20 11:11:13 -0500, Robert Haas wrote:

I think we need more votes to make a change this big. I have a
concern, which I think I've expressed before, that we keep whacking
around the backup APIs, and that has a cost which is potentially
larger than the benefits.

+1. The amount of whacking around in this area has been substantial, and it's
hard for operators to keep up. And realistically, with data sizes today, the
pressure to do basebackups with disk snapshots etc is not going to shrink.

True enough, but disk snapshots aren't really backups in themselves, in
most scenarios, because they reside on the same storage as the cluster.
Of course, snapshots can be exported, but that's also expensive.

I see snapshots as an adjunct to backups -- a safe backup offsite
somewhere for DR and snapshots for day to day operations. Even so,
managing snapshots as backups is harder than people think. It is easy to
get wrong and end up with silent corruption.

Leaving that concern aside, I am still on the fence about this proposal. I
think it does decrease the chance of getting things wrong in the
streaming-basebackup case. But for external backups, it seems almost
universally worse (with the exception of the torn pg_control issue, that we
also can address otherwise):

Why universally worse? The software stores pg_control instead of backup
label. The changes to pg_basebackup were pretty trivial and the changes
to external backup are pretty much the same, at least in my limited
sample of one.

And I don't believe we have a satisfactory solution to the torn
pg_control issue yet. Certainly it has not been committed and Thomas has
shown enthusiasm for this approach, to the point of hoping it could be
back patched (it can't).

It doesn't reduce the risk of getting things wrong, you can still omit placing
a file into the data directory and get silent corruption as a consequence. In
addition, it's harder to see when looking at a base backup whether the process
was right or not, because now the good and bad state look the same if you just
look on the filesystem level!

This is one of the reasons I thought writing just the first 512 bytes of
pg_control would be valuable. It would give an easy indicator that
pg_control came from a backup. Michael was not in favor of conflating
that change with this patch -- but I still think it's a good idea.

Then there's the issue of making ad-hoc debugging harder by not having a
human readable file with information anymore, including when looking at the
history, via backup_label.old.

Yeah, you'd need to use pg_controldata instead. But as Michael has
suggested, we could also write backup_label as backup_info so there is
human-readable information available.

Given that, I wonder if what we should do is to just add a new field to
pg_control that says "error out if backup_label does not exist", that we set
when creating a streaming base backup

I'm not in favor of a change only accessible to pg_basebackup or
external software that can manipulate pg_control.

Regards,
-David

#37

david@pgmasters.net

about 2 years ago

In reply to: Andres Freund (#35)

Re: Add recovery to pg_control and remove backup_label

On 11/21/23 13:59, Andres Freund wrote:

On 2023-11-21 13:41:15 -0400, David Steele wrote:

On 11/20/23 16:41, Andres Freund wrote:

On 2023-11-20 15:56:19 -0400, David Steele wrote:

I understand this is an option -- but does it need to be? What is the
benefit of excluding the manifest?

It's not free to create the manifest, particularly if checksums are enabled.

It's virtually free, even with the basic CRCs.

Huh?

<snip>

I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".

OK, but how does that look with compression -- to a remote location?
Uncompressed backup to local storage doesn't seem very realistic. With
gzip compression we measure SHA1 checksums at about 5% of total CPU.
Obviously that goes up with zstd or lz4, but parallelism helps offset
that cost, at least in clock time.

I can't overstate how valuable checksums are in finding corruption,
especially in long-lived backups.

Anyway, would you really want a backup without a manifest? How would you
know something is missing? In particular, for page incremental how do you
know something is new (but not WAL logged) if there is no manifest? Is the
plan to just recopy anything not WAL logged with each incremental?

Shrug. If you just want to create a new standby by copying the primary, I
don't think creating and then validating the manifest buys you much. Long term
backups are a different story, particularly if data files are stored
individually, rather than in a single checksummed file.

Fine, but you are probably not using page incremental if just using
pg_basebackup to create a standby. With page incremental, at least one
of the backups will already exist, which argues for a manifest.

Also, for external backups, there's no manifest...

There certainly is a manifest for many external backup solutions. Not having
a manifest is just running with scissors, backup-wise.

You mean that you have an external solution gin up a backup manifest? I fail
to see how that's relevant here?

Just saying that for external backups there *is* often a manifest and it
is a good thing to have.

Regards,
-David

#38

andres@anarazel.de

about 2 years ago

In reply to: David Steele (#37)

Re: Add recovery to pg_control and remove backup_label

Hi,

On 2023-11-21 14:48:59 -0400, David Steele wrote:

I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".

OK, but how does that look with compression

With compression it's obviously somewhat different - but that part is done in
parallel, potentially on a different machine with client side compression,
whereas I think right now the checksumming is single-threaded, on the server
side.

With parallel server side compression, it's still 20% slower with the default
checksumming than none. With client side it's 15%.

-- to a remote location?

I think this one unfortunately makes checksums a bigger issue, not a smaller
one. The network interaction piece is single-threaded, adding another
significant use of CPU onto the same thread means that you are hit harder by
using a substantial amount of CPU for checksumming in the same thread.

Once you go beyond the small instances, you have plenty of network bandwidth in
cloud environments. We top out well before the network on bigger instances.

Uncompressed backup to local storage doesn't seem very realistic. With gzip
compression we measure SHA1 checksums at about 5% of total CPU.

IMO using gzip is basically infeasible for non-toy sized databases today. I
think we're doing our users a disservice by defaulting to it in a bunch of
places. Even if another default exposes them to difficulty due to potentially
using a different compiled binary with fewer supported compression methods -
that's gonna be very rare in practice.

I can't overstate how valuable checksums are in finding corruption,
especially in long-lived backups.

I agree! But I think we need faster checksum algorithms or a faster
implementation of the existing ones. And probably default to something faster
once we have it.

Greetings,

Andres Freund

#39


david@pgmasters.net

about 2 years ago

In reply to: Andres Freund (#38)

Re: Add recovery to pg_control and remove backup_label

On 11/21/23 16:00, Andres Freund wrote:

Hi,

On 2023-11-21 14:48:59 -0400, David Steele wrote:

I'd not call 7.06->4.77 or 6.76->4.77 "virtually free".

OK, but how does that look with compression

With compression it's obviously somewhat different - but that part is done in
parallel, potentially on a different machine with client side compression,
whereas I think right now the checksumming is single-threaded, on the server
side.

Ah, yes, that's certainly a bottleneck.

With parallel server side compression, it's still 20% slower with the default
checksumming than none. With client side it's 15%.

Yeah, that still seems a lot. But to a large extent it sounds like a
limitation of the current implementation.

-- to a remote location?

I think this one unfortunately makes checksums a bigger issue, not a smaller
one. The network interaction piece is single-threaded, adding another
significant use of CPU onto the same thread means that you are hit harder by
using a substantial amount of CPU for checksumming in the same thread.

Once you go beyond the small instances, you have plenty of network bandwidth in
cloud environments. We top out well before the network on bigger instances.

Uncompressed backup to local storage doesn't seem very realistic. With gzip
compression we measure SHA1 checksums at about 5% of total CPU.

IMO using gzip is basically infeasible for non-toy sized databases today. I
think we're doing our users a disservice by defaulting to it in a bunch of
places. Even if another default exposes them to difficulty due to potentially
using a different compiled binary with fewer supported compression methods -
that's gonna be very rare in practice.

Yeah, I don't use gzip anymore, but there are still some platforms that
do not provide zstd (at least not easily) and lz4 compresses less. One
thing people do seem to have is a lot of cores.

I can't overstate how valuable checksums are in finding corruption,
especially in long-lived backups.

I agree! But I think we need faster checksum algorithms or a faster
implementation of the existing ones. And probably default to something faster
once we have it.

We've been using xxHash to generate checksums for our block-level
incremental and it is seriously fast, written by the same guy who did
zstd and lz4.

Regards,
-David

#40

Stephen Frost

sfrost@snowman.net

about 2 years ago

In reply to: David Steele (#33)

Re: Add recovery to pg_control and remove backup_label

Greetings,

* David Steele (david@pgmasters.net) wrote:

On 11/21/23 12:41, Andres Freund wrote:

Sure. They also receive a backup_label today. If an external solution forgets
to replace pg_control copied as part of the filesystem copy, they won't get an
error after the removal of backup_label, just like they don't get one today if
they don't put backup_label in the data directory. Given that users don't do
the right thing with backup_label today, why can we rely on them doing the
right thing with pg_control?

I think reliable backup software does the right thing with backup_label, but
if the user starts getting errors on recovery they then decide to remove
backup_label. I know we can't do much about bad backup software, but we can
at least make this a bit more resistant to user error after the fact.

It doesn't help that one of our hints suggests removing backup_label. In
highly automated systems, the user might not even know they just restored
from a backup. They are only in the loop because the restore failed and they
are trying to figure out what is going wrong. When they remove backup_label
the cluster comes up just fine. Victory!

Yup, this is exactly the issue.

This is the scenario I've seen most often -- not the backup/restore process
getting it wrong but the user removing backup_label on their own initiative.
And because it yields such a positive result, at least initially, they
remember in the future that the thing to do is to remove backup_label
whenever they see the error.

If they only have pg_control, then their only choice is to get it right or
run pg_resetwal. Most users have no knowledge of pg_resetwal so it will take
them longer to get there. Also, I think that tool makes it pretty clear that
corruption will result and the only thing to do is a logical dump and
restore after using it.

Agreed.

There are plenty of ways a user can mess things up. What I'd like to prevent
is the appearance of everything being OK when in fact they have corrupted
their cluster. That's the situation we have now with backup_label. Is this
new solution perfect? No, but I do think it checks several boxes, and is a
worthwhile improvement.

+1.

As for the complaint about 'operators' having issues with the changes
we've been making in this area: where are those people complaining,
exactly? Who are they? I feel like we keep getting this kind of
push-back in this area from folks on this list but not from actual
backup software authors; all the complaints seem to either be
speculative or unattributed pass-through from someone else.

What would really be helpful would be hearing from these individuals
directly as to what the issues are with the changes, such that perhaps
we can do things better in the future to avoid whatever the issue is
they're having with the changes. Simply saying we shouldn't make
changes in this area isn't workable and the constant push-back is
actively discouraging to folks trying to make improvements. Obviously
it's a biased view, but we've not had issues making the necessary
adjustments in pgbackrest with each release and I feel like if the
authors of wal-g or barman did that they would have spoken up.

Making a change as suggested which only helps pg_basebackup (and tools
like pgbackrest, since it's in C and can also make this particular
change) ends up leaving tools like wal-g and barman potentially still
with an easy way for users of those tools to corrupt their databases-
even though we've not heard anything from the authors of those tools
about issues with the proposed change, nor have there been a lot of
complaints from them about the prior changes to indicate that they'd
even have an issue with the more involved change. Given the lack of
complaint about past changes, I'd certainly rather err on the side of
improved safety for users than on the side of the authors of these tools
possibly complaining.

What these changes have done is finally break things like omnipitr
completely, which hasn't been maintained in a very long time. The
changes in v12 broke recovery with omnipitr but not backup, and folks
were trying to use omnipitr as recently as with v13 [1] <https://github.com/omniti-labs/omnipitr/issues/43>. Certainly
having a backup tool that only works for backup (fsvo works, anyway, as
it still used exclusive backup mode meaning that a crash during a backup
would cause the system to not come back up after...) but doesn't work
for recovery isn't exactly great and I'm glad that, now, an attempt to
use omnipitr to perform a backup will fail. As with lots of other areas
of PG, folks need to read the release notes and potentially update their
code for new major versions. If anything, the backup area is less of an
issue for this because the authors of the backup tools are able to make
the change (and who are often the ones pushing for these changes) and
the end-user isn't impacted at all.

Much the same can be said for wal-e, with users still trying to use it
even long after it was stated to be obsolete (the Obsolescence Notice [2]
<https://github.com/wal-e/wal-e/commit/f5b3e790fe10daa098b8cbf01d836c4885dc13c7>
was added in February 2022, though it hadn't been maintained for a while
before that, and an issue was opened in December 2022 asking for it to
be updated to v15 [3] <https://github.com/wal-e/wal-e/issues/433>...).

Thanks,

Stephen

#41

robertmhaas@gmail.com

about 2 years ago

In reply to: Stephen Frost (#40)

Re: Add recovery to pg_control and remove backup_label

On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost <sfrost@snowman.net> wrote:

What would really be helpful would be hearing from these individuals
directly as to what the issues are with the changes, such that perhaps
we can do things better in the future to avoid whatever the issue is
they're having with the changes. Simply saying we shouldn't make
changes in this area isn't workable and the constant push-back is
actively discouraging to folks trying to make improvements. Obviously
it's a biased view, but we've not had issues making the necessary
adjustments in pgbackrest with each release and I feel like if the
authors of wal-g or barman did that they would have spoken up.

I'm happy if people show up to comment on proposed changes, but I
think you're being a little bit unrealistic here. I have had to help
plenty of people who have screwed up their backups in one way or
another, generally by using some home-grown script, sometimes by
misusing some existing backup tool. Those people are EDB customers;
they don't read and participate in discussions here. If they did,
perhaps they wouldn't be paying EDB to have me and my colleagues sort
things out for them when it all goes wrong. I'm not trying to say that
EDB doesn't have customers who participate in mailing list
discussions, because we do, but it's a small minority, and I don't
think that should surprise anyone. Moreover, the people who don't
wouldn't necessarily have the background, expertise, or *time* to
assess specific proposals in detail. If your point is that my
perspective on what's helpful or unhelpful is not valid because I've
only helped 30 people who had problems in this area, but that the
perspective of those 30 people who were helped would be more valid,
well, I don't agree with that. I think your perspective and David's
are valid precisely *because* you've worked a lot on pgbackrest and no
doubt interacted with lots of users; I think Andres's perspective is
valid precisely *because* of his experience working with the fleet at
Microsoft and individual customers at EDB and 2Q before that; and I
think my perspective is valid for the same kinds of reasons.

I am more in agreement with the idea that it would be nice to hear
from backup tool authors, but I think even that has limited value.
Surely we can all agree that if the backup tool is correctly written,
none of this matters, because you'll make the tool do the right things
and then you'll be fine. The difficulty here, and the motivation
behind this proposal and others like it, is that too many users fail
to follow the procedure correctly. If we hear from the authors of
well-written backup tools, I expect they will tell us they can adapt
their tool to whatever we do. And if we hear from the authors of
poorly-written tools, well, I don't think their opinions would form a
great basis for making decisions.

[ lengthy discussion of tools that don't work any more ]

What confuses me here is that you seem to be arguing that we should
*once again* make a breaking change to the backup API, but at the same
time you're acknowledging that there are plenty of tools out there on
the Internet that have gotten broken by previous rounds of changes.
It's only one step from there to conclude that whacking the API around
does more harm than good, but you seem to reject that conclusion.

Personally, I haven't yet seen any evidence that the removal of
exclusive backup mode made any real difference one way or the other. I
think I've heard about people needing to adjust code for it, but not
about that being a problem. I have yet to run into anyone who was
previously using it but, because it was deprecated, switched to doing
something better and safer. Have you?

--
Robert Haas
EDB: http://www.enterprisedb.com

#42

Stephen Frost

sfrost@snowman.net

about 2 years ago

In reply to: Robert Haas (#41)

Re: Add recovery to pg_control and remove backup_label

Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:

On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost <sfrost@snowman.net> wrote:

What would really be helpful would be hearing from these individuals
directly as to what the issues are with the changes, such that perhaps
we can do things better in the future to avoid whatever the issue is
they're having with the changes. Simply saying we shouldn't make
changes in this area isn't workable and the constant push-back is
actively discouraging to folks trying to make improvements. Obviously
it's a biased view, but we've not had issues making the necessary
adjustments in pgbackrest with each release and I feel like if the
authors of wal-g or barman did that they would have spoken up.

I'm happy if people show up to comment on proposed changes, but I
think you're being a little bit unrealistic here. I have had to help
plenty of people who have screwed up their backups in one way or
another, generally by using some home-grown script, sometimes by
misusing some existing backup tool. Those people are EDB customers;
they don't read and participate in discussions here. If they did,
perhaps they wouldn't be paying EDB to have me and my colleagues sort
things out for them when it all goes wrong. I'm not trying to say that
EDB doesn't have customers who participate in mailing list
discussions, because we do, but it's a small minority, and I don't
think that should surprise anyone. Moreover, the people who don't
wouldn't necessarily have the background, expertise, or *time* to
assess specific proposals in detail. If your point is that my
perspective on what's helpful or unhelpful is not valid because I've
only helped 30 people who had problems in this area, but that the
perspective of those 30 people who were helped would be more valid,
well, I don't agree with that. I think your perspective and David's
are valid precisely *because* you've worked a lot on pgbackrest and no
doubt interacted with lots of users; I think Andres's perspective is
valid precisely *because* of his experience working with the fleet at
Microsoft and individual customers at EDB and 2Q before that; and I
think my perspective is valid for the same kinds of reasons.

I didn't mean to imply that anyone's perspective wasn't valid. I was
simply trying to get at the root question of: what *is* the issue with
the changes that are being made? If the answer to that is: we made
this change, which was hard for folks to deal with, and could have
been avoided by doing X, then I really, really want to hear what X
was! If the answer is, well, the changes weren't hard, but we didn't
like having to make any changes at all ... then I just don't have any
sympathy for that. People who write backup software for PG, be it
pgbackrest authors, wal-g authors, or homegrown script authors, will
need to adapt between major versions as we discover things that are
broken (such as exclusive mode, and such as the clear risk that's been
demonstrated of a torn copy of pg_control getting copied, resulting in
a completely invalid backup) and fix them.

I am more in agreement with the idea that it would be nice to hear
from backup tool authors, but I think even that has limited value.
Surely we can all agree that if the backup tool is correctly written,
none of this matters, because you'll make the tool do the right things
and then you'll be fine. The difficulty here, and the motivation
behind this proposal and others like it, is that too many users fail
to follow the procedure correctly. If we hear from the authors of
well-written backup tools, I expect they will tell us they can adapt
their tool to whatever we do. And if we hear from the authors of
poorly-written tools, well, I don't think their opinions would form a
great basis for making decisions.

Uhhh. No, I disagree with this- I'd argue that pgbackrest was broken
until the most recent releases where we implemented a check to ensure
that the pg_control we copy has a valid PG CRC. Did we know it was
broken before this discussion? No, but that doesn't change the fact
that we certainly could have ended up copying an invalid pg_control and
thus have an invalid backup, which even our 'pgbackrest verify' wouldn't
have caught because that just checks that the checksum that pgbackrest
calculates for every file hasn't changed since we copied it- but that
didn't do anything for the issue about pg_control having an invalid
internal checksum due to a torn write when we copied it.
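
To make the distinction above concrete, here is a minimal Python sketch of validating the CRC stored inside a copied pg_control, as opposed to a checksum the backup tool computes over the file after copying it. It is not pgbackrest's implementation. It assumes the file was written on a little-endian machine, and it assumes the caller supplies `crc_offset`, the byte offset of the CRC field, which depends on the PostgreSQL version and is deliberately not hard-coded here. The function names are made up for illustration.

```python
import struct


def crc32c(data: bytes) -> int:
    """Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def control_crc_ok(raw: bytes, crc_offset: int) -> bool:
    """True if the CRC stored at crc_offset matches the bytes before it.

    crc_offset is an assumption supplied by the caller; the stored CRC is
    assumed to be a little-endian 32-bit value.
    """
    if len(raw) < crc_offset + 4:
        return False
    (stored,) = struct.unpack_from("<I", raw, crc_offset)
    return stored == crc32c(raw[:crc_offset])
```

A checksum taken by the backup tool after the copy only proves the copy has not changed since; a check like `control_crc_ok` is what would notice that the bytes were already torn at the moment they were read.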

So, yes, it does matter. We didn't make pgbackrest do the right thing
in this case because we thought it was true that you couldn't get a torn
read of pg_control; Thomas showed that wasn't true and that puts all of
our users at risk. Thankfully somewhat minimal since we always copy
pg_control from the primary ... but still, it's not right, and we've
now taken steps to address it. Unfortunately, other tools are going to
have a more difficult time because they're not written in C, but we
still care about them, and that's why we're pushing for this change- to
allow them to get a pretty much guaranteed valid pg_control from PG to
store without having to figure out how to validate it themselves.
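
For tools that have to read pg_control from disk themselves, one generic mitigation is to re-read the file until a validity check passes. The sketch below shows only that retry pattern; it is not taken from any particular tool, and `read`, `is_valid`, `attempts` and `delay_s` are hypothetical names. The validity check is passed in as a callable, so the CRC logic stays separate.

```python
import time
from typing import Callable


def read_until_valid(read: Callable[[], bytes],
                     is_valid: Callable[[bytes], bool],
                     attempts: int = 5,
                     delay_s: float = 0.1) -> bytes:
    """Re-read until is_valid accepts the bytes, or raise after `attempts` tries."""
    for attempt in range(attempts):
        raw = read()
        if is_valid(raw):
            return raw
        # Sleep only between tries, not after the final failed one.
        if attempt + 1 < attempts:
            time.sleep(delay_s)
    raise RuntimeError(f"no valid copy after {attempts} attempts")
```

The point of the proposal above is that a tool receiving pg_control from the server would not need a loop like this at all.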

[ lengthy discussion of tools that don't work any more ]

What confuses me here is that you seem to be arguing that we should
*once again* make a breaking change to the backup API, but at the same
time you're acknowledging that there are plenty of tools out there on
the Internet that have gotten broken by previous rounds of changes.

The broken ones aren't being maintained. Yes, I'm happy to have those
explicitly and clearly broken. I don't want people using outdated,
broken, and unmaintained tools to back up their PG databases.

It's only one step from there to conclude that whacking the API around
does more harm than good, but you seem to reject that conclusion.

We change the API because it objectively, clearly, addresses real issues
that users can run into that will cause them to have invalid backups if
left the way it is. That backup software authors need to adjust to this
isn't a bad thing- it's a good thing, because we're fixing things and
they should be thrilled to have these issues addressed that they may not
have even considered.

Personally, I haven't yet seen any evidence that the removal of
exclusive backup mode made any real difference one way or the other. I
think I've heard about people needing to adjust code for it, but not
about that being a problem. I have yet to run into anyone who was
previously using it but, because it was deprecated, switched to doing
something better and safer. Have you?

I'm glad that people haven't had a problem adjusting their code to the
removal of exclusive backup mode, that's good, and leaves me, again, a
bit confused at what the issue here is about changing things- apparently
people don't actually have a problem with it, yet it keeps getting
raised as an issue every time we change things in this area. I don't
understand that.

I'm not following the question entirely, I don't think. Most backup
tool authors actively changed to using non-exclusive backup when
exclusive backup mode was deprecated, certainly pgbackrest did and we've
been using non-exclusive backup mode since it was available. Are you
saying that, because everyone moved off of it, we should have kept it?
In that case the answer is clearly no- omnipitr, at the least, didn't
update to non-exclusive and therefore continued to run with the risk
that a crash during a backup would result in a cluster that wouldn't
start without manual intervention (an issue I've definitely heard about
a number of times, even recently) and that manual intervention (remove
the backup_label file) actively results in a *corrupt* cluster if the
user is actually restoring from a backup, which makes it a really terrible
direction to give someone. Here, use this hack- but only if you're 100%
coming back from a crash and absolutely never, ever, ever if you're
actually restoring from a backup.
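
For readers less familiar with the difference, a rough Python sketch of the non-exclusive flow (using the PG15+ function names) follows. The key property is that backup_label is returned by pg_backup_stop() and written into the backup, never into the live data directory, so a crash mid-backup leaves nothing behind that could block startup. The `cur` argument is assumed to be a DB-API style cursor using `%s` placeholders (as psycopg does), and `copy_data_dir` is a hypothetical callable that does the actual file copy.

```python
from pathlib import Path
from typing import Callable


def non_exclusive_backup(cur, copy_data_dir: Callable[[], None],
                         backup_dir: Path, label: str = "sketch") -> str:
    """Run a non-exclusive backup on one session; return the stop LSN as text."""
    # The same connection must stay open from start to stop.
    cur.execute("SELECT pg_backup_start(%s, false)", (label,))
    cur.fetchone()

    copy_data_dir()

    cur.execute("SELECT lsn, labelfile, spcmapfile FROM pg_backup_stop(true)")
    lsn, labelfile, spcmapfile = cur.fetchone()

    # Written into the backup only, byte for byte as returned by the server.
    (backup_dir / "backup_label").write_bytes(labelfile.encode("utf-8"))
    if spcmapfile:
        (backup_dir / "tablespace_map").write_bytes(spcmapfile.encode("utf-8"))
    return str(lsn)
```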

Thanks!

Stephen

#43

vignesh C

vignesh21@gmail.com

almost 2 years ago

In reply to: Michael Paquier (#15)

Re: Add recovery to pg_control and remove backup_label

On Mon, 20 Nov 2023 at 06:46, Michael Paquier <michael@paquier.xyz> wrote:

(I am not exactly sure how, but we've lost pgsql-hackers on the way
when you sent v5. Now added back in CC with the two latest patches
you've proposed attached.)

Here is a short summary of what has been missed by the lists:
- I've commented that the patch should not create control files with a
size of 512B, show them in the fields returned by the SQL functions, or
stream them; just stick to 8kB. If this is worth changing, it should be
applied consistently across the board, including initdb, and discussed
on its own thread.
- The backup-related fields in the control file are reset at the end
of recovery. I've suggested to not do that to keep a trace of what
was happening during recovery. The latest version of the patch resets
the fields.
- With the backup_label file gone, we lose some information in the
backups themselves, which is not good. Instead, you have suggested an
approach where this data is added to the backup manifest, meaning that
no information would be lost, particularly useful for self-contained
backups. The fields planned to be added to the backup manifest are:
-- The start and end time of the backup, the end timestamp being
useful to know when stop time can be used for PITR.
-- The backup label.
I've agreed that it may be the best thing to do at this end to not
lose any data related to the removal of the backup_label file.

On Sun, Nov 19, 2023 at 02:14:32PM -0400, David Steele wrote:

On 11/15/23 20:03, Michael Paquier wrote:

As the label is only an informational field, the parsing added to
pg_verifybackup is not really needed because it is used nowhere in the
validation process, so keeping the logic simpler would be the way to
go IMO. This is contrary to the WAL range for example, where start
and end LSNs are used for validation with a pg_waldump command.
Robert, any comments about the addition of the label in the manifest?

I'm sure Robert will comment on this when he gets the time, but for now I
have backed off on passing the new info to pg_verifybackup and added
start/stop time.

FWIW, I'm OK with the bits for the backup manifest as presented. So
if there are no remarks and/or no objections, I'd like to apply it but
let's give some room to others to comment on that as there's been a gap
in the emails exchanged on pgsql-hackers. I hope that the summary
I've posted above covers everything. So let's see about doing
something around the middle of next week. With Thanksgiving in the
US, a lot of folks will not have the time to monitor what's happening
on this thread.
+      The end time for the backup. This is when the backup was stopped in
+      <productname>PostgreSQL</productname> and represents the earliest time
+      that can be used for time-based Point-In-Time Recovery.
This one is actually a very good point. We'd lost this capacity with
the backup_label file gone without the end timestamps in the control
file.
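
As an illustration of why the end timestamp matters, here is a small Python sketch of how a restore tool might use it: a backup is only a candidate for time-based PITR if its end time is not later than the recovery target. The function name and the dictionary of backup names are hypothetical, and all datetimes are assumed to be timezone-aware.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def pick_backup(backup_end_times: Dict[str, datetime],
                target: datetime) -> Optional[str]:
    """Return the latest backup whose end time is <= target, or None."""
    usable = {name: end for name, end in backup_end_times.items()
              if end <= target}
    if not usable:
        return None
    # The most recent usable backup means the least WAL to replay.
    return max(usable, key=usable.get)
```

Without the end time recorded somewhere in the backup, a tool has no way to make this choice from the backup alone.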

New patches attached based on b218fbb7.

I've noticed on the other thread the remark about being less
aggressive with the fields related to recovery in the control file, so
I assume that this patch should leave the fields be after the end of
recovery from the start and only rely on backupRecoveryRequired to
decide if the recovery should use the fields or not:
/messages/by-id/241ccde1-1928-4ba2-a0bb-5350f7b191a8@pgmasters.net
+       ControlFile->backupCheckPoint = InvalidXLogRecPtr;
        ControlFile->backupStartPoint = InvalidXLogRecPtr;
+       ControlFile->backupStartPointTLI = 0;
        ControlFile->backupEndPoint = InvalidXLogRecPtr;
+       ControlFile->backupFromStandby = false;
        ControlFile->backupEndRequired = false;
Still, I get the temptation of being consistent with the current style
on HEAD to reset everything, as well..

CFBot shows that the patch does not apply anymore as in [1]:

=== Applying patches on top of PostgreSQL commit ID
7014c9a4bba2d1b67d60687afb5b2091c1d07f73 ===
=== applying patch ./recovery-in-pgcontrol-v7-0001-add-info-to-manifest.patch
patching file doc/src/sgml/backup-manifest.sgml
patching file src/backend/backup/backup_manifest.c
patching file src/backend/backup/basebackup.c
Hunk #1 succeeded at 238 (offset 13 lines).
Hunk #2 succeeded at 258 (offset 13 lines).
Hunk #3 succeeded at 399 (offset 17 lines).
Hunk #4 succeeded at 652 (offset 17 lines).
can't find file to patch at input line 219
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--------------------------
|diff --git a/src/bin/pg_verifybackup/parse_manifest.c
b/src/bin/pg_verifybackup/parse_manifest.c
|index bf0227c668..408af88e58 100644
|--- a/src/bin/pg_verifybackup/parse_manifest.c
|+++ b/src/bin/pg_verifybackup/parse_manifest.c
--------------------------
No file to patch. Skipping patch.
9 out of 9 hunks ignored
patching file src/include/backup/backup_manifest.h

Please post an updated version for the same.

[1]: http://cfbot.cputube.org/patch_46_3511.log

Regards,
Vignesh

#44

michael@paquier.xyz

almost 2 years ago

In reply to: vignesh C (#43)

Re: Add recovery to pg_control and remove backup_label

On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote:

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log

With the recent introduction of incremental backups that depend on
backup_label and the rather negative feedback received, I think that
it would be better to return this entry as RwF for now. What do you
think?
--
Michael

#45

david@pgmasters.net

almost 2 years ago

In reply to: Michael Paquier (#44)

Re: Add recovery to pg_control and remove backup_label

On 1/28/24 19:11, Michael Paquier wrote:

On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote:

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log

With the recent introduction of incremental backups that depend on
backup_label and the rather negative feedback received, I think that
it would be better to return this entry as RwF for now. What do you
think?

I've been thinking it makes little sense to update the patch. It would
be a lot of work with all the new changes for incremental backup and
since Andres and Robert appear to be very against the idea, I doubt it
would be worth the effort.

I have withdrawn the patch.

Regards,
-David

#46

david@pgmasters.net

almost 2 years ago

In reply to: David Steele (#45)

Re: Add recovery to pg_control and remove backup_label

On 1/29/24 12:28, David Steele wrote:

On 1/28/24 19:11, Michael Paquier wrote:

On Fri, Jan 26, 2024 at 06:27:30PM +0530, vignesh C wrote:

Please post an updated version for the same.

[1] - http://cfbot.cputube.org/patch_46_3511.log

With the recent introduction of incremental backups that depend on
backup_label and the rather negative feedback received, I think that
it would be better to return this entry as RwF for now. What do you
think?

I've been thinking it makes little sense to update the patch. It would
be a lot of work with all the new changes for incremental backup and
since Andres and Robert appear to be very against the idea, I doubt it
would be worth the effort.

I've had a new idea which may revive this patch. The basic idea is to
keep backup_label but also return a copy of pg_control from
pg_stop_backup(). This copy of pg_control would be safe from tears and
have a backupLabelRequired field set (as Andres suggested) so recovery
cannot proceed without the backup label.

So, everything will continue to work as it does now. But, backup
software can be enhanced to write the improved pg_control that is
guaranteed not to be torn and has protection against a missing backup label.

Of course, pg_basebackup will write the new backupLabelRequired field
into pg_control, but this way third party software can also gain
advantages from the new field.
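
To illustrate the intended protection, a toy Python model of the startup check might look like the following. This is only a sketch of the idea, not the server code: `ControlInfo` and `backup_label_required` are hypothetical stand-ins for pg_control and the proposed backupLabelRequired field, and the error text is invented.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ControlInfo:
    # Hypothetical stand-in for the proposed backupLabelRequired flag.
    backup_label_required: bool


def check_can_start_recovery(control: ControlInfo, data_dir: Path) -> None:
    """Refuse to start if pg_control demands a backup_label that is absent."""
    label_present = (data_dir / "backup_label").exists()
    if control.backup_label_required and not label_present:
        raise RuntimeError(
            "backup_label is required by pg_control but was not found")
```

With a check like this, removing backup_label from a restored backup becomes a hard error instead of silent corruption.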

Thoughts?

Regards,
-David

#47

Stefan Fercot

stefan.fercot@protonmail.com

over 1 year ago

In reply to: David Steele (#46)

Re: Add recovery to pg_control and remove backup_label

Hi,

On Sunday, March 10th, 2024 at 4:47 AM, David Steele wrote:

I've had a new idea which may revive this patch. The basic idea is to
keep backup_label but also return a copy of pg_control from
pg_stop_backup(). This copy of pg_control would be safe from tears and
have a backupLabelRequired field set (as Andres suggested) so recovery
cannot proceed without the backup label.

So, everything will continue to work as it does now. But, backup
software can be enhanced to write the improved pg_control that is
guaranteed not to be torn and has protection against a missing backup label.

Of course, pg_basebackup will write the new backupLabelRequired field
into pg_control, but this way third party software can also gain
advantages from the new field.

Bump on this idea.

Given the discussion in [1] (/messages/by-id/lwXoqQdOT9Nw1tJIx_h7WuqMKrB1YMePQY99RFTZ87H7V52mgUJaSlw2WRbcOgKNUurF1yJqX3nqtZi4hJhtd3e_XlmLsLvnEtGXY-fZPoA=@protonmail.com), even if it obviously makes sense to improve the in-core backup capabilities, the more we go in that direction, the more we'll rely on outside orchestration.
So IMHO it is also worth worrying about giving more leverage to such orchestration tools. In that sense, I really like the idea of extending the backup functions.

More thoughts?

Thanks all,
Kind Regards,
--
Stefan FERCOT
Data Egret (https://dataegret.com)

#48