Return pg_control from pg_backup_stop().

Started by David Steeleover 1 year ago8 messages
#1David Steele
david@pgmasters.net
1 attachment(s)

Hackers,

This is greatly simplified implementation of the patch proposed in [1]/messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net
and hopefully it addresses the concerns expressed there. Since the
implementation is quite different it seemed like a new thread was
appropriate, especially since the old thread title would be very
misleading regarding the new functionality.

The basic idea is to harden recovery by returning a copy of pg_control
from pg_backup_stop() that has a flag set to prevent recovery if the
backup_label file is missing. Instead of backup software copying
pg_control from PGDATA, it stores an updated version that is returned
from pg_backup_stop(). This is better for the following reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If
backup_label is removed the cluster will not start. The user may try
pg_resetwal, but that tool makes it pretty clear that corruption will
result from its use.

* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide
a valid copy via pg_backup_stop(). This solves torn reads without
needing to write pg_control via a temp file, which may affect
performance on a standby.

* For backup from standby, we no longer need to instruct the backup
software to copy pg_control last. In fact the backup software should not
copy pg_control from PGDATA at all.

These changes have no impact on current backup software and they are
free to use the pg_control available from pg_stop_backup() or continue
to use pg_control from PGDATA. Of course they will miss the benefits of
getting a consistent copy of pg_control and the backup_label checking,
but will be no worse off than before.

I'll register this in the July CF.

Regards,
-David

[1]: /messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net
/messages/by-id/2daf8adc-8db7-4204-a7f2-a7e94e2bfa4b@pgmasters.net

Attachments:

pgcontrol-from-backupstop-v1.patchtext/plain; charset=UTF-8; name=pgcontrol-from-backupstop-v1.patchDownload
From 531872dcdb09f5d2f89af44a3c04c9d2d6da89be Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Fri, 17 May 2024 02:42:35 +0000
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the following
reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If backup_label
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

These changes have no impact on current backup software and they are free to use
the pg_control available from pg_stop_backup() or continue to use pg_control from
PGDATA. Of course they will miss the benefits of getting a consistent copy of
pg_control and the backup_label checking, but will be no worse off than before.

Control and catalog version bumps are required.
---
 doc/src/sgml/backup.sgml                    | 18 +++++-
 doc/src/sgml/func.sgml                      | 10 ++-
 src/backend/access/transam/xlog.c           | 19 ++++++
 src/backend/access/transam/xlogfuncs.c      | 17 ++++--
 src/backend/access/transam/xlogrecovery.c   | 10 ++-
 src/backend/backup/basebackup.c             | 19 +++---
 src/backend/catalog/system_functions.sql    |  2 +-
 src/backend/utils/misc/pg_controldata.c     |  7 ++-
 src/bin/pg_controldata/pg_controldata.c     |  2 +
 src/bin/pg_resetwal/pg_resetwal.c           |  1 +
 src/bin/pg_rewind/pg_rewind.c               |  1 +
 src/include/access/xlogbackup.h             |  8 +++
 src/include/catalog/pg_control.h            |  4 ++
 src/include/catalog/pg_proc.dat             | 10 +--
 src/test/recovery/t/002_archiving.pl        | 20 ++++++
 src/test/recovery/t/042_low_level_backup.pl | 67 ++++++++++++++++++++-
 16 files changed, 182 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 91da3c26ba..1b87a05dc5 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1010,9 +1010,12 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      values. The second of these fields should be written to a file named
      <filename>backup_label</filename> in the root directory of the backup. The
      third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <filename>tablespace_map</filename> unless the field is empty. The fourth
+     field should be written into a file named
+     <filename>global/pg_control</filename> (overwriting the existing file when
+     present). These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
@@ -1084,7 +1087,16 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You should exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>controlfile</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This version of pg_control contains
+    safeguards against recovery without backup_label present and is guaranteed
+    not to be torn.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 17c44bc338..88f942af44 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27697,6 +27697,11 @@ SELECT currval(pg_get_serial_sequence('sometable', 'id'));
        <entry><type>boolean</type></entry>
       </row>
 
+      <row>
+       <entry><structfield>backup_label_required</structfield></entry>
+       <entry><type>boolean</type></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
@@ -28355,8 +28360,9 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         The result of the function is a single record.
         The <parameter>lsn</parameter> column holds the backup's ending
         write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
+        column returns the contents of the backup label file, the third column
+        returns the contents of the tablespace map file, and the fourth column
+        returns the contents of pg_control.  These must be
         stored as part of the backup and are required as part of the restore
         process.
        </para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c3fd9c1eae..7d7c3d727e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9048,11 +9048,30 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	ControlFileData *controlFileCopy = (ControlFileData *)state->controlFile;
 
 	Assert(state != NULL);
 
 	backup_stopped_in_recovery = RecoveryInProgress();
 
+	/*
+	 * Create a copy of control data and update it to require a backup label
+	 * for recovery. Also recalculate the CRC.
+	 */
+	memset(
+		state->controlFile + sizeof(ControlFileData), 0,
+		PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFileCopy, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlFileCopy->backupLabelRequired = true;
+
+	INIT_CRC32C(controlFileCopy->crc);
+	COMP_CRC32C(controlFileCopy->crc, controlFileCopy, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlFileCopy->crc);
+
 	/*
 	 * During recovery, we don't need to check WAL level. Because, if WAL
 	 * level is not sufficient, it's impossible to get here during recovery.
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 92bdb17ed5..2576cacf22 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -112,9 +112,11 @@ pg_backup_start(PG_FUNCTION_ARGS)
  *
  * The backup_label contains the user-supplied label string (typically this
  * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * time, starting WAL location for the dump and so on.  The pg_control file
+ * contains represents a consistent copy of pg_control that also has a safeguard
+ * against being used without backup_label.  It is the caller's responsibility
+ * to write the backup_label, pg_control, and tablespace_map files in the data
+ * folder that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -122,12 +124,13 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 4
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
 	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -149,9 +152,15 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Build the contents of backup_label */
 	backup_label = build_backup_content(backup_state, false);
 
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_FILE_SIZE);
+
 	values[0] = LSNGetDatum(backup_state->stoppoint);
 	values[1] = CStringGetTextDatum(backup_label);
 	values[2] = CStringGetTextDatum(tablespace_map->data);
+	values[3] = PointerGetDatum(pg_control_bytea);
 
 	/* Deallocate backup-related variables */
 	pfree(backup_label);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 29c5bec084..0940e4730e 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -703,7 +703,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
+		/*
+		 * No backup_label file has been found if we are here. Error if the
+		 * control file requires backup_label.
+		 */
+		if (ControlFile->backupLabelRequired)
+			ereport(FATAL,
+					(errmsg("could not find backup_label required for recovery"),
+					 errhint("backup_label must be present for recovery to succeed")));
 
 		/*
 		 * If tablespace_map file is present without backup_label file, there
@@ -976,6 +983,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
+			ControlFile->backupLabelRequired = false;
 
 			if (backupFromStandby)
 			{
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 9a2bf59e84..ae0f83416a 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -324,7 +324,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
 
@@ -348,15 +347,14 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 				sendDir(sink, ".", 1, false, state.tablespaces,
 						sendtblspclinks, &manifest, InvalidOid, ib);
 
+				/* End the backup before sending pg_control */
+				basebackup_progress_wait_wal_archive(&state);
+				do_pg_backup_stop(backup_state, !opt->nowait);
+
 				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *)backup_state->controlFile,
+									PG_CONTROL_FILE_SIZE, &manifest);
 			}
 			else
 			{
@@ -390,9 +388,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 			}
 		}
 
-		basebackup_progress_wait_wal_archive(&state);
-		do_pg_backup_stop(backup_state, !opt->nowait);
-
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index ae099e328c..d76d3c608f 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -390,7 +390,7 @@ CREATE OR REPLACE FUNCTION
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
         wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        OUT labelfile text, OUT spcmapfile text, OUT controlfile bytea)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 98c932dc7b..9eaf3f8b9f 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -162,8 +162,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[6];
+	bool		nulls[6];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -195,6 +195,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
 	nulls[4] = false;
 
+	values[5] = BoolGetDatum(ControlFile->backupLabelRequired);
+	nulls[5] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93e0837947..2114488dc0 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -283,6 +283,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("Backup label required:                %s\n"),
+		   ControlFile->backupLabelRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..7056752cff 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -875,6 +875,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.backupLabelRequired = false;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 8449ae78ef..7c27d47110 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -722,6 +722,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupLabelRequired = true;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index c30d4a5991..93a0e5c5cc 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -15,6 +15,7 @@
 #define XLOG_BACKUP_H
 
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "pgtime.h"
 
 /* Structure to hold backup state. */
@@ -35,6 +36,13 @@ typedef struct BackupState
 	XLogRecPtr	stoppoint;		/* backup stop WAL location */
 	TimeLineID	stoptli;		/* backup stop TLI */
 	pg_time_t	stoptime;		/* backup stop time */
+
+	/*
+	 * After pg_backup_stop() returns this field will contain a copy of
+	 * pg_control that should be stored with the backup. backupLabelRequired
+	 * has been set so backup_label will be required for recovery to start.
+	 */
+	uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 } BackupState;
 
 extern char *build_backup_content(BackupState *state,
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index a00606ffcd..71a9beb05a 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -163,12 +163,16 @@ typedef struct ControlFileData
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
+	 *
+	 * If backupLabelRequired is true, then a backup_label file must be
+	 * present in order for recovery to succeed.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	bool		backupLabelRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 6a5476d3c4..6cb87fb744 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6463,8 +6463,8 @@
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,pg_lsn,text,text,bytea}', proargmodes => '{i,o,o,o,o}',
+  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile,controlfile}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
@@ -11981,9 +11981,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required,backup_label_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index a2e012e42d..653546a6c0 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -41,6 +41,26 @@ $node_standby->append_conf(
 archive_cleanup_command = 'echo archive_cleanup_done > $archive_cleanup_command_file'
 recovery_end_command = 'echo recovery_ended_done > $recovery_end_command_file'
 ));
+
+# Rename backup_label to verify that recovery will not start without it
+rename("$data_dir/backup_label", "$data_dir/backup_label.tmp")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_standby->data_dir, '-l',
+		$node_standby->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_standby->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+# Restore backup_label so recovery proceeds normally
+rename("$data_dir/backup_label.tmp", "$data_dir/backup_label")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
 $node_standby->start;
 
 # Create some content on primary
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index 61d23187e0..bd3a99960f 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -13,6 +13,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Decode hex to binary
+sub decode_hex
+{
+	my ($encoded) = @_;
+	my $decoded;
+
+	$encoded =~ s/^\s+|\s+$//g;
+
+	for (my $idx = 0; $idx < length($encoded); $idx += 2)
+	{
+		$decoded .= pack('C', hex(substr($encoded, $idx, 2)));
+	}
+
+	return $decoded;
+}
+
+# Get backup_label/pg_control from pg_stop_backup()
+sub stop_backup_result
+{
+	my ($psql) = @_;
+
+	my $encoded = $psql->query_safe(
+		"select encode(labelfile::bytea, 'hex') || ',' || " .
+		"       encode(controlfile, 'hex')" .
+		"  from pg_backup_stop()");
+
+	my @result;
+
+    foreach my $column (split(',', $encoded))
+	{
+		push(@result, decode_hex($column));
+	}
+
+	return @result;
+}
+
 # Start primary node with archiving.
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(has_archiving => 1, allows_streaming => 1);
@@ -80,8 +116,7 @@ my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
 # Stop backup and get backup_label, the last segment is archived.
-my $backup_label =
-  $psql->query_safe("select labelfile from pg_backup_stop()");
+(my $backup_label, my $pg_control) = stop_backup_result($psql);
 
 $psql->quit;
 
@@ -118,10 +153,36 @@ ok( $node_replica->log_contains(
 $node_replica->teardown_node;
 $node_replica->clean_node;
 
+# Save only pg_control into the backup to demonstrate that pg_control returned
+# from pg_stop_backup() will only perform recovery when backup_label is present.
+open(my $fh, ">", "$backup_dir/global/pg_control")
+  or die "could not open pg_control";
+binmode($fh);
+syswrite($fh, $pg_control);
+close($fh);
+
+$node_replica = PostgreSQL::Test::Cluster->new('replica_fail2');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_replica->data_dir, '-l',
+		$node_replica->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_replica->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+$node_replica->teardown_node;
+$node_replica->clean_node;
+
 # Save backup_label into the backup directory and recover using the primary's
 # archive.  This time recovery will succeed and the canary table will be
 # present.
-open my $fh, ">>", "$backup_dir/backup_label"
+open $fh, ">>", "$backup_dir/backup_label"
   or die "could not open backup_label";
 # Binary mode is required for Windows, as the backup_label parsing is not
 # able to cope with CRLFs.
-- 
2.34.1

#2Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#1)
Re: Return pg_control from pg_backup_stop().

On Fri, May 17, 2024 at 12:46:49PM +1000, David Steele wrote:

This is greatly simplified implementation of the patch proposed in [1] and
hopefully it addresses the concerns expressed there. Since the
implementation is quite different it seemed like a new thread was
appropriate, especially since the old thread title would be very misleading
regarding the new functionality.

-        /* No backup_label file has been found if we are here. */
+        /*
+         * No backup_label file has been found if we are here. Error if the
+         * control file requires backup_label.
+         */
+        if (ControlFile->backupLabelRequired)
+            ereport(FATAL,
+                    (errmsg("could not find backup_label required for recovery"),
+                     errhint("backup_label must be present for recovery to succeed")));

I thought that this had some similarities with my last fight in this
area, where xlogrecovery.c would fail hard if there was a backup_label
file but no signal files, but nope that's not the case:
/messages/by-id/ZArVOMifjzE7f8W7@paquier.xyz

The basic idea is to harden recovery by returning a copy of pg_control from
pg_backup_stop() that has a flag set to prevent recovery if the backup_label
file is missing. Instead of backup software copying pg_control from PGDATA,
it stores an updated version that is returned from pg_backup_stop().

Hmm, okay. There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces. As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces. That seems slightly
cleaner to me, still I agree that both are the same things.

Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.

* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide a
valid copy via pg_backup_stop(). This solves torn reads without needing to
write pg_control via a temp file, which may affect performance on a standby.

We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?

* For backup from standby, we no longer need to instruct the backup software
to copy pg_control last. In fact the backup software should not copy
pg_control from PGDATA at all.

Yep. Not from PGDATA, but from the function.

These changes have no impact on current backup software and they are free to
use the pg_control available from pg_stop_backup() or continue to use
pg_control from PGDATA. Of course they will miss the benefits of getting a
consistent copy of pg_control and the backup_label checking, but will be no
worse off than before.

There is a large comment block in do_pg_backup_stop() around
backup_stopped_in_recovery. Perhaps it should be improved based on
this patch.

The main concern that I have over this patch is: who is actually going
to use this extension of the SQL stop function? Perhaps existing
backup solutions are good enough risk vs reward is not worth it? The
label_file and the tablespace map are text, this is a series of bytes
that has no visibility for the end-user unless checked on the
client-side. This adds a new step where backups would need to copy
the control file to the data folder.
--
Michael

#3David Steele
david@pgbackrest.org
In reply to: David Steele (#1)
1 attachment(s)
Re: Return pg_control from pg_backup_stop().

On 10/2/24 10:11, Michael Paquier wrote:

On Fri, May 17, 2024 at 12:46:49PM +1000, David Steele wrote:

The basic idea is to harden recovery by returning a copy of pg_control from
pg_backup_stop() that has a flag set to prevent recovery if the backup_label
file is missing. Instead of backup software copying pg_control from PGDATA,
it stores an updated version that is returned from pg_backup_stop().

Hmm, okay. There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces. As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces. That seems slightly
cleaner to me, still I agree that both are the same things.

Sending pg_control later results in even more code churn because of how
tar files are terminated. I've updated it that way in v2 so you can see
what I mean. I don't have a strong preference, though, so if you prefer
the implementation in v2 then that's fine with me.

Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.

I can definitely see us making other updates to pg_control so I would
rather keep this logic centralized, even though it is not too
complicated at this point. Still, even 8 lines of code (as it is now)
seems better not to duplicate.

* We don't need to worry about backup software seeing a torn copy of
pg_control, since Postgres can safely read it out of memory and provide a
valid copy via pg_backup_stop(). This solves torn reads without needing to
write pg_control via a temp file, which may affect performance on a standby.

We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?

Even at 512B it is possible to see tears in pg_control and they happen
in the build farm right now. In fact, this thread [1]/messages/by-id/CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com trying to fix the
problem was what got me thinking about alternate solutions to preventing
tears in pg_control. Thomas' proposed fixes have not been committed to
my knowledge so the problem remains, but would be fixed by this commit.

There is a large comment block in do_pg_backup_stop() around
backup_stopped_in_recovery. Perhaps it should be improved based on
this patch.

I added a sentence to this comment block in v2.

The main concern that I have over this patch is: who is actually going
to use this extension of the SQL stop function?

Primarily existing backup software, I would imagine. The idea is that it
would give them feature parity with pg_basebackup, rather than the new
protections being exclusive to pg_basebackup.

Perhaps existing
backup solutions are good enough risk vs reward is not worth it?

I'm not sure I see the risk here. Saving out pg_control is optional so
no changes to current software is required. Of course they miss the
benefit of the protection against tears and missing backup_label, but
that is a choice.

Also, no matter what current backup solutions do, they cannot prevent a
user from removing backup_label after restore. This patch prevents
invalid recovery when that happens.

The
label_file and the tablespace map are text, this is a series of bytes
that has no visibility for the end-user unless checked on the
client-side. This adds a new step where backups would need to copy
the control file to the data folder.

Again, optional, but if I was able to manage these saves using the psql
interface in the TAP tests then I'd say it would be pretty easy for
anyone with a normal connection to Postgres. Also, we require users to
treat tabelspace_map and backup_label as binary so not too big a change
here.

[1]: /messages/by-id/CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com
/messages/by-id/CA+hUKG+jig+QdBETj_ab++VWSoJjbwT3sLNCnk0wFsY_6tRqoA@mail.gmail.com

Attachments:

pgcontrol-from-backupstop-v2.patchtext/plain; charset=UTF-8; name=pgcontrol-from-backupstop-v2.patchDownload
From f9ebd108a9c5dc55fa485dc8c60eb11d2c16715c Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Wed, 2 Oct 2024 08:59:12 +0000
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the following
reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If backup_label
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

These changes have no impact on current backup software and they are free to use
the pg_control available from pg_stop_backup() or continue to use pg_control from
PGDATA. Of course they will miss the benefits of getting a consistent copy of
pg_control and the backup_label checking, but will be no worse off than before.

Control and catalog version bumps are required.
---
 doc/src/sgml/backup.sgml                    | 18 +++++-
 doc/src/sgml/func.sgml                      | 10 ++-
 src/backend/access/transam/xlog.c           | 23 ++++++-
 src/backend/access/transam/xlogfuncs.c      | 17 ++++--
 src/backend/access/transam/xlogrecovery.c   | 10 ++-
 src/backend/backup/basebackup.c             | 34 +++++------
 src/backend/catalog/system_functions.sql    |  2 +-
 src/backend/utils/misc/pg_controldata.c     |  7 ++-
 src/bin/pg_controldata/pg_controldata.c     |  2 +
 src/bin/pg_resetwal/pg_resetwal.c           |  1 +
 src/bin/pg_rewind/pg_rewind.c               |  1 +
 src/include/access/xlogbackup.h             |  8 +++
 src/include/catalog/pg_control.h            |  4 ++
 src/include/catalog/pg_proc.dat             | 10 +--
 src/test/recovery/t/002_archiving.pl        | 20 ++++++
 src/test/recovery/t/042_low_level_backup.pl | 67 ++++++++++++++++++++-
 16 files changed, 192 insertions(+), 42 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index e4e4c56cf1..2fcf181121 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1021,9 +1021,12 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      values. The second of these fields should be written to a file named
      <filename>backup_label</filename> in the root directory of the backup. The
      third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <filename>tablespace_map</filename> unless the field is empty. The fourth
+     field should be written into a file named
+     <filename>global/pg_control</filename> (overwriting the existing file when
+     present). These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
@@ -1095,7 +1098,16 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You should exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>controlfile</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This version of pg_control contains
+    safeguards against recovery without backup_label present and is guaranteed
+    not to be torn.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index d6acdd3059..a2622014ae 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27899,6 +27899,11 @@ acl      | {postgres=arwdDxtm/postgres,foo=r/postgres}
        <entry><type>boolean</type></entry>
       </row>
 
+      <row>
+       <entry><structfield>backup_label_required</structfield></entry>
+       <entry><type>boolean</type></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
@@ -28576,8 +28581,9 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         The result of the function is a single record.
         The <parameter>lsn</parameter> column holds the backup's ending
         write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
+        column returns the contents of the backup label file, the third column
+        returns the contents of the tablespace map file, and the fourth column
+        returns the contents of pg_control.  These must be
         stored as part of the backup and are required as part of the restore
         process.
        </para>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9102c8d772..d28f155700 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9110,7 +9110,9 @@ get_backup_status(void)
  * wait for WAL segments to be archived.
  *
  * "state" is filled with the information necessary to restore from this
- * backup with its stop LSN (stoppoint), its timeline ID (stoptli), etc.
+ * backup with its stop LSN (stoppoint), its timeline ID (stoptli), etc. This
+ * includes a copy of pg_control with safeguards against it being used without
+ * backup_label.
  *
  * It is the responsibility of the caller of this function to verify the
  * permissions of the calling user!
@@ -9127,11 +9129,30 @@ do_pg_backup_stop(BackupState *state, bool waitforarchive)
 	int			seconds_before_warning;
 	int			waits = 0;
 	bool		reported_waiting = false;
+	ControlFileData *controlFileCopy = (ControlFileData *)state->controlFile;
 
 	Assert(state != NULL);
 
 	backup_stopped_in_recovery = RecoveryInProgress();
 
+	/*
+	 * Create a copy of control data and update it to require a backup label
+	 * for recovery. Also recalculate the CRC.
+	 */
+	memset(
+		state->controlFile + sizeof(ControlFileData), 0,
+		PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFileCopy, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlFileCopy->backupLabelRequired = true;
+
+	INIT_CRC32C(controlFileCopy->crc);
+	COMP_CRC32C(controlFileCopy->crc, controlFileCopy, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlFileCopy->crc);
+
 	/*
 	 * During recovery, we don't need to check WAL level. Because, if WAL
 	 * level is not sufficient, it's impossible to get here during recovery.
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e3d2bb618..fc2401d859 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -115,9 +115,11 @@ pg_backup_start(PG_FUNCTION_ARGS)
  *
  * The backup_label contains the user-supplied label string (typically this
  * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * time, starting WAL location for the dump and so on.  The pg_control file
+ * contains represents a consistent copy of pg_control that also has a safeguard
+ * against being used without backup_label.  It is the caller's responsibility
+ * to write the backup_label, pg_control, and tablespace_map files in the data
+ * folder that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -125,12 +127,13 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 4
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
 	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -152,9 +155,15 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Build the contents of backup_label */
 	backup_label = build_backup_content(backup_state, false);
 
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), backup_state->controlFile, PG_CONTROL_FILE_SIZE);
+
 	values[0] = LSNGetDatum(backup_state->stoppoint);
 	values[1] = CStringGetTextDatum(backup_label);
 	values[2] = CStringGetTextDatum(tablespace_map->data);
+	values[3] = PointerGetDatum(pg_control_bytea);
 
 	/* Deallocate backup-related variables */
 	pfree(backup_label);
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 320b14add1..193b7e2045 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -704,7 +704,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
+		/*
+		 * No backup_label file has been found if we are here. Error if the
+		 * control file requires backup_label.
+		 */
+		if (ControlFile->backupLabelRequired)
+			ereport(FATAL,
+					(errmsg("could not find backup_label required for recovery"),
+					 errhint("backup_label must be present for recovery to succeed")));
 
 		/*
 		 * If tablespace_map file is present without backup_label file, there
@@ -977,6 +984,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
+			ControlFile->backupLabelRequired = false;
 
 			if (backupFromStandby)
 			{
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 14e5ba72e9..d0d4e74cb0 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -325,7 +325,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
 
@@ -348,16 +347,6 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 				/* Then the bulk of the files... */
 				sendDir(sink, ".", 1, false, state.tablespaces,
 						sendtblspclinks, &manifest, InvalidOid, ib);
-
-				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
 			}
 			else
 			{
@@ -374,7 +363,7 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 			 * include the xlog files below and stop afterwards. This is safe
 			 * since the main data directory is always sent *last*.
 			 */
-			if (opt->includewal && ti->path == NULL)
+			if (ti->path == NULL)
 			{
 				Assert(lnext(state.tablespaces, lc) == NULL);
 			}
@@ -394,6 +383,11 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 		basebackup_progress_wait_wal_archive(&state);
 		do_pg_backup_stop(backup_state, !opt->nowait);
 
+		/* Send pg_control */
+		sendFileWithContent(sink, XLOG_CONTROL_FILE,
+							(char *)backup_state->controlFile,
+							PG_CONTROL_FILE_SIZE, &manifest);
+
 		endptr = backup_state->stoppoint;
 		endtli = backup_state->stoptli;
 
@@ -632,16 +626,16 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 			StatusFilePath(pathbuf, fname, ".done");
 			sendFileWithContent(sink, pathbuf, "", -1, &manifest);
 		}
+	}
 
-		/* Properly terminate the tar file. */
-		StaticAssertStmt(2 * TAR_BLOCK_SIZE <= BLCKSZ,
-						 "BLCKSZ too small for 2 tar blocks");
-		memset(sink->bbs_buffer, 0, 2 * TAR_BLOCK_SIZE);
-		bbsink_archive_contents(sink, 2 * TAR_BLOCK_SIZE);
+	/* Properly terminate the tar file. */
+	StaticAssertStmt(2 * TAR_BLOCK_SIZE <= BLCKSZ,
+						"BLCKSZ too small for 2 tar blocks");
+	memset(sink->bbs_buffer, 0, 2 * TAR_BLOCK_SIZE);
+	bbsink_archive_contents(sink, 2 * TAR_BLOCK_SIZE);
 
-		/* OK, that's the end of the archive. */
-		bbsink_end_archive(sink);
-	}
+	/* OK, that's the end of the archive. */
+	bbsink_end_archive(sink);
 
 	AddWALInfoToBackupManifest(&manifest, state.startptr, state.starttli,
 							   endptr, endtli);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index b0d0de051e..758f90b2c3 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -390,7 +390,7 @@ CREATE OR REPLACE FUNCTION
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
         wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        OUT labelfile text, OUT spcmapfile text, OUT controlfile bytea)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 98c932dc7b..9eaf3f8b9f 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -162,8 +162,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[6];
+	bool		nulls[6];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -195,6 +195,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
 	nulls[4] = false;
 
+	values[5] = BoolGetDatum(ControlFile->backupLabelRequired);
+	nulls[5] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..8cc29904c6 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -283,6 +283,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("Backup label required:                %s\n"),
+		   ControlFile->backupLabelRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..7056752cff 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -875,6 +875,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.backupLabelRequired = false;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 960916a1e8..79a44b70ee 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -722,6 +722,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupLabelRequired = true;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
diff --git a/src/include/access/xlogbackup.h b/src/include/access/xlogbackup.h
index c30d4a5991..93a0e5c5cc 100644
--- a/src/include/access/xlogbackup.h
+++ b/src/include/access/xlogbackup.h
@@ -15,6 +15,7 @@
 #define XLOG_BACKUP_H
 
 #include "access/xlogdefs.h"
+#include "catalog/pg_control.h"
 #include "pgtime.h"
 
 /* Structure to hold backup state. */
@@ -35,6 +36,13 @@ typedef struct BackupState
 	XLogRecPtr	stoppoint;		/* backup stop WAL location */
 	TimeLineID	stoptli;		/* backup stop TLI */
 	pg_time_t	stoptime;		/* backup stop time */
+
+	/*
+	 * After pg_backup_stop() returns this field will contain a copy of
+	 * pg_control that should be stored with the backup. backupLabelRequired
+	 * has been set so backup_label will be required for recovery to start.
+	 */
+	uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 } BackupState;
 
 extern char *build_backup_content(BackupState *state,
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e414..b471a9b02e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -164,12 +164,16 @@ typedef struct ControlFileData
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
+	 *
+	 * If backupLabelRequired is true, then a backup_label file must be
+	 * present in order for recovery to succeed.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	bool		backupLabelRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77f54a79e6..ea93241eed 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6569,8 +6569,8 @@
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,pg_lsn,text,text,bytea}', proargmodes => '{i,o,o,o,o}',
+  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile,controlfile}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
@@ -12119,9 +12119,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required,backup_label_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index bc447330e1..462f1dcf0d 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -41,6 +41,26 @@ $node_standby->append_conf(
 archive_cleanup_command = 'echo archive_cleanup_done > $archive_cleanup_command_file'
 recovery_end_command = 'echo recovery_ended_done > $recovery_end_command_file'
 ));
+
+# Rename backup_label to verify that recovery will not start without it
+rename("$data_dir/backup_label", "$data_dir/backup_label.tmp")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_standby->data_dir, '-l',
+		$node_standby->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_standby->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+# Restore backup_label so recovery proceeds normally
+rename("$data_dir/backup_label.tmp", "$data_dir/backup_label")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
 $node_standby->start;
 
 # Create some content on primary
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index 61d23187e0..bd3a99960f 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -13,6 +13,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Decode hex to binary
+sub decode_hex
+{
+	my ($encoded) = @_;
+	my $decoded;
+
+	$encoded =~ s/^\s+|\s+$//g;
+
+	for (my $idx = 0; $idx < length($encoded); $idx += 2)
+	{
+		$decoded .= pack('C', hex(substr($encoded, $idx, 2)));
+	}
+
+	return $decoded;
+}
+
+# Get backup_label/pg_control from pg_stop_backup()
+sub stop_backup_result
+{
+	my ($psql) = @_;
+
+	my $encoded = $psql->query_safe(
+		"select encode(labelfile::bytea, 'hex') || ',' || " .
+		"       encode(controlfile, 'hex')" .
+		"  from pg_backup_stop()");
+
+	my @result;
+
+    foreach my $column (split(',', $encoded))
+	{
+		push(@result, decode_hex($column));
+	}
+
+	return @result;
+}
+
 # Start primary node with archiving.
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(has_archiving => 1, allows_streaming => 1);
@@ -80,8 +116,7 @@ my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
 # Stop backup and get backup_label, the last segment is archived.
-my $backup_label =
-  $psql->query_safe("select labelfile from pg_backup_stop()");
+(my $backup_label, my $pg_control) = stop_backup_result($psql);
 
 $psql->quit;
 
@@ -118,10 +153,36 @@ ok( $node_replica->log_contains(
 $node_replica->teardown_node;
 $node_replica->clean_node;
 
+# Save only pg_control into the backup to demonstrate that pg_control returned
+# from pg_stop_backup() will only perform recovery when backup_label is present.
+open(my $fh, ">", "$backup_dir/global/pg_control")
+  or die "could not open pg_control";
+binmode($fh);
+syswrite($fh, $pg_control);
+close($fh);
+
+$node_replica = PostgreSQL::Test::Cluster->new('replica_fail2');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_replica->data_dir, '-l',
+		$node_replica->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_replica->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+$node_replica->teardown_node;
+$node_replica->clean_node;
+
 # Save backup_label into the backup directory and recover using the primary's
 # archive.  This time recovery will succeed and the canary table will be
 # present.
-open my $fh, ">>", "$backup_dir/backup_label"
+open $fh, ">>", "$backup_dir/backup_label"
   or die "could not open backup_label";
 # Binary mode is required for Windows, as the backup_label parsing is not
 # able to cope with CRLFs.
-- 
2.34.1

#4Michael Paquier
michael@paquier.xyz
In reply to: David Steele (#3)
Re: Return pg_control from pg_backup_stop().

On Wed, Oct 02, 2024 at 09:03:27AM +0000, David Steele wrote:

On 10/2/24 10:11, Michael Paquier wrote:

Hmm, okay. There is also a slight impact for BASE_BACKUP, requiring
basebackup_progress_wait_wal_archive() and do_pg_backup_stop() to be
called earlier when sending the main data directory which is the last
one in the list of tablespaces. As far as I can see, this does not
change the logic because do_pg_backup_stop() does not touch the
control file, but it seems to me that we'd rather keep these two
calls as they are now, and send the control file once we are out of
the for loop that processes all the tablespaces. That seems slightly
cleaner to me, still I agree that both are the same things.

Sending pg_control later results in even more code churn because of how tar
files are terminated. I've updated it that way in v2 so you can see what I
mean. I don't have a strong preference, though, so if you prefer the
implementation in v2 then that's fine with me.

It does not make much of a difference, indeed.

Anyway, finishing do_pg_backup_stop() and then sending the control
file is just a consequence of the implementation choice driven by the
output required for the SQL function so as this is stored in the
backup state to get it down to pg_backup_stop() in xlogfuncs.c, so we
could also take one step back and forget about the SQL function,
setting only the new flag when sending a BASE_BACKUP, or just not use
the backupState to store this data as the exercise involves forcing
one boolean and recalculate a CRC32.

I can definitely see us making other updates to pg_control so I would rather
keep this logic centralized, even though it is not too complicated at this
point. Still, even 8 lines of code (as it is now) seems better not to
duplicate.

I was wondering if the field update should be hidden behind a macro
that uses an offsetof() on ControlFileData, with the name of the field
and a pointer to the value to update to. If you include the CRC32
calculation in that, that makes for less chunks of code when updating
one field of the control file.

The full CRC calculation could also be hidden inside a macro, as there
are a couple of places where we do the same things, like pg_rewind.c,
etc.

We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?

Even at 512B it is possible to see tears in pg_control and they happen in
the build farm right now. In fact, this thread [1] trying to fix the problem
was what got me thinking about alternate solutions to preventing tears in
pg_control. Thomas' proposed fixes have not been committed to my knowledge
so the problem remains, but would be fixed by this commit.

Ah, right. That rings a bell. Thomas has done some work with
c558e6fd92ff and 63a582222c6b. And we're still not taking the
ControlFileLock while copying it over.. It looks like we should do it
separately, and backpatch. That's not something for this thread to
worry about.

Perhaps existing
backup solutions are good enough risk vs reward is not worth it?

I'm not sure I see the risk here. Saving out pg_control is optional so no
changes to current software is required. Of course they miss the benefit of
the protection against tears and missing backup_label, but that is a choice.

Again, optional, but if I was able to manage these saves using the psql
interface in the TAP tests then I'd say it would be pretty easy for anyone
with a normal connection to Postgres. Also, we require users to treat
tabelspace_map and backup_label as binary so not too big a change here.

Maintenance cost for a limited user impact overall. With incremental
backups being a thing in v18 only available through the replication
protocol, the SQL functions have less advantages these days. My point
would be to see this thread as a two-step process:
1) Update the flag in the control file when sending it across in
replication stream.
2) Do the SQL function thing with the bytea for the control file, if
necessary.

1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol. Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from. The SQL part is optional IMO. It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.
--
Michael

#5David Steele
david@pgbackrest.org
In reply to: Michael Paquier (#4)
2 attachment(s)
Re: Return pg_control from pg_backup_stop().

On 10/3/24 07:45, Michael Paquier wrote:

On Wed, Oct 02, 2024 at 09:03:27AM +0000, David Steele wrote:

On 10/2/24 10:11, Michael Paquier wrote:

I can definitely see us making other updates to pg_control so I would rather
keep this logic centralized, even though it is not too complicated at this
point. Still, even 8 lines of code (as it is now) seems better not to
duplicate.

I was wondering if the field update should be hidden behind a macro
that uses an offsetof() on ControlFileData, with the name of the field
and a pointer to the value to update to. If you include the CRC32
calculation in that, that makes for less chunks of code when updating
one field of the control file.

The full CRC calculation could also be hidden inside a macro, as there
are a couple of places where we do the same things, like pg_rewind.c,
etc.

This seems to be a different case than pg_rewind, especially since we
need a ControlFileLock. I think that is a bit much to do in a macro, so
I split the functionality out into a function instead. This simplifies
the logic in basebackup.c but has little impact elsewhere.

We're talking about a 8kB file which has a size of 512B
(PG_CONTROL_MAX_SAFE_SIZE) to avoid such issues. So I'm not sure to
see your point here?

Even at 512B it is possible to see tears in pg_control and they happen in
the build farm right now. In fact, this thread [1] trying to fix the problem
was what got me thinking about alternate solutions to preventing tears in
pg_control. Thomas' proposed fixes have not been committed to my knowledge
so the problem remains, but would be fixed by this commit.

Ah, right. That rings a bell. Thomas has done some work with
c558e6fd92ff and 63a582222c6b. And we're still not taking the
ControlFileLock while copying it over.. It looks like we should do it
separately, and backpatch. That's not something for this thread to
worry about.

I'd be happy to adapt patch 01 to be back-patched (without the new flag)
if we decide it is a good idea. Just locking and making a copy of
pg_control is easy enough, but if we accept the backup_control_file()
function for new versions then we could keep that for the back patch to
reduce churn between versions.

Perhaps existing
backup solutions are good enough risk vs reward is not worth it?

I'm not sure I see the risk here. Saving out pg_control is optional so no
changes to current software is required. Of course they miss the benefit of
the protection against tears and missing backup_label, but that is a choice.

Again, optional, but if I was able to manage these saves using the psql
interface in the TAP tests then I'd say it would be pretty easy for anyone
with a normal connection to Postgres. Also, we require users to treat
tabelspace_map and backup_label as binary so not too big a change here.

Maintenance cost for a limited user impact overall. With incremental
backups being a thing in v18 only available through the replication
protocol, the SQL functions have less advantages these days. My point
would be to see this thread as a two-step process:
1) Update the flag in the control file when sending it across in
replication stream.
2) Do the SQL function thing with the bytea for the control file, if
necessary.

OK, I have split the patch into two parts along these lines.

1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol. Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from. The SQL part is optional IMO. It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.

I don't think having incremental backup in pg_basebackup means alternate
backup solutions are going away or that we should deprecate the SQL
interface. If nothing else, third-party solutions need a way to get an
untorn copy of pg_control and in general I think the new flag will be
universally useful.

Regards,
-David

Attachments:

pgcontrol-flag-v3-01-basebackup.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v3-01-basebackup.patchDownload
From 77df2c779b57dde90f81d5244b55e4d0d644137d Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Thu, 3 Oct 2024 08:41:45 +0000
Subject: Add pg_control flag to prevent recovery without backup_label.

Harden recovery by adding a flag to pg_control to indicate that backup_label is
required. This prevents the user from deleting backup_label resulting in an
inconsistent recovery.

Another advantage is that the copy of pg_control used by pg_basebackup is
guaranteed not to be torn.

This functionality is limited to pg_basebackup (or any software comfortable
with modifying pg_control).

Control and catalog version bumps are required.
---
 doc/src/sgml/func.sgml                    |  5 +++++
 src/backend/access/transam/xlog.c         | 23 +++++++++++++++++++++++
 src/backend/access/transam/xlogrecovery.c | 10 +++++++++-
 src/backend/backup/basebackup.c           | 15 ++++++---------
 src/backend/utils/misc/pg_controldata.c   |  7 +++++--
 src/bin/pg_controldata/pg_controldata.c   |  2 ++
 src/bin/pg_resetwal/pg_resetwal.c         |  1 +
 src/bin/pg_rewind/pg_rewind.c             |  1 +
 src/include/access/xlog.h                 |  1 +
 src/include/catalog/pg_control.h          |  4 ++++
 src/include/catalog/pg_proc.dat           |  6 +++---
 src/test/recovery/t/002_archiving.pl      | 20 ++++++++++++++++++++
 12 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 7b4fbb5047..786269b920 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27899,6 +27899,11 @@ acl      | {postgres=arwdDxtm/postgres,foo=r/postgres}
        <entry><type>boolean</type></entry>
       </row>
 
+      <row>
+       <entry><structfield>backup_label_required</structfield></entry>
+       <entry><type>boolean</type></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9102c8d772..79065b55ef 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9412,6 +9412,29 @@ do_pg_abort_backup(int code, Datum arg)
 	}
 }
 
+/*
+ * Create a consistent copy of control data to be used for backup and update it
+ * to require a backup label for recovery. Also recalculate the CRC.
+ */
+void
+backup_control_file(uint8_t *controlFile)
+{
+	ControlFileData *controlData = ((ControlFileData *)controlFile);
+
+	memset(controlFile + sizeof(ControlFileData), 0,
+		   PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlData->backupLabelRequired = true;
+
+	INIT_CRC32C(controlData->crc);
+	COMP_CRC32C(controlData->crc, controlFile, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlData->crc);
+}
+
 /*
  * Register a handler that will warn about unterminated backups at end of
  * session, unless this has already been done.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 320b14add1..193b7e2045 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -704,7 +704,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
+		/*
+		 * No backup_label file has been found if we are here. Error if the
+		 * control file requires backup_label.
+		 */
+		if (ControlFile->backupLabelRequired)
+			ereport(FATAL,
+					(errmsg("could not find backup_label required for recovery"),
+					 errhint("backup_label must be present for recovery to succeed")));
 
 		/*
 		 * If tablespace_map file is present without backup_label file, there
@@ -977,6 +984,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
+			ControlFile->backupLabelRequired = false;
 
 			if (backupFromStandby)
 			{
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 14e5ba72e9..ce7fd76e48 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -23,6 +23,7 @@
 #include "backup/basebackup_incremental.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_tablespace_d.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
@@ -325,9 +326,9 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
+				uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 
 				bbsink_begin_archive(sink, "base.tar");
 
@@ -350,14 +351,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 						sendtblspclinks, &manifest, InvalidOid, ib);
 
 				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
+				backup_control_file(controlFile);
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *)controlFile, PG_CONTROL_FILE_SIZE,
+									&manifest);
 			}
 			else
 			{
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 98c932dc7b..9eaf3f8b9f 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -162,8 +162,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[6];
+	bool		nulls[6];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -195,6 +195,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
 	nulls[4] = false;
 
+	values[5] = BoolGetDatum(ControlFile->backupLabelRequired);
+	nulls[5] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca..8cc29904c6 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -283,6 +283,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("Backup label required:                %s\n"),
+		   ControlFile->backupLabelRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d8..7056752cff 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -875,6 +875,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.backupLabelRequired = false;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 960916a1e8..79a44b70ee 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -722,6 +722,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupLabelRequired = true;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067..9d5d8ed43c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -295,6 +295,7 @@ extern void do_pg_backup_start(const char *backupidstr, bool fast,
 							   StringInfo tblspcmapfile);
 extern void do_pg_backup_stop(BackupState *state, bool waitforarchive);
 extern void do_pg_abort_backup(int code, Datum arg);
+extern void backup_control_file(uint8_t *controlFile);
 extern void register_persistent_abort_backup_handler(void);
 extern SessionBackupState get_backup_status(void);
 
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e414..b471a9b02e 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -164,12 +164,16 @@ typedef struct ControlFileData
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
+	 *
+	 * If backupLabelRequired is true, then a backup_label file must be
+	 * present in order for recovery to succeed.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	bool		backupLabelRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 77f54a79e6..7cc5421ea1 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12119,9 +12119,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required,backup_label_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index bc447330e1..462f1dcf0d 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -41,6 +41,26 @@ $node_standby->append_conf(
 archive_cleanup_command = 'echo archive_cleanup_done > $archive_cleanup_command_file'
 recovery_end_command = 'echo recovery_ended_done > $recovery_end_command_file'
 ));
+
+# Rename backup_label to verify that recovery will not start without it
+rename("$data_dir/backup_label", "$data_dir/backup_label.tmp")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_standby->data_dir, '-l',
+		$node_standby->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_standby->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+# Restore backup_label so recovery proceeds normally
+rename("$data_dir/backup_label.tmp", "$data_dir/backup_label")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
 $node_standby->start;
 
 # Create some content on primary
-- 
2.34.1

pgcontrol-flag-v3-02-sql.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v3-02-sql.patchDownload
From dba5a2a57e5472682a9be6d4badf1e8127768610 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Thu, 3 Oct 2024 08:41:45 +0000
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the following
reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If backup_label
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

These changes have no impact on current backup software and they are free to use
the pg_control available from pg_stop_backup() or continue to use pg_control from
PGDATA. Of course they will miss the benefits of getting a consistent copy of
pg_control and the backup_label checking, but will be no worse off than before.

Catalog version bump is required.
---
 doc/src/sgml/backup.sgml                    | 18 +++++-
 doc/src/sgml/func.sgml                      |  5 +-
 src/backend/access/transam/xlogfuncs.c      | 17 ++++--
 src/backend/catalog/system_functions.sql    |  2 +-
 src/include/catalog/pg_proc.dat             |  4 +-
 src/test/recovery/t/042_low_level_backup.pl | 67 ++++++++++++++++++++-
 6 files changed, 98 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index e4e4c56cf1..2fcf181121 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1021,9 +1021,12 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      values. The second of these fields should be written to a file named
      <filename>backup_label</filename> in the root directory of the backup. The
      third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <filename>tablespace_map</filename> unless the field is empty. The fourth
+     field should be written into a file named
+     <filename>global/pg_control</filename> (overwriting the existing file when
+     present). These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
@@ -1095,7 +1098,16 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You should exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>controlfile</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This version of pg_control contains
+    safeguards against recovery without backup_label present and is guaranteed
+    not to be torn.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 786269b920..09204db7cd 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28581,8 +28581,9 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         The result of the function is a single record.
         The <parameter>lsn</parameter> column holds the backup's ending
         write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
+        column returns the contents of the backup label file, the third column
+        returns the contents of the tablespace map file, and the fourth column
+        returns the contents of pg_control.  These must be
         stored as part of the backup and are required as part of the restore
         process.
        </para>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 3e3d2bb618..a7d7c1bcbe 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -115,9 +115,11 @@ pg_backup_start(PG_FUNCTION_ARGS)
  *
  * The backup_label contains the user-supplied label string (typically this
  * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * time, starting WAL location for the dump and so on.  The pg_control file
+ * contains a consistent copy of pg_control that also has a safeguard against
+ * being used without backup_label.  It is the caller's responsibility to write
+ * the backup_label, pg_control, and tablespace_map files in the data folder
+ * that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -125,12 +127,13 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 4
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
 	char	   *backup_label;
+	bytea	   *pg_control_bytea;
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -152,9 +155,15 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Build the contents of backup_label */
 	backup_label = build_backup_content(backup_state, false);
 
+	/* Build the contents of pg_control */
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	backup_control_file((uint8_t *)VARDATA(pg_control_bytea));
+
 	values[0] = LSNGetDatum(backup_state->stoppoint);
 	values[1] = CStringGetTextDatum(backup_label);
 	values[2] = CStringGetTextDatum(tablespace_map->data);
+	values[3] = PointerGetDatum(pg_control_bytea);
 
 	/* Deallocate backup-related variables */
 	pfree(backup_label);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index b0d0de051e..758f90b2c3 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -390,7 +390,7 @@ CREATE OR REPLACE FUNCTION
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
         wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        OUT labelfile text, OUT spcmapfile text, OUT controlfile bytea)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7cc5421ea1..ea93241eed 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6569,8 +6569,8 @@
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,pg_lsn,text,text,bytea}', proargmodes => '{i,o,o,o,o}',
+  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile,controlfile}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index 61d23187e0..bd3a99960f 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -13,6 +13,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Decode hex to binary
+sub decode_hex
+{
+	my ($encoded) = @_;
+	my $decoded;
+
+	$encoded =~ s/^\s+|\s+$//g;
+
+	for (my $idx = 0; $idx < length($encoded); $idx += 2)
+	{
+		$decoded .= pack('C', hex(substr($encoded, $idx, 2)));
+	}
+
+	return $decoded;
+}
+
+# Get backup_label/pg_control from pg_stop_backup()
+sub stop_backup_result
+{
+	my ($psql) = @_;
+
+	my $encoded = $psql->query_safe(
+		"select encode(labelfile::bytea, 'hex') || ',' || " .
+		"       encode(controlfile, 'hex')" .
+		"  from pg_backup_stop()");
+
+	my @result;
+
+    foreach my $column (split(',', $encoded))
+	{
+		push(@result, decode_hex($column));
+	}
+
+	return @result;
+}
+
 # Start primary node with archiving.
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(has_archiving => 1, allows_streaming => 1);
@@ -80,8 +116,7 @@ my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
 # Stop backup and get backup_label, the last segment is archived.
-my $backup_label =
-  $psql->query_safe("select labelfile from pg_backup_stop()");
+(my $backup_label, my $pg_control) = stop_backup_result($psql);
 
 $psql->quit;
 
@@ -118,10 +153,36 @@ ok( $node_replica->log_contains(
 $node_replica->teardown_node;
 $node_replica->clean_node;
 
+# Save only pg_control into the backup to demonstrate that pg_control returned
+# from pg_stop_backup() will only perform recovery when backup_label is present.
+open(my $fh, ">", "$backup_dir/global/pg_control")
+  or die "could not open pg_control";
+binmode($fh);
+syswrite($fh, $pg_control);
+close($fh);
+
+$node_replica = PostgreSQL::Test::Cluster->new('replica_fail2');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_replica->data_dir, '-l',
+		$node_replica->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_replica->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+$node_replica->teardown_node;
+$node_replica->clean_node;
+
 # Save backup_label into the backup directory and recover using the primary's
 # archive.  This time recovery will succeed and the canary table will be
 # present.
-open my $fh, ">>", "$backup_dir/backup_label"
+open $fh, ">>", "$backup_dir/backup_label"
   or die "could not open backup_label";
 # Binary mode is required for Windows, as the backup_label parsing is not
 # able to cope with CRLFs.
-- 
2.34.1

#6David Steele
david@pgbackrest.org
In reply to: David Steele (#5)
2 attachment(s)
Re: Return pg_control from pg_backup_stop().

On 10/3/24 05:11, David Steele wrote:

On 10/3/24 07:45, Michael Paquier wrote:

1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol.  Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from.  The SQL part is optional IMO.  It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.

I don't think having incremental backup in pg_basebackup means alternate
backup solutions are going away or that we should deprecate the SQL
interface. If nothing else, third-party solutions need a way to get an
untorn copy of pg_control and in general I think the new flag will be
universally useful.

I updated this patch to fix an issue with -fsanitize=alignment. I'm not
entirely happy with copying twice but not sure of another way to do it.
As far as I can see VARDATA() will not return aligned data on 64-bit
architectures.

Regards,
-David

Attachments:

pgcontrol-flag-v4-02-sql.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v4-02-sql.patchDownload
From 659a1d9b6b1528e139741dc69146848ca15a07b9 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Wed, 20 Nov 2024 22:33:36 +0000
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the following
reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If backup_label
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

These changes have no impact on current backup software and they are free to use
the pg_control available from pg_stop_backup() or continue to use pg_control from
PGDATA. Of course they will miss the benefits of getting a consistent copy of
pg_control and the backup_label checking, but will be no worse off than before.

Catalog version bump is required.
---
 doc/src/sgml/backup.sgml                    | 18 +++++-
 doc/src/sgml/func.sgml                      |  5 +-
 src/backend/access/transam/xlogfuncs.c      | 20 ++++--
 src/backend/catalog/system_functions.sql    |  2 +-
 src/include/catalog/pg_proc.dat             |  4 +-
 src/test/recovery/t/042_low_level_backup.pl | 67 ++++++++++++++++++++-
 6 files changed, 101 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index e4e4c56cf14..2fcf1811213 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1021,9 +1021,12 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      values. The second of these fields should be written to a file named
      <filename>backup_label</filename> in the root directory of the backup. The
      third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <filename>tablespace_map</filename> unless the field is empty. The fourth
+     field should be written into a file named
+     <filename>global/pg_control</filename> (overwriting the existing file when
+     present). These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
@@ -1095,7 +1098,16 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You should exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>controlfile</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This version of pg_control contains
+    safeguards against recovery without backup_label present and is guaranteed
+    not to be torn.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index fa37a0469ae..b3d8c73295b 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28600,8 +28600,9 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         The result of the function is a single record.
         The <parameter>lsn</parameter> column holds the backup's ending
         write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
+        column returns the contents of the backup label file, the third column
+        returns the contents of the tablespace map file, and the fourth column
+        returns the contents of pg_control.  These must be
         stored as part of the backup and are required as part of the restore
         process.
        </para>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index b0c6d7c6875..7547bb6e28e 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -113,9 +113,11 @@ pg_backup_start(PG_FUNCTION_ARGS)
  *
  * The backup_label contains the user-supplied label string (typically this
  * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * time, starting WAL location for the dump and so on.  The pg_control file
+ * contains a consistent copy of pg_control that also has a safeguard against
+ * being used without backup_label.  It is the caller's responsibility to write
+ * the backup_label, pg_control, and tablespace_map files in the data folder
+ * that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -123,12 +125,14 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 4
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
 	char	   *backup_label;
+	bytea	   *pg_control_bytea;
+	uint8_t		pg_control[PG_CONTROL_FILE_SIZE];
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -150,9 +154,17 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Build the contents of backup_label */
 	backup_label = build_backup_content(backup_state, false);
 
+	/* Build the contents of pg_control */
+	backup_control_file(pg_control);
+
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), pg_control, PG_CONTROL_FILE_SIZE);
+
 	values[0] = LSNGetDatum(backup_state->stoppoint);
 	values[1] = CStringGetTextDatum(backup_label);
 	values[2] = CStringGetTextDatum(tablespace_map->data);
+	values[3] = PointerGetDatum(pg_control_bytea);
 
 	/* Deallocate backup-related variables */
 	pfree(backup_label);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index c51dfca802f..6a29b9b3cf3 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -390,7 +390,7 @@ CREATE OR REPLACE FUNCTION
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
         wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        OUT labelfile text, OUT spcmapfile text, OUT controlfile bytea)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 8d9d584735e..6f65cc7c777 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6589,8 +6589,8 @@
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,pg_lsn,text,text,bytea}', proargmodes => '{i,o,o,o,o}',
+  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile,controlfile}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index 61d23187e0f..bd3a99960f5 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -13,6 +13,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Decode hex to binary
+sub decode_hex
+{
+	my ($encoded) = @_;
+	my $decoded;
+
+	$encoded =~ s/^\s+|\s+$//g;
+
+	for (my $idx = 0; $idx < length($encoded); $idx += 2)
+	{
+		$decoded .= pack('C', hex(substr($encoded, $idx, 2)));
+	}
+
+	return $decoded;
+}
+
+# Get backup_label/pg_control from pg_stop_backup()
+sub stop_backup_result
+{
+	my ($psql) = @_;
+
+	my $encoded = $psql->query_safe(
+		"select encode(labelfile::bytea, 'hex') || ',' || " .
+		"       encode(controlfile, 'hex')" .
+		"  from pg_backup_stop()");
+
+	my @result;
+
+    foreach my $column (split(',', $encoded))
+	{
+		push(@result, decode_hex($column));
+	}
+
+	return @result;
+}
+
 # Start primary node with archiving.
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(has_archiving => 1, allows_streaming => 1);
@@ -80,8 +116,7 @@ my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
 # Stop backup and get backup_label, the last segment is archived.
-my $backup_label =
-  $psql->query_safe("select labelfile from pg_backup_stop()");
+(my $backup_label, my $pg_control) = stop_backup_result($psql);
 
 $psql->quit;
 
@@ -118,10 +153,36 @@ ok( $node_replica->log_contains(
 $node_replica->teardown_node;
 $node_replica->clean_node;
 
+# Save only pg_control into the backup to demonstrate that pg_control returned
+# from pg_stop_backup() will only perform recovery when backup_label is present.
+open(my $fh, ">", "$backup_dir/global/pg_control")
+  or die "could not open pg_control";
+binmode($fh);
+syswrite($fh, $pg_control);
+close($fh);
+
+$node_replica = PostgreSQL::Test::Cluster->new('replica_fail2');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_replica->data_dir, '-l',
+		$node_replica->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_replica->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+$node_replica->teardown_node;
+$node_replica->clean_node;
+
 # Save backup_label into the backup directory and recover using the primary's
 # archive.  This time recovery will succeed and the canary table will be
 # present.
-open my $fh, ">>", "$backup_dir/backup_label"
+open $fh, ">>", "$backup_dir/backup_label"
   or die "could not open backup_label";
 # Binary mode is required for Windows, as the backup_label parsing is not
 # able to cope with CRLFs.
-- 
2.34.1

pgcontrol-flag-v4-01-basebackup.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v4-01-basebackup.patchDownload
From 005365fd1f87a03b4e4de8485be0ed6af9127acc Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Wed, 20 Nov 2024 22:33:36 +0000
Subject: Add pg_control flag to prevent recovery without backup_label.

Harden recovery by adding a flag to pg_control to indicate that backup_label is
required. This prevents the user from deleting backup_label resulting in an
inconsistent recovery.

Another advantage is that the copy of pg_control used by pg_basebackup is
guaranteed not to be torn.

This functionality is limited to pg_basebackup (or any software comfortable
with modifying pg_control).

Control and catalog version bumps are required.
---
 doc/src/sgml/func.sgml                    |  5 +++++
 src/backend/access/transam/xlog.c         | 23 +++++++++++++++++++++++
 src/backend/access/transam/xlogrecovery.c | 10 +++++++++-
 src/backend/backup/basebackup.c           | 15 ++++++---------
 src/backend/utils/misc/pg_controldata.c   |  7 +++++--
 src/bin/pg_controldata/pg_controldata.c   |  2 ++
 src/bin/pg_resetwal/pg_resetwal.c         |  1 +
 src/bin/pg_rewind/pg_rewind.c             |  1 +
 src/include/access/xlog.h                 |  1 +
 src/include/catalog/pg_control.h          |  4 ++++
 src/include/catalog/pg_proc.dat           |  6 +++---
 src/test/recovery/t/002_archiving.pl      | 20 ++++++++++++++++++++
 12 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 72f223a0414..fa37a0469ae 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27918,6 +27918,11 @@ acl      | {postgres=arwdDxtm/postgres,foo=r/postgres}
        <entry><type>boolean</type></entry>
       </row>
 
+      <row>
+       <entry><structfield>backup_label_required</structfield></entry>
+       <entry><type>boolean</type></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6f58412bcab..2a2c80a6f74 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9413,6 +9413,29 @@ do_pg_abort_backup(int code, Datum arg)
 	}
 }
 
+/*
+ * Create a consistent copy of control data to be used for backup and update it
+ * to require a backup label for recovery. Also recalculate the CRC.
+ */
+void
+backup_control_file(uint8_t *controlFile)
+{
+	ControlFileData *controlData = ((ControlFileData *)controlFile);
+
+	memset(controlFile + sizeof(ControlFileData), 0,
+		   PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlData->backupLabelRequired = true;
+
+	INIT_CRC32C(controlData->crc);
+	COMP_CRC32C(controlData->crc, controlFile, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlData->crc);
+}
+
 /*
  * Register a handler that will warn about unterminated backups at end of
  * session, unless this has already been done.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 05c738d6614..f641cb1a8ad 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -703,7 +703,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
+		/*
+		 * No backup_label file has been found if we are here. Error if the
+		 * control file requires backup_label.
+		 */
+		if (ControlFile->backupLabelRequired)
+			ereport(FATAL,
+					(errmsg("could not find backup_label required for recovery"),
+					 errhint("backup_label must be present for recovery to succeed")));
 
 		/*
 		 * If tablespace_map file is present without backup_label file, there
@@ -976,6 +983,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
+			ControlFile->backupLabelRequired = false;
 
 			if (backupFromStandby)
 			{
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index e2ed9081d1c..b673bc10bc3 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -23,6 +23,7 @@
 #include "backup/basebackup_incremental.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_tablespace_d.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
@@ -326,9 +327,9 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
+				uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 
 				bbsink_begin_archive(sink, "base.tar");
 
@@ -351,14 +352,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 						sendtblspclinks, &manifest, InvalidOid, ib);
 
 				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
+				backup_control_file(controlFile);
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *)controlFile, PG_CONTROL_FILE_SIZE,
+									&manifest);
 			}
 			else
 			{
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 98c932dc7bd..9eaf3f8b9f6 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -162,8 +162,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[6];
+	bool		nulls[6];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -195,6 +195,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
 	nulls[4] = false;
 
+	values[5] = BoolGetDatum(ControlFile->backupLabelRequired);
+	nulls[5] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca7..8cc29904c6b 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -283,6 +283,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("Backup label required:                %s\n"),
+		   ControlFile->backupLabelRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e9dcb5a6d89..7056752cff0 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -875,6 +875,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.backupLabelRequired = false;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index c4fe4e37040..b9faa51661e 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -725,6 +725,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupLabelRequired = true;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 34ad46c067b..9d5d8ed43c2 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -295,6 +295,7 @@ extern void do_pg_backup_start(const char *backupidstr, bool fast,
 							   StringInfo tblspcmapfile);
 extern void do_pg_backup_stop(BackupState *state, bool waitforarchive);
 extern void do_pg_abort_backup(int code, Datum arg);
+extern void backup_control_file(uint8_t *controlFile);
 extern void register_persistent_abort_backup_handler(void);
 extern SessionBackupState get_backup_status(void);
 
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e80ff8e4140..b471a9b02e7 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -164,12 +164,16 @@ typedef struct ControlFileData
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
+	 *
+	 * If backupLabelRequired is true, then a backup_label file must be
+	 * present in order for recovery to succeed.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	bool		backupLabelRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index cbbe8acd382..8d9d584735e 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12139,9 +12139,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required,backup_label_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index bc447330e15..462f1dcf0d6 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -41,6 +41,26 @@ $node_standby->append_conf(
 archive_cleanup_command = 'echo archive_cleanup_done > $archive_cleanup_command_file'
 recovery_end_command = 'echo recovery_ended_done > $recovery_end_command_file'
 ));
+
+# Rename backup_label to verify that recovery will not start without it
+rename("$data_dir/backup_label", "$data_dir/backup_label.tmp")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_standby->data_dir, '-l',
+		$node_standby->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_standby->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+# Restore backup_label so recovery proceeds normally
+rename("$data_dir/backup_label.tmp", "$data_dir/backup_label")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
 $node_standby->start;
 
 # Create some content on primary
-- 
2.34.1

#7David Steele
david@pgbackrest.org
In reply to: David Steele (#6)
2 attachment(s)
Re: Return pg_control from pg_backup_stop().

On 11/20/24 17:44, David Steele wrote:

On 10/3/24 05:11, David Steele wrote:

On 10/3/24 07:45, Michael Paquier wrote:

1) is something that has more value than 2), IMO, because there is no
need for a manual step when a backup is taken by the replication
protocol.  Well, custom backup solutions that rely on the replication
protocol to copy the data would need to make sure that they have a
backup_label, but that's something they should do anyway and what this
patch wants to protect users from.  The SQL part is optional IMO.  It
can be done, but it has less impact overall and makes backups more
complicated by requiring the manual copy of the control file.

I don't think having incremental backup in pg_basebackup means
alternate backup solutions are going away or that we should deprecate
the SQL interface. If nothing else, third-party solutions need a way
to get an untorn copy of pg_control and in general I think the new
flag will be universally useful.

I updated this patch to fix an issue with -fsanitize=alignment. I'm not
entirely happy with copying twice but not sure of another way to do it.
As far as I can see VARDATA() will not return aligned data on 64-bit
architectures.

Rebased and improved a comment and an error.

Regards,
-David

Attachments:

pgcontrol-flag-v5-02-sql.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v5-02-sql.patchDownload
From b229466927f4c17c2c058dbaca92b8ec0429bd80 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Fri, 24 Jan 2025 18:38:09 +0000
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the following
reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If backup_label
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

These changes have no impact on current backup software and they are free to use
the pg_control available from pg_stop_backup() or continue to use pg_control from
PGDATA. Of course they will miss the benefits of getting a consistent copy of
pg_control and the backup_label checking, but will be no worse off than before.

Catalog version bump is required.
---
 doc/src/sgml/backup.sgml                    | 18 +++++-
 doc/src/sgml/func.sgml                      |  5 +-
 src/backend/access/transam/xlogfuncs.c      | 20 ++++--
 src/backend/catalog/system_functions.sql    |  2 +-
 src/include/catalog/pg_proc.dat             |  4 +-
 src/test/recovery/t/042_low_level_backup.pl | 67 ++++++++++++++++++++-
 6 files changed, 101 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index e4e4c56cf14..2fcf1811213 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1021,9 +1021,12 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      values. The second of these fields should be written to a file named
      <filename>backup_label</filename> in the root directory of the backup. The
      third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <filename>tablespace_map</filename> unless the field is empty. The fourth
+     field should be written into a file named
+     <filename>global/pg_control</filename> (overwriting the existing file when
+     present). These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
@@ -1095,7 +1098,16 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You should exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>controlfile</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This version of pg_control contains
+    safeguards against recovery without backup_label present and is guaranteed
+    not to be torn.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 86c946905aa..61907e088d7 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -28671,8 +28671,9 @@ LOG:  Grand total: 1651920 bytes in 201 blocks; 622360 free (88 chunks); 1029560
         The result of the function is a single record.
         The <parameter>lsn</parameter> column holds the backup's ending
         write-ahead log location (which again can be ignored).  The second
-        column returns the contents of the backup label file, and the third
-        column returns the contents of the tablespace map file.  These must be
+        column returns the contents of the backup label file, the third column
+        returns the contents of the tablespace map file, and the fourth column
+        returns the contents of pg_control.  These must be
         stored as part of the backup and are required as part of the restore
         process.
        </para>
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 8c3090165f0..e805f231c69 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -113,9 +113,11 @@ pg_backup_start(PG_FUNCTION_ARGS)
  *
  * The backup_label contains the user-supplied label string (typically this
  * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * time, starting WAL location for the dump and so on.  The pg_control file
+ * contains a consistent copy of pg_control that also has a safeguard against
+ * being used without backup_label.  It is the caller's responsibility to write
+ * the backup_label, pg_control, and tablespace_map files in the data folder
+ * that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -123,12 +125,14 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 4
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
 	char	   *backup_label;
+	bytea	   *pg_control_bytea;
+	uint8_t		pg_control[PG_CONTROL_FILE_SIZE];
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -150,9 +154,17 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Build the contents of backup_label */
 	backup_label = build_backup_content(backup_state, false);
 
+	/* Build the contents of pg_control */
+	backup_control_file(pg_control);
+
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), pg_control, PG_CONTROL_FILE_SIZE);
+
 	values[0] = LSNGetDatum(backup_state->stoppoint);
 	values[1] = CStringGetTextDatum(backup_label);
 	values[2] = CStringGetTextDatum(tablespace_map->data);
+	values[3] = PointerGetDatum(pg_control_bytea);
 
 	/* Deallocate backup-related variables */
 	pfree(backup_label);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 591157b1d1b..fa635a698a3 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -390,7 +390,7 @@ CREATE OR REPLACE FUNCTION
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
         wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        OUT labelfile text, OUT spcmapfile text, OUT controlfile bytea)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 0416569b823..dbd8a89d9d9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6602,8 +6602,8 @@
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,pg_lsn,text,text,bytea}', proargmodes => '{i,o,o,o,o}',
+  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile,controlfile}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index 5749a1df533..a08a9a9d9df 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -13,6 +13,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Decode hex to binary
+sub decode_hex
+{
+	my ($encoded) = @_;
+	my $decoded;
+
+	$encoded =~ s/^\s+|\s+$//g;
+
+	for (my $idx = 0; $idx < length($encoded); $idx += 2)
+	{
+		$decoded .= pack('C', hex(substr($encoded, $idx, 2)));
+	}
+
+	return $decoded;
+}
+
+# Get backup_label/pg_control from pg_stop_backup()
+sub stop_backup_result
+{
+	my ($psql) = @_;
+
+	my $encoded = $psql->query_safe(
+		"select encode(labelfile::bytea, 'hex') || ',' || " .
+		"       encode(controlfile, 'hex')" .
+		"  from pg_backup_stop()");
+
+	my @result;
+
+    foreach my $column (split(',', $encoded))
+	{
+		push(@result, decode_hex($column));
+	}
+
+	return @result;
+}
+
 # Start primary node with archiving.
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(has_archiving => 1, allows_streaming => 1);
@@ -80,8 +116,7 @@ my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
 # Stop backup and get backup_label, the last segment is archived.
-my $backup_label =
-  $psql->query_safe("select labelfile from pg_backup_stop()");
+(my $backup_label, my $pg_control) = stop_backup_result($psql);
 
 $psql->quit;
 
@@ -118,10 +153,36 @@ ok( $node_replica->log_contains(
 $node_replica->teardown_node;
 $node_replica->clean_node;
 
+# Save only pg_control into the backup to demonstrate that pg_control returned
+# from pg_stop_backup() will only perform recovery when backup_label is present.
+open(my $fh, ">", "$backup_dir/global/pg_control")
+  or die "could not open pg_control";
+binmode($fh);
+syswrite($fh, $pg_control);
+close($fh);
+
+$node_replica = PostgreSQL::Test::Cluster->new('replica_fail2');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_replica->data_dir, '-l',
+		$node_replica->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_replica->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+$node_replica->teardown_node;
+$node_replica->clean_node;
+
 # Save backup_label into the backup directory and recover using the primary's
 # archive.  This time recovery will succeed and the canary table will be
 # present.
-open my $fh, ">>", "$backup_dir/backup_label"
+open $fh, ">>", "$backup_dir/backup_label"
   or die "could not open backup_label";
 # Binary mode is required for Windows, as the backup_label parsing is not
 # able to cope with CRLFs.
-- 
2.34.1

pgcontrol-flag-v5-01-basebackup.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v5-01-basebackup.patchDownload
From 8d9c268780c2128bb96d1bbdb2762aeca5b2393c Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Fri, 24 Jan 2025 18:38:08 +0000
Subject: Add pg_control flag to prevent recovery without backup_label.

Harden recovery by adding a flag to pg_control to indicate that backup_label is
required. This prevents the user from deleting backup_label resulting in an
inconsistent recovery.

Another advantage is that the copy of pg_control used by pg_basebackup is
guaranteed not to be torn.

This functionality is limited to pg_basebackup (or any software comfortable
with modifying pg_control).

Control and catalog version bumps are required.
---
 doc/src/sgml/func.sgml                    |  5 +++++
 src/backend/access/transam/xlog.c         | 23 +++++++++++++++++++++++
 src/backend/access/transam/xlogrecovery.c | 10 +++++++++-
 src/backend/backup/basebackup.c           | 15 ++++++---------
 src/backend/utils/misc/pg_controldata.c   |  7 +++++--
 src/bin/pg_controldata/pg_controldata.c   |  2 ++
 src/bin/pg_resetwal/pg_resetwal.c         |  1 +
 src/bin/pg_rewind/pg_rewind.c             |  1 +
 src/include/access/xlog.h                 |  1 +
 src/include/catalog/pg_control.h          |  4 ++++
 src/include/catalog/pg_proc.dat           |  6 +++---
 src/test/recovery/t/002_archiving.pl      | 20 ++++++++++++++++++++
 12 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/func.sgml b/doc/src/sgml/func.sgml
index 5678e7621a5..86c946905aa 100644
--- a/doc/src/sgml/func.sgml
+++ b/doc/src/sgml/func.sgml
@@ -27989,6 +27989,11 @@ acl      | {postgres=arwdDxtm/postgres,foo=r/postgres}
        <entry><type>boolean</type></entry>
       </row>
 
+      <row>
+       <entry><structfield>backup_label_required</structfield></entry>
+       <entry><type>boolean</type></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..9285a4a5c72 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9413,6 +9413,29 @@ do_pg_abort_backup(int code, Datum arg)
 	}
 }
 
+/*
+ * Create a consistent copy of control data to be used for backup and update it
+ * to require a backup label for recovery. Also recalculate the CRC.
+ */
+void
+backup_control_file(uint8_t *controlFile)
+{
+	ControlFileData *controlData = ((ControlFileData *)controlFile);
+
+	memset(controlFile + sizeof(ControlFileData), 0,
+		   PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlData->backupLabelRequired = true;
+
+	INIT_CRC32C(controlData->crc);
+	COMP_CRC32C(controlData->crc, controlFile, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlData->crc);
+}
+
 /*
  * Register a handler that will warn about unterminated backups at end of
  * session, unless this has already been done.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..72ced660f88 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -703,7 +703,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
+		/*
+		 * No backup_label file has been found if we are here. Error if the
+		 * control file requires backup_label.
+		 */
+		if (ControlFile->backupLabelRequired)
+			ereport(FATAL,
+					(errmsg("could not find backup_label required for recovery"),
+					 errhint("backup_label must be present for recovery to proceed")));
 
 		/*
 		 * If tablespace_map file is present without backup_label file, there
@@ -976,6 +983,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
+			ControlFile->backupLabelRequired = false;
 
 			if (backupFromStandby)
 			{
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index 3f8a3c55725..a457f5714d1 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -23,6 +23,7 @@
 #include "backup/basebackup_incremental.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_tablespace_d.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
@@ -326,9 +327,9 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
+				uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 
 				bbsink_begin_archive(sink, "base.tar");
 
@@ -351,14 +352,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 						sendtblspclinks, &manifest, InvalidOid, ib);
 
 				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
+				backup_control_file(controlFile);
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *)controlFile, PG_CONTROL_FILE_SIZE,
+									&manifest);
 			}
 			else
 			{
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 9dfba499c13..983be8496a2 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -162,8 +162,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[6];
+	bool		nulls[6];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -195,6 +195,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
 	nulls[4] = false;
 
+	values[5] = BoolGetDatum(ControlFile->backupLabelRequired);
+	nulls[5] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 93a05d80ca7..8cc29904c6b 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -283,6 +283,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("Backup label required:                %s\n"),
+		   ControlFile->backupLabelRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index ed73607a46f..9fb3bbebc42 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -875,6 +875,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.backupLabelRequired = false;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index cae81cd6cb1..26fcdc1cd71 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -725,6 +725,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupLabelRequired = true;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4411c1468ac..59c80b65165 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -295,6 +295,7 @@ extern void do_pg_backup_start(const char *backupidstr, bool fast,
 							   StringInfo tblspcmapfile);
 extern void do_pg_backup_stop(BackupState *state, bool waitforarchive);
 extern void do_pg_abort_backup(int code, Datum arg);
+extern void backup_control_file(uint8_t *controlFile);
 extern void register_persistent_abort_backup_handler(void);
 extern SessionBackupState get_backup_status(void);
 
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 3797f25b306..b7d864e103a 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -164,12 +164,16 @@ typedef struct ControlFileData
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
+	 *
+	 * If backupLabelRequired is true, then a backup_label file must be
+	 * present in order for recovery to proceed.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	bool		backupLabelRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 18560755d26..0416569b823 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12172,9 +12172,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required,backup_label_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index 3acdb9ff1eb..153c658c123 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -41,6 +41,26 @@ $node_standby->append_conf(
 archive_cleanup_command = 'echo archive_cleanup_done > $archive_cleanup_command_file'
 recovery_end_command = 'echo recovery_ended_done > $recovery_end_command_file'
 ));
+
+# Rename backup_label to verify that recovery will not start without it
+rename("$data_dir/backup_label", "$data_dir/backup_label.tmp")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_standby->data_dir, '-l',
+		$node_standby->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_standby->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+# Restore backup_label so recovery proceeds normally
+rename("$data_dir/backup_label.tmp", "$data_dir/backup_label")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
 $node_standby->start;
 
 # Create some content on primary
-- 
2.34.1

#8David Steele
david@pgbackrest.org
In reply to: David Steele (#7)
2 attachment(s)
Re: Return pg_control from pg_backup_stop().

On 1/24/25 13:43, David Steele wrote:

Rebased and improved a comment and an error.

Rebased to fix breakage caused by the split of func.sgml in 4e23c9e.

Regards,
-David

Attachments:

pgcontrol-flag-v6-01-basebackup.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v6-01-basebackup.patchDownload
From c44525f64f6a2eed4f2636db70e255237244d5b4 Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Wed, 6 Aug 2025 22:25:41 +0000
Subject: Add pg_control flag to prevent recovery without backup_label.

Harden recovery by adding a flag to pg_control to indicate that backup_label is
required. This prevents the user from deleting backup_label resulting in an
inconsistent recovery.

Another advantage is that the copy of pg_control used by pg_basebackup is
guaranteed not to be torn.

This functionality is limited to pg_basebackup (or any software comfortable
with modifying pg_control).

Control and catalog version bumps are required.
---
 doc/src/sgml/func/func-info.sgml          |  5 +++++
 src/backend/access/transam/xlog.c         | 23 +++++++++++++++++++++++
 src/backend/access/transam/xlogrecovery.c | 10 +++++++++-
 src/backend/backup/basebackup.c           | 15 ++++++---------
 src/backend/utils/misc/pg_controldata.c   |  7 +++++--
 src/bin/pg_controldata/pg_controldata.c   |  2 ++
 src/bin/pg_resetwal/pg_resetwal.c         |  1 +
 src/bin/pg_rewind/pg_rewind.c             |  1 +
 src/include/access/xlog.h                 |  1 +
 src/include/catalog/pg_control.h          |  4 ++++
 src/include/catalog/pg_proc.dat           |  6 +++---
 src/test/recovery/t/002_archiving.pl      | 20 ++++++++++++++++++++
 12 files changed, 80 insertions(+), 15 deletions(-)

diff --git a/doc/src/sgml/func/func-info.sgml b/doc/src/sgml/func/func-info.sgml
index b507bfaf64b..f185c80cec2 100644
--- a/doc/src/sgml/func/func-info.sgml
+++ b/doc/src/sgml/func/func-info.sgml
@@ -3588,6 +3588,11 @@ acl      | {postgres=arwdDxtm/postgres,foo=r/postgres}
        <entry><type>boolean</type></entry>
       </row>
 
+      <row>
+       <entry><structfield>backup_label_required</structfield></entry>
+       <entry><type>boolean</type></entry>
+      </row>
+
      </tbody>
     </tgroup>
    </table>
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9a4de1616bc..04922836cbc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9579,6 +9579,29 @@ do_pg_abort_backup(int code, Datum arg)
 	}
 }
 
+/*
+ * Create a consistent copy of control data to be used for backup and update it
+ * to require a backup label for recovery. Also recalculate the CRC.
+ */
+void
+backup_control_file(uint8_t *controlFile)
+{
+	ControlFileData *controlData = ((ControlFileData *)controlFile);
+
+	memset(controlFile + sizeof(ControlFileData), 0,
+		   PG_CONTROL_FILE_SIZE - sizeof(ControlFileData));
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	memcpy(controlFile, ControlFile, sizeof(ControlFileData));
+	LWLockRelease(ControlFileLock);
+
+	controlData->backupLabelRequired = true;
+
+	INIT_CRC32C(controlData->crc);
+	COMP_CRC32C(controlData->crc, controlFile, offsetof(ControlFileData, crc));
+	FIN_CRC32C(controlData->crc);
+}
+
 /*
  * Register a handler that will warn about unterminated backups at end of
  * session, unless this has already been done.
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index f23ec8969c2..9efbafce335 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -709,7 +709,14 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	}
 	else
 	{
-		/* No backup_label file has been found if we are here. */
+		/*
+		 * No backup_label file has been found if we are here. Error if the
+		 * control file requires backup_label.
+		 */
+		if (ControlFile->backupLabelRequired)
+			ereport(FATAL,
+					(errmsg("could not find backup_label required for recovery"),
+					 errhint("backup_label must be present for recovery to proceed")));
 
 		/*
 		 * If tablespace_map file is present without backup_label file, there
@@ -984,6 +991,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		{
 			ControlFile->backupStartPoint = checkPoint.redo;
 			ControlFile->backupEndRequired = backupEndRequired;
+			ControlFile->backupLabelRequired = false;
 
 			if (backupFromStandby)
 			{
diff --git a/src/backend/backup/basebackup.c b/src/backend/backup/basebackup.c
index bb7d90aa5d9..b2b6b259b69 100644
--- a/src/backend/backup/basebackup.c
+++ b/src/backend/backup/basebackup.c
@@ -23,6 +23,7 @@
 #include "backup/basebackup_incremental.h"
 #include "backup/basebackup_sink.h"
 #include "backup/basebackup_target.h"
+#include "catalog/pg_control.h"
 #include "catalog/pg_tablespace_d.h"
 #include "commands/defrem.h"
 #include "common/compression.h"
@@ -326,9 +327,9 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 
 			if (ti->path == NULL)
 			{
-				struct stat statbuf;
 				bool		sendtblspclinks = true;
 				char	   *backup_label;
+				uint8_t controlFile[PG_CONTROL_FILE_SIZE];
 
 				bbsink_begin_archive(sink, "base.tar");
 
@@ -351,14 +352,10 @@ perform_base_backup(basebackup_options *opt, bbsink *sink,
 						sendtblspclinks, &manifest, InvalidOid, ib);
 
 				/* ... and pg_control after everything else. */
-				if (lstat(XLOG_CONTROL_FILE, &statbuf) != 0)
-					ereport(ERROR,
-							(errcode_for_file_access(),
-							 errmsg("could not stat file \"%s\": %m",
-									XLOG_CONTROL_FILE)));
-				sendFile(sink, XLOG_CONTROL_FILE, XLOG_CONTROL_FILE, &statbuf,
-						 false, InvalidOid, InvalidOid,
-						 InvalidRelFileNumber, 0, &manifest, 0, NULL, 0);
+				backup_control_file(controlFile);
+				sendFileWithContent(sink, XLOG_CONTROL_FILE,
+									(char *)controlFile, PG_CONTROL_FILE_SIZE,
+									&manifest);
 			}
 			else
 			{
diff --git a/src/backend/utils/misc/pg_controldata.c b/src/backend/utils/misc/pg_controldata.c
index 6d036e3bf32..d0ebd8e98a7 100644
--- a/src/backend/utils/misc/pg_controldata.c
+++ b/src/backend/utils/misc/pg_controldata.c
@@ -162,8 +162,8 @@ pg_control_checkpoint(PG_FUNCTION_ARGS)
 Datum
 pg_control_recovery(PG_FUNCTION_ARGS)
 {
-	Datum		values[5];
-	bool		nulls[5];
+	Datum		values[6];
+	bool		nulls[6];
 	TupleDesc	tupdesc;
 	HeapTuple	htup;
 	ControlFileData *ControlFile;
@@ -195,6 +195,9 @@ pg_control_recovery(PG_FUNCTION_ARGS)
 	values[4] = BoolGetDatum(ControlFile->backupEndRequired);
 	nulls[4] = false;
 
+	values[5] = BoolGetDatum(ControlFile->backupLabelRequired);
+	nulls[5] = false;
+
 	htup = heap_form_tuple(tupdesc, values, nulls);
 
 	PG_RETURN_DATUM(HeapTupleGetDatum(htup));
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 10de058ce91..16046b3f094 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -294,6 +294,8 @@ main(int argc, char *argv[])
 		   LSN_FORMAT_ARGS(ControlFile->backupEndPoint));
 	printf(_("End-of-backup record required:        %s\n"),
 		   ControlFile->backupEndRequired ? _("yes") : _("no"));
+	printf(_("Backup label required:                %s\n"),
+		   ControlFile->backupLabelRequired ? _("yes") : _("no"));
 	printf(_("wal_level setting:                    %s\n"),
 		   wal_level_str(ControlFile->wal_level));
 	printf(_("wal_log_hints setting:                %s\n"),
diff --git a/src/bin/pg_resetwal/pg_resetwal.c b/src/bin/pg_resetwal/pg_resetwal.c
index e876f35f38e..be23e50d1c1 100644
--- a/src/bin/pg_resetwal/pg_resetwal.c
+++ b/src/bin/pg_resetwal/pg_resetwal.c
@@ -899,6 +899,7 @@ RewriteControlFile(void)
 	ControlFile.backupStartPoint = 0;
 	ControlFile.backupEndPoint = 0;
 	ControlFile.backupEndRequired = false;
+	ControlFile.backupLabelRequired = false;
 
 	/*
 	 * Force the defaults for max_* settings. The values don't really matter
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0c68dd4235e..d365132eff6 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -727,6 +727,7 @@ perform_rewind(filemap_t *filemap, rewind_source *source,
 	ControlFile_new.minRecoveryPoint = endrec;
 	ControlFile_new.minRecoveryPointTLI = endtli;
 	ControlFile_new.state = DB_IN_ARCHIVE_RECOVERY;
+	ControlFile_new.backupLabelRequired = true;
 	if (!dry_run)
 		update_controlfile(datadir_target, &ControlFile_new, do_sync);
 }
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d12798be3d8..4d0137096d4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -295,6 +295,7 @@ extern void do_pg_backup_start(const char *backupidstr, bool fast,
 							   StringInfo tblspcmapfile);
 extern void do_pg_backup_stop(BackupState *state, bool waitforarchive);
 extern void do_pg_abort_backup(int code, Datum arg);
+extern void backup_control_file(uint8_t *controlFile);
 extern void register_persistent_abort_backup_handler(void);
 extern SessionBackupState get_backup_status(void);
 
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 63e834a6ce4..84df7ee6f50 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -164,12 +164,16 @@ typedef struct ControlFileData
 	 * If backupEndRequired is true, we know for sure that we're restoring
 	 * from a backup, and must see a backup-end record before we can safely
 	 * start up.
+	 *
+	 * If backupLabelRequired is true, then a backup_label file must be
+	 * present in order for recovery to proceed.
 	 */
 	XLogRecPtr	minRecoveryPoint;
 	TimeLineID	minRecoveryPointTLI;
 	XLogRecPtr	backupStartPoint;
 	XLogRecPtr	backupEndPoint;
 	bool		backupEndRequired;
+	bool		backupLabelRequired;
 
 	/*
 	 * Parameter settings that determine if the WAL can be used for archival
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 118d6da1ace..7d05cf89eec 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -12326,9 +12326,9 @@
 { oid => '3443',
   descr => 'pg_controldata recovery state information as a function',
   proname => 'pg_control_recovery', provolatile => 'v', prorettype => 'record',
-  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool}',
-  proargmodes => '{o,o,o,o,o}',
-  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required}',
+  proargtypes => '', proallargtypes => '{pg_lsn,int4,pg_lsn,pg_lsn,bool,bool}',
+  proargmodes => '{o,o,o,o,o,o}',
+  proargnames => '{min_recovery_end_lsn,min_recovery_end_timeline,backup_start_lsn,backup_end_lsn,end_of_backup_record_required,backup_label_required}',
   prosrc => 'pg_control_recovery' },
 
 { oid => '3444',
diff --git a/src/test/recovery/t/002_archiving.pl b/src/test/recovery/t/002_archiving.pl
index 3acdb9ff1eb..153c658c123 100644
--- a/src/test/recovery/t/002_archiving.pl
+++ b/src/test/recovery/t/002_archiving.pl
@@ -41,6 +41,26 @@ $node_standby->append_conf(
 archive_cleanup_command = 'echo archive_cleanup_done > $archive_cleanup_command_file'
 recovery_end_command = 'echo recovery_ended_done > $recovery_end_command_file'
 ));
+
+# Rename backup_label to verify that recovery will not start without it
+rename("$data_dir/backup_label", "$data_dir/backup_label.tmp")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_standby->data_dir, '-l',
+		$node_standby->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_standby->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+# Restore backup_label so recovery proceeds normally
+rename("$data_dir/backup_label.tmp", "$data_dir/backup_label")
+  or BAIL_OUT "could not move $data_dir/backup_label";
+
 $node_standby->start;
 
 # Create some content on primary
-- 
2.34.1

pgcontrol-flag-v6-02-sql.patchtext/plain; charset=UTF-8; name=pgcontrol-flag-v6-02-sql.patchDownload
From 8e5f815ccfa313d7c59649e9baa473b63c21862b Mon Sep 17 00:00:00 2001
From: David Steele <david@pgmasters.net>
Date: Wed, 6 Aug 2025 22:25:42 +0000
Subject: Return pg_control from pg_backup_stop().

Harden recovery by returning a copy of pg_control from pg_backup_stop() that has
a flag set to prevent recovery if the backup_label file is missing. Instead of
backup software copying pg_control from PGDATA, it stores an updated version
that is returned from pg_backup_stop(). This is better for the following
reasons:

* The user can no longer remove backup_label and get what looks like a
successful recovery (while almost certainly causing corruption). If backup_label
is removed the cluster will not start. The user may try pg_resetwal, but that
tool makes it pretty clear that corruption will result from its use.

* We don't need to worry about backup software seeing a torn copy of pg_control,
since Postgres can safely read it out of memory and provide a valid copy via
pg_backup_stop(). This solves torn reads without needing to write pg_control via
a temp file, which may affect performance on a standby.

* For backup from standby, we no longer need to instruct the backup software to
copy pg_control last. In fact the backup software should not copy pg_control from
PGDATA at all.

These changes have no impact on current backup software and they are free to use
the pg_control available from pg_stop_backup() or continue to use pg_control from
PGDATA. Of course they will miss the benefits of getting a consistent copy of
pg_control and the backup_label checking, but will be no worse off than before.

Catalog version bump is required.
---
 doc/src/sgml/backup.sgml                    | 18 +++++-
 src/backend/access/transam/xlogfuncs.c      | 20 ++++--
 src/backend/catalog/system_functions.sql    |  2 +-
 src/include/catalog/pg_proc.dat             |  4 +-
 src/test/recovery/t/042_low_level_backup.pl | 67 ++++++++++++++++++++-
 5 files changed, 98 insertions(+), 13 deletions(-)

diff --git a/doc/src/sgml/backup.sgml b/doc/src/sgml/backup.sgml
index 5f7489afbd1..7ea85172d51 100644
--- a/doc/src/sgml/backup.sgml
+++ b/doc/src/sgml/backup.sgml
@@ -1027,9 +1027,12 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
      values. The second of these fields should be written to a file named
      <filename>backup_label</filename> in the root directory of the backup. The
      third field should be written to a file named
-     <filename>tablespace_map</filename> unless the field is empty. These files are
+     <filename>tablespace_map</filename> unless the field is empty. The fourth
+     field should be written into a file named
+     <filename>global/pg_control</filename> (overwriting the existing file when
+     present). These files are
      vital to the backup working and must be written byte for byte without
-     modification, which may require opening the file in binary mode.
+     modification, which will require opening the file in binary mode.
     </para>
    </listitem>
    <listitem>
@@ -1101,7 +1104,16 @@ SELECT * FROM pg_backup_stop(wait_for_archive => true);
    </para>
 
    <para>
-    You should, however, omit from the backup the files within the
+    You should exclude <filename>global/pg_control</filename> from your backup
+    and put the contents of the <parameter>controlfile</parameter> column
+    returned from <function>pg_backup_stop</function> in your backup at
+    <filename>global/pg_control</filename>. This version of pg_control contains
+    safeguards against recovery without backup_label present and is guaranteed
+    not to be torn.
+   </para>
+
+   <para>
+    You should also omit from the backup the files within the
     cluster's <filename>pg_wal/</filename> subdirectory.  This
     slight adjustment is worthwhile because it reduces the risk
     of mistakes when restoring.  This is easy to arrange if
diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c
index 8c3090165f0..e805f231c69 100644
--- a/src/backend/access/transam/xlogfuncs.c
+++ b/src/backend/access/transam/xlogfuncs.c
@@ -113,9 +113,11 @@ pg_backup_start(PG_FUNCTION_ARGS)
  *
  * The backup_label contains the user-supplied label string (typically this
  * would be used to tell where the backup dump will be stored), the starting
- * time, starting WAL location for the dump and so on.  It is the caller's
- * responsibility to write the backup_label and tablespace_map files in the
- * data folder that will be restored from this backup.
+ * time, starting WAL location for the dump and so on.  The pg_control file
+ * contains a consistent copy of pg_control that also has a safeguard against
+ * being used without backup_label.  It is the caller's responsibility to write
+ * the backup_label, pg_control, and tablespace_map files in the data folder
+ * that will be restored from this backup.
  *
  * Permission checking for this function is managed through the normal
  * GRANT system.
@@ -123,12 +125,14 @@ pg_backup_start(PG_FUNCTION_ARGS)
 Datum
 pg_backup_stop(PG_FUNCTION_ARGS)
 {
-#define PG_BACKUP_STOP_V2_COLS 3
+#define PG_BACKUP_STOP_V2_COLS 4
 	TupleDesc	tupdesc;
 	Datum		values[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		nulls[PG_BACKUP_STOP_V2_COLS] = {0};
 	bool		waitforarchive = PG_GETARG_BOOL(0);
 	char	   *backup_label;
+	bytea	   *pg_control_bytea;
+	uint8_t		pg_control[PG_CONTROL_FILE_SIZE];
 	SessionBackupState status = get_backup_status();
 
 	/* Initialize attributes information in the tuple descriptor */
@@ -150,9 +154,17 @@ pg_backup_stop(PG_FUNCTION_ARGS)
 	/* Build the contents of backup_label */
 	backup_label = build_backup_content(backup_state, false);
 
+	/* Build the contents of pg_control */
+	backup_control_file(pg_control);
+
+	pg_control_bytea = (bytea *) palloc(PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	SET_VARSIZE(pg_control_bytea, PG_CONTROL_FILE_SIZE + VARHDRSZ);
+	memcpy(VARDATA(pg_control_bytea), pg_control, PG_CONTROL_FILE_SIZE);
+
 	values[0] = LSNGetDatum(backup_state->stoppoint);
 	values[1] = CStringGetTextDatum(backup_label);
 	values[2] = CStringGetTextDatum(tablespace_map->data);
+	values[3] = PointerGetDatum(pg_control_bytea);
 
 	/* Deallocate backup-related variables */
 	pfree(backup_label);
diff --git a/src/backend/catalog/system_functions.sql b/src/backend/catalog/system_functions.sql
index 566f308e443..f8833393833 100644
--- a/src/backend/catalog/system_functions.sql
+++ b/src/backend/catalog/system_functions.sql
@@ -390,7 +390,7 @@ CREATE OR REPLACE FUNCTION
 
 CREATE OR REPLACE FUNCTION pg_backup_stop (
         wait_for_archive boolean DEFAULT true, OUT lsn pg_lsn,
-        OUT labelfile text, OUT spcmapfile text)
+        OUT labelfile text, OUT spcmapfile text, OUT controlfile bytea)
   RETURNS record STRICT VOLATILE LANGUAGE internal as 'pg_backup_stop'
   PARALLEL RESTRICTED;
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7d05cf89eec..f15bc5260ee 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -6700,8 +6700,8 @@
 { oid => '2739', descr => 'finish taking an online backup',
   proname => 'pg_backup_stop', provolatile => 'v', proparallel => 'r',
   prorettype => 'record', proargtypes => 'bool',
-  proallargtypes => '{bool,pg_lsn,text,text}', proargmodes => '{i,o,o,o}',
-  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile}',
+  proallargtypes => '{bool,pg_lsn,text,text,bytea}', proargmodes => '{i,o,o,o,o}',
+  proargnames => '{wait_for_archive,lsn,labelfile,spcmapfile,controlfile}',
   prosrc => 'pg_backup_stop' },
 { oid => '3436', descr => 'promote standby server',
   proname => 'pg_promote', provolatile => 'v', prorettype => 'bool',
diff --git a/src/test/recovery/t/042_low_level_backup.pl b/src/test/recovery/t/042_low_level_backup.pl
index 5749a1df533..a08a9a9d9df 100644
--- a/src/test/recovery/t/042_low_level_backup.pl
+++ b/src/test/recovery/t/042_low_level_backup.pl
@@ -13,6 +13,42 @@ use PostgreSQL::Test::Cluster;
 use PostgreSQL::Test::Utils;
 use Test::More;
 
+# Decode hex to binary
+sub decode_hex
+{
+	my ($encoded) = @_;
+	my $decoded;
+
+	$encoded =~ s/^\s+|\s+$//g;
+
+	for (my $idx = 0; $idx < length($encoded); $idx += 2)
+	{
+		$decoded .= pack('C', hex(substr($encoded, $idx, 2)));
+	}
+
+	return $decoded;
+}
+
+# Get backup_label/pg_control from pg_stop_backup()
+sub stop_backup_result
+{
+	my ($psql) = @_;
+
+	my $encoded = $psql->query_safe(
+		"select encode(labelfile::bytea, 'hex') || ',' || " .
+		"       encode(controlfile, 'hex')" .
+		"  from pg_backup_stop()");
+
+	my @result;
+
+    foreach my $column (split(',', $encoded))
+	{
+		push(@result, decode_hex($column));
+	}
+
+	return @result;
+}
+
 # Start primary node with archiving.
 my $node_primary = PostgreSQL::Test::Cluster->new('primary');
 $node_primary->init(has_archiving => 1, allows_streaming => 1);
@@ -80,8 +116,7 @@ my $stop_segment_name = $node_primary->safe_psql('postgres',
 	'SELECT pg_walfile_name(pg_current_wal_lsn())');
 
 # Stop backup and get backup_label, the last segment is archived.
-my $backup_label =
-  $psql->query_safe("select labelfile from pg_backup_stop()");
+(my $backup_label, my $pg_control) = stop_backup_result($psql);
 
 $psql->quit;
 
@@ -118,10 +153,36 @@ ok( $node_replica->log_contains(
 $node_replica->teardown_node;
 $node_replica->clean_node;
 
+# Save only pg_control into the backup to demonstrate that pg_control returned
+# from pg_stop_backup() will only perform recovery when backup_label is present.
+open(my $fh, ">", "$backup_dir/global/pg_control")
+  or die "could not open pg_control";
+binmode($fh);
+syswrite($fh, $pg_control);
+close($fh);
+
+$node_replica = PostgreSQL::Test::Cluster->new('replica_fail2');
+$node_replica->init_from_backup($node_primary, $backup_name,
+	has_restoring => 1);
+
+my $res = run_log(
+	[
+		'pg_ctl', '-D', $node_replica->data_dir, '-l',
+		$node_replica->logfile, 'start'
+	]);
+ok(!$res, 'invalid recovery startup fails');
+
+my $logfile = slurp_file($node_replica->logfile());
+ok($logfile =~ qr/could not find backup_label required for recovery/,
+	'could not find backup_label required for recovery');
+
+$node_replica->teardown_node;
+$node_replica->clean_node;
+
 # Save backup_label into the backup directory and recover using the primary's
 # archive.  This time recovery will succeed and the canary table will be
 # present.
-open my $fh, ">>", "$backup_dir/backup_label"
+open $fh, ">>", "$backup_dir/backup_label"
   or die "could not open backup_label";
 # Binary mode is required for Windows, as the backup_label parsing is not
 # able to cope with CRLFs.
-- 
2.34.1