optimize file transfer in pg_upgrade

Started by Nathan Bossart, about 1 year ago, 45 messages
#1 Nathan Bossart
nathandbossart@gmail.com
8 attachment(s)

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.
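
To make the idea concrete, here is a minimal sketch of the per-database file
operations in catalog-swap mode (the paths and database OID 16384 are
hypothetical, error handling is reduced to a bare exit, and the real
implementation lives in do_catalog_transfer() in the attached patches, which
also sets the old catalog files aside rather than overwriting them):

#include <stdio.h>
#include <stdlib.h>

static void
swap_database_dir(void)
{
    /* set the pg_restore-generated database directory aside */
    if (rename("new/base/16384", "moved/16384") != 0)
        exit(1);

    /* adopt the old cluster's database directory wholesale */
    if (rename("old/base/16384", "new/base/16384") != 0)
        exit(1);

    /*
     * Then walk moved/16384 and move only the non-user files (catalog
     * relation files, PG_VERSION, pg_filenode.map, ...) into new/base/16384,
     * so that the new cluster ends up with the restored catalogs but the
     * old cluster's user relation files.
     */
}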

The attached proof-of-concept patches implement this "catalog-swap" mode
for demonstration purposes. I tested this mode on a cluster with 200
databases, each with 10,000 tables with 1,000 rows and 2 unique constraints
apiece. Each database also had 10,000 sequences. The test used 96 jobs.

pg_upgrade --link --sync-method syncfs --> 10m 23s (~5m linking)
pg_upgrade --catalog-swap --> 5m 32s (~30s linking)

While these results are encouraging, there are a couple of interesting
problems to manage. First, in order to move the data directory from the
old cluster to the new cluster, we will have first moved the new cluster's
data directory (full of files created by pg_restore) aside. After the file
transfer stage, this directory will be filled with useless empty files that
should eventually be deleted. Furthermore, none of these files will have
been synchronized to disk (outside of whatever the kernel has done in the
background), so pg_upgrade's data synchronization step can take a very long
time, even when syncfs() is used (so long that pg_upgrade can take even
longer than before). After much testing, the best way I've found to deal
with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.
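
A rough sketch of the rule that special mode applies while walking the data
directory (this simplifies the --no-sync-data-files handling in the attached
initdb patch, which also skips the per-tablespace walks; the relocated
catalog files are fsync()'d separately as they are moved into place):

#include <stdbool.h>
#include <string.h>

/*
 * During "initdb --sync-only --no-sync-data-files", don't recurse into the
 * base/ directory, so the relation files beneath it are never opened or
 * fsync()'d.  WAL, pg_xact, configuration files, and so on are still
 * synchronized as usual.
 */
static bool
should_descend(const char *dirname, bool sync_data_files)
{
    if (!sync_data_files && strcmp(dirname, "base") == 0)
        return false;
    return true;
}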

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.
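
The attached patches (0007 and 0008) arrange this by running pg_dump without
--sequence-data in catalog-swap mode and by including sequences among the
relations whose files pg_upgrade transfers.  A simplified excerpt of the
latter, from get_rel_infos_query() (surrounding query text omitted):

    /*
     * Treat sequences like user tables in catalog-swap mode so that their
     * data files are carried over from the old cluster.
     */
    appendPQExpBuffer(&query,
                      "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
                      CppAsString2(RELKIND_MATVIEW) "%s) AND ...",
                      (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
                      ", " CppAsString2(RELKIND_SEQUENCE) : "");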

There are a couple of other smaller trade-offs with this approach, too.
First, this new mode complicates rollback if, say, the machine loses power
during file transfer. IME the vast majority of failures happen before this
step, and it should be relatively simple to generate a script that will
safely perform the required rollback steps, so I don't think this is a
deal-breaker. Second, this mode leaves around a bunch of files that users
would likely want to clean up at some point. I think the easiest way to
handle this is to just put all these files in the old cluster's data
directory so that the cleanup script generated by pg_upgrade also takes
care of them.

Thoughts?

--
nathan

Attachments:

v1-0001-Export-walkdir.patch (text/plain; charset=us-ascii)
From f800010296b1749b57e0fe3dcde010cc2ba41973 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v1 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files
between database directories during pg_upgrade.
---
 src/common/file_utils.c         | 5 +----
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
 
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
 		bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -39,6 +39,9 @@ extern void sync_pgdata(const char *pg_data, int serverVersion,
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
+extern void walkdir(const char *path,
+					int (*action) (const char *fname, bool isdir),
+					bool process_symlinks);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v1-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patch (text/plain; charset=us-ascii)
From 2d6b0d5708f07203ad2ffbd889404094d0c5969c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 13:59:39 -0600
Subject: [PATCH v1 2/8] Add "void *arg" parameter to walkdir() that is passed
 to function.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This will be used in follow up commits to pass private state to the
functions called by walkdir().
---
 src/bin/pg_basebackup/walmethods.c |  8 +++----
 src/bin/pg_dump/pg_backup_custom.c |  2 +-
 src/bin/pg_dump/pg_backup_tar.c    |  2 +-
 src/bin/pg_dump/pg_dumpall.c       |  2 +-
 src/common/file_utils.c            | 38 +++++++++++++++---------------
 src/include/common/file_utils.h    |  6 ++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 215b24597f..51640cb493 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -251,7 +251,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 */
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tmppath, false) != 0 ||
+		if (fsync_fname(tmppath, false, NULL) != 0 ||
 			fsync_parent_path(tmppath) != 0)
 		{
 			wwmethod->lasterrno = errno;
@@ -486,7 +486,7 @@ dir_close(Walfile *f, WalCloseMethod method)
 			 */
 			if (f->wwmethod->sync)
 			{
-				r = fsync_fname(df->fullpath, false);
+				r = fsync_fname(df->fullpath, false, NULL);
 				if (r == 0)
 					r = fsync_parent_path(df->fullpath);
 			}
@@ -617,7 +617,7 @@ dir_finish(WalWriteMethod *wwmethod)
 		 * Files are fsynced when they are closed, but we need to fsync the
 		 * directory entry here as well.
 		 */
-		if (fsync_fname(dir_data->basedir, true) != 0)
+		if (fsync_fname(dir_data->basedir, true, NULL) != 0)
 		{
 			wwmethod->lasterrno = errno;
 			return false;
@@ -1321,7 +1321,7 @@ tar_finish(WalWriteMethod *wwmethod)
 
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+		if (fsync_fname(tar_data->tarfilename, false, NULL) != 0 ||
 			fsync_parent_path(tar_data->tarfilename) != 0)
 		{
 			wwmethod->lasterrno = errno;
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index ecaad7321a..6f750c916c 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -767,7 +767,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 	/* Sync the output file if one is defined */
 	if (AH->dosync && AH->mode == archModeWrite && AH->fSpec)
-		(void) fsync_fname(AH->fSpec, false);
+		(void) fsync_fname(AH->fSpec, false, NULL);
 
 	AH->FH = NULL;
 }
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index 41ee52b1d6..ecba27b623 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -847,7 +847,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 		/* Sync the output file if one is defined */
 		if (AH->dosync && AH->fSpec)
-			(void) fsync_fname(AH->fSpec, false);
+			(void) fsync_fname(AH->fSpec, false, NULL);
 	}
 
 	AH->FH = NULL;
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index e3ad8fb295..cbb1e3f9e4 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -621,7 +621,7 @@ main(int argc, char *argv[])
 
 		/* sync the resulting file, errors are not fatal */
 		if (dosync)
-			(void) fsync_fname(filename, false);
+			(void) fsync_fname(filename, false, NULL);
 	}
 
 	exit_nicely(0);
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 3f488bf5ec..dc90f35ae1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -46,7 +46,7 @@
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
 #ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
+static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 #ifdef HAVE_SYNCFS
@@ -184,10 +184,10 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +200,10 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -242,10 +242,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -267,8 +267,8 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  */
 void
 walkdir(const char *path,
-		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		int (*action) (const char *fname, bool isdir, void *arg),
+		bool process_symlinks, void *arg)
 {
 	DIR		   *dir;
 	struct dirent *de;
@@ -293,10 +293,10 @@ walkdir(const char *path,
 		switch (get_dirent_type(subpath, de, process_symlinks, PG_LOG_ERROR))
 		{
 			case PGFILETYPE_REG:
-				(*action) (subpath, false);
+				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, arg);
 				break;
 			default:
 
@@ -320,7 +320,7 @@ walkdir(const char *path,
 	 * synced.  Recent versions of ext4 have made the window much wider but
 	 * it's been an issue for ext3 and other filesystems in the past.
 	 */
-	(*action) (path, true);
+	(*action) (path, true, arg);
 }
 
 /*
@@ -332,7 +332,7 @@ walkdir(const char *path,
 #ifdef PG_FLUSH_DATA_WORKS
 
 static int
-pre_sync_fname(const char *fname, bool isdir)
+pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 
@@ -373,7 +373,7 @@ pre_sync_fname(const char *fname, bool isdir)
  * are fatal.
  */
 int
-fsync_fname(const char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 	int			flags;
@@ -444,7 +444,7 @@ fsync_parent_path(const char *fname)
 	if (strlen(parentpath) == 0)
 		strlcpy(parentpath, ".", MAXPGPATH);
 
-	if (fsync_fname(parentpath, true) != 0)
+	if (fsync_fname(parentpath, true, NULL) != 0)
 		return -1;
 
 	return 0;
@@ -467,7 +467,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * because it's then guaranteed that either source or target file exists
 	 * after a crash.
 	 */
-	if (fsync_fname(oldfile, false) != 0)
+	if (fsync_fname(oldfile, false, NULL) != 0)
 		return -1;
 
 	fd = open(newfile, PG_BINARY | O_RDWR, 0);
@@ -502,7 +502,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * To guarantee renaming the file is persistent, fsync the file with its
 	 * new name, and its containing directory.
 	 */
-	if (fsync_fname(newfile, false) != 0)
+	if (fsync_fname(newfile, false, NULL) != 0)
 		return -1;
 
 	if (fsync_parent_path(newfile) != 0)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 5a9519acfe..c328f56a85 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,15 +33,15 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
-extern int	fsync_fname(const char *fname, bool isdir);
+extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					int (*action) (const char *fname, bool isdir, void *arg),
+					bool process_symlinks, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v1-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patch (text/plain; charset=us-ascii)
From c248059f7276578f5e8abd00ea9efb007423f94d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:38:19 -0600
Subject: [PATCH v1 3/8] Introduce catalog-swap mode for pg_upgrade.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new mode moves the database directories from the old cluster
to the new cluster and then swaps the pg_restore-generated catalog
files in place.  This can significantly increase the length of the
following data synchronization step (due to the large number of
unsynchronized pg_restore-generated files), but this problem will
be handled in follow-up commits.
---
 src/bin/pg_upgrade/check.c         |   2 +
 src/bin/pg_upgrade/option.c        |   5 +
 src/bin/pg_upgrade/pg_upgrade.h    |   1 +
 src/bin/pg_upgrade/relfilenumber.c | 167 +++++++++++++++++++++++++++++
 src/tools/pgindent/typedefs.list   |   1 +
 5 files changed, 176 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 94164f0472..a4bb365718 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -711,6 +711,8 @@ check_new_cluster(void)
 		case TRANSFER_MODE_LINK:
 			check_hard_link();
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			break;
 	}
 
 	check_is_install_user(&new_cluster);
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 6f41d63eed..64091a54c4 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
 		{"copy", no_argument, NULL, 2},
 		{"copy-file-range", no_argument, NULL, 3},
 		{"sync-method", required_argument, NULL, 4},
+		{"catalog-swap", no_argument, NULL, 5},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -212,6 +213,10 @@ parseCommandLine(int argc, char *argv[])
 				user_opts.sync_method = pg_strdup(optarg);
 				break;
 
+			case 5:
+				user_opts.transfer_mode = TRANSFER_MODE_CATALOG_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..19cb5a011e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -256,6 +256,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_CATALOG_SWAP,
 } transferMode;
 
 /*
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 07baa49a02..9d8fce3c4a 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,21 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "fe_utils/option_utils.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+typedef struct move_catalog_file_context
+{
+	FileNameMap *maps;
+	int			size;
+	char	   *target;
+} move_catalog_file_context;
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +51,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			prep_status_progress("Swapping catalog files");
+			break;
 	}
 
 	/*
@@ -127,6 +140,144 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 	}
 }
 
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	return pg_cmp_u32(((const FileNameMap *) a)->relfilenumber,
+					  ((const FileNameMap *) b)->relfilenumber);
+}
+
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+static int
+move_catalog_file(const char *fname, bool isdir, void *arg)
+{
+	char		dst[MAXPGPATH];
+	const char *filename = last_dir_separator(fname) + 1;
+	RelFileNumber rfn = parse_relfilenumber(filename);
+	move_catalog_file_context *context = (move_catalog_file_context *) arg;
+
+	/*
+	 * XXX: Is this right?  AFAICT we don't really expect there to be
+	 * directories within database directories, so perhaps it would be better
+	 * to either unconditionally rename or to fail.  Further investigation is
+	 * required.
+	 */
+	if (isdir)
+		return 0;
+
+	if (RelFileNumberIsValid(rfn))
+	{
+		FileNameMap key;
+
+		key.relfilenumber = (RelFileNumber) rfn;
+		if (bsearch(&key, context->maps, context->size,
+					sizeof(FileNameMap), FileNameMapCmp))
+			return 0;
+	}
+
+	snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
+	if (rename(fname, dst) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
+
+	return 0;
+}
+
+/*
+ * XXX: This proof-of-concept patch doesn't yet handle non-default tablespaces.
+ */
+static void
+do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+	char		old_cat[MAXPGPATH];
+	move_catalog_file_context context;
+	DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+	parse_sync_method(user_opts.sync_method, &sync_method);
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s",
+			 maps[0].old_tablespace, maps[0].old_tablespace_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s",
+			 maps[0].new_tablespace, maps[0].new_tablespace_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, maps[0].db_oid);
+	snprintf(new_dat, sizeof(new_dat), "%s/%u", new_tblspc, maps[0].db_oid);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(moved_dat, sizeof(moved_dat), "%s/%u",
+			 moved_tblspc, maps[0].db_oid);
+	snprintf(old_cat, sizeof(old_cat), "%s/%u_old_cat",
+			 moved_tblspc, maps[0].db_oid);
+
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/* create dir for stuff that is moved aside */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\": %m", moved_tblspc);
+
+	/* move new cluster data dir aside */
+	if (rename(new_dat, moved_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", new_dat, moved_dat);
+
+	/* move old cluster data dir in place */
+	if (rename(old_dat, new_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", old_dat, new_dat);
+
+	/* create dir for old catalogs */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode))
+		pg_fatal("could not create directory \"%s\": %m", old_cat);
+
+	/* move catalogs in new data dir aside */
+	context.maps = maps;
+	context.size = size;
+	context.target = old_cat;
+	walkdir(new_dat, move_catalog_file, false, &context);
+
+	/* move catalogs in moved-aside data dir in place */
+	context.target = new_dat;
+	walkdir(moved_dat, move_catalog_file, false, &context);
+
+	/* no need to sync things individually if we are going to syncfs() later */
+	if (sync_method == DATA_DIR_SYNC_METHOD_SYNCFS)
+		return;
+
+	/* fsync directory entries */
+	if (fsync_fname(moved_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
+	if (fsync_fname(old_cat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+
+	/*
+	 * XXX: We could instead fsync() these directories once at the end instead
+	 * of once per-database, but it doesn't affect performance meaningfully,
+	 * and this is just a proof-of-concept patch, so I haven't bothered doing
+	 * the required refactoring yet.
+	 */
+	if (fsync_fname(old_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
+	if (fsync_fname(moved_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+}
+
 /*
  * transfer_single_new_db()
  *
@@ -145,6 +296,18 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/*
+	 * XXX: In catalog-swap mode, vm_must_add_frozenbit isn't handled yet.  We
+	 * could either disallow using catalog-swap mode if the upgrade involves
+	 * versions older than v9.6, or we could add code to handle rewriting the
+	 * visibility maps in this mode (like the other modes do).
+	 */
+	if (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP)
+	{
+		do_catalog_transfer(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +422,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_CATALOG_SWAP:
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 1847bbfa95..58c339af85 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3649,6 +3649,7 @@ mix_data_t
 mixedStruct
 mode_t
 movedb_failure_params
+move_catalog_file_context
 multirange_bsearch_comparison
 multirange_unnest_fctx
 mxact
-- 
2.39.5 (Apple Git-154)

v1-0004-Add-no-sync-data-files-flag-to-initdb.patch (text/plain; charset=us-ascii)
From 1fd29fa00c777b2c04683394f54261b446e46a61 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:47:42 -0600
Subject: [PATCH v1 4/8] Add --no-sync-data-files flag to initdb.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new mode causes 'initdb --sync-only' to synchronize everything
except for the database directories.  It will be used in a
follow-up commit that aims to reduce the duration of the data
synchronization step in pg_upgrade's catalog-swap mode.
---
 src/bin/initdb/initdb.c                     |  9 ++++--
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 35 ++++++++++++++++-----
 src/include/common/file_utils.h             |  3 +-
 7 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..53c6e86a80 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -3183,6 +3184,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3377,6 +3379,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3428,7 +3433,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3491,7 +3496,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index e41a6cfbda..43526e3246 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b86bc417c9..06ccaacfda 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5f1f62f1db..80a137be4e 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 67a86bb4c5..ceb1c3ac6d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index dc90f35ae1..65cdf07ae7 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -94,7 +94,8 @@ do_syncfs(const char *path)
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -184,10 +185,11 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false, NULL);
+				walkdir(pg_data, pre_sync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false, NULL);
-				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
+					walkdir(pg_wal, pre_sync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +202,11 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false, NULL);
+				walkdir(pg_data, fsync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false, NULL);
-				walkdir(pg_tblspc, fsync_fname, true, NULL);
+					walkdir(pg_wal, fsync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -296,7 +299,23 @@ walkdir(const char *path,
 				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false, arg);
+
+				/*
+				 * XXX: Checking here for the "sync_data_files" case is quite
+				 * hacky, but it's not clear how to do better.  Another option
+				 * would be to send "de" down to the function, but that would
+				 * introduce a huge number of function pointer calls and
+				 * directory reads that we are trying to avoid.
+				 */
+#ifdef PG_FLUSH_DATA_WORKS
+				if ((action != pre_sync_fname && action != fsync_fname) ||
+#else
+				if (action != fsync_fname ||
+#endif
+					!arg || *((bool *) arg) ||
+					strcmp(de->d_name, "base") != 0)
+					walkdir(subpath, action, false, arg);
+
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index c328f56a85..3743caa63e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,8 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method,
+						bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v1-0005-Export-pre_sync_fname.patch (text/plain; charset=us-ascii)
From e70eea50e81d8f40fb8db15c06b23305d4b8698f Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 09:52:19 -0600
Subject: [PATCH v1 5/8] Export pre_sync_fname().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to alert the file system
that we want a file's data on disk so that subsequent calls to
fsync() are faster.
---
 src/common/file_utils.c         | 18 +++++-------------
 src/include/common/file_utils.h |  1 +
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 65cdf07ae7..5c201ec6e8 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,10 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
-#endif
-
 #ifdef HAVE_SYNCFS
 
 /*
@@ -307,11 +303,7 @@ walkdir(const char *path,
 				 * introduce a huge number of function pointer calls and
 				 * directory reads that we are trying to avoid.
 				 */
-#ifdef PG_FLUSH_DATA_WORKS
 				if ((action != pre_sync_fname && action != fsync_fname) ||
-#else
-				if (action != fsync_fname ||
-#endif
 					!arg || *((bool *) arg) ||
 					strcmp(de->d_name, "base") != 0)
 					walkdir(subpath, action, false, arg);
@@ -348,11 +340,12 @@ walkdir(const char *path,
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
+#ifndef PG_FLUSH_DATA_WORKS
+	return 0;
+#else
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -380,9 +373,8 @@ pre_sync_fname(const char *fname, bool isdir, void *arg)
 
 	(void) close(fd);
 	return 0;
-}
-
 #endif							/* PG_FLUSH_DATA_WORKS */
+}
 
 /*
  * fsync_fname -- Try to fsync a file or directory
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 3743caa63e..e7a34d4c4e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -43,6 +43,7 @@ extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir, void *arg),
 					bool process_symlinks, void *arg);
+extern int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v1-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patch (text/plain; charset=us-ascii)
From 5d17fd66e08612574ffe0c39ff7624259319059c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:40:43 -0600
Subject: [PATCH v1 6/8] In pg_upgrade's catalog-swap mode, only sync files as
 necessary.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

In this mode, it can be much faster to use "--sync-method fsync",
which now skips synchronizing data files moved from the old cluster
(which we assumed were synchronized before pg_upgrade).
---
 src/bin/pg_upgrade/pg_upgrade.c    |  6 ++--
 src/bin/pg_upgrade/relfilenumber.c | 52 ++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..f5946ac89a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -210,10 +210,12 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s %s",
 				  new_cluster.bindir,
 				  new_cluster.pgdata,
-				  user_opts.sync_method);
+				  user_opts.sync_method,
+				  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+				  "--no-sync-data-files" : "");
 		check_ok();
 	}
 
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 9d8fce3c4a..dcca4bb2e7 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -25,8 +25,49 @@ typedef struct move_catalog_file_context
 	FileNameMap *maps;
 	int			size;
 	char	   *target;
+	bool		sync_moved;
 } move_catalog_file_context;
 
+#define SYNC_QUEUE_MAX_LEN (1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+		fsync_fname(sync_queue[i], false, NULL);
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false, NULL);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
 /*
  * transfer_all_new_tablespaces()
  *
@@ -138,6 +179,8 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	sync_queue_sync_all();
 }
 
 static int
@@ -195,6 +238,9 @@ move_catalog_file(const char *fname, bool isdir, void *arg)
 	if (rename(fname, dst) != 0)
 		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
 
+	if (context->sync_moved)
+		sync_queue_push(dst);
+
 	return 0;
 }
 
@@ -250,10 +296,12 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 	context.maps = maps;
 	context.size = size;
 	context.target = old_cat;
+	context.sync_moved = false;
 	walkdir(new_dat, move_catalog_file, false, &context);
 
 	/* move catalogs in moved-aside data dir in place */
 	context.target = new_dat;
+	context.sync_moved = (sync_method != DATA_DIR_SYNC_METHOD_SYNCFS);
 	walkdir(moved_dat, move_catalog_file, false, &context);
 
 	/* no need to sync things individually if we are going to syncfs() later */
@@ -265,6 +313,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
 	if (fsync_fname(old_cat, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+	if (fsync_fname(new_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
 
 	/*
 	 * XXX: We could instead fsync() these directories once at the end instead
@@ -276,6 +326,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
 	if (fsync_fname(moved_tblspc, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+	if (fsync_fname(new_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_tblspc);
 }
 
 /*
-- 
2.39.5 (Apple Git-154)

v1-0007-Add-sequence-data-flag-to-pg_dump.patch (text/plain; charset=us-ascii)
From 234d40f4e45d56f40465037dcf83c8c980edf095 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:46:11 -0600
Subject: [PATCH v1 7/8] Add --sequence-data flag to pg_dump.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This flag can be used to optionally dump the sequence data even
when --schema-only is used.  It is primarily intended for use in a
follow-up commit that will cause sequence data files to be carried
over from the old cluster in pg_upgrade's new catalog-swap mode.
---
 src/bin/pg_dump/pg_dump.c                   | 9 +--------
 src/bin/pg_upgrade/dump.c                   | 2 +-
 src/test/modules/test_pg_dump/t/001_base.pl | 2 +-
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index b2f4eb2c6d..2b57abd305 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -501,6 +501,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -768,14 +769,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (dopt.dataOnly && dopt.schemaOnly)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8345f55be8..8453722833 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -53,7 +53,7 @@ generate_old_dump(void)
 
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index e2579e29cd..46231c93f1 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--binary-upgrade', '--sequence-data', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v1-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patch (text/plain; charset=us-ascii)
From 51be6c09256272e1ce0360b5376b4a14cd1d9a61 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:53:40 -0600
Subject: [PATCH v1 8/8] Avoid copying sequence files in pg_upgrade's
 catalog-swap mode.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

On clusters with many sequences, this can further reduce the amount
of time required to wire up the data files in the new cluster.  If
the sequence data file format changes, this optimization cannot be
used, but that seems rare enough.
---
 src/bin/pg_upgrade/dump.c | 8 +++++++-
 src/bin/pg_upgrade/info.c | 6 +++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8453722833..d5a81cc29c 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -51,10 +51,16 @@ generate_old_dump(void)
 		snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
+		/*
+		 * XXX: We need to be sure that the sequence data format hasn't
+		 * changed.
+		 */
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade %s --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
 						   sql_file_name, escaped_connstr.data);
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index f83ded89cb..786d17e32f 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -483,6 +483,8 @@ get_rel_infos_query(void)
 	 * pg_largeobject contains user data that does not appear in pg_dump
 	 * output, so we have to copy that system table.  It's easiest to do that
 	 * by treating it as a user table.
+	 *
+	 * XXX: We need to be sure that the sequence data format hasn't changed.
 	 */
 	appendPQExpBuffer(&query,
 					  "WITH regular_heap (reloid, indtable, toastheap) AS ( "
@@ -490,7 +492,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +501,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
-- 
2.39.5 (Apple Git-154)

#2 Greg Sabino Mullane
htamfids@gmail.com
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

Therefore, it can be much faster to instead move the entire data directory
from the old cluster to the new cluster and to then swap the catalog
relation files.

Thank you for breaking this up so clearly into separate commits. I think it
is a very interesting idea, and anything to speed up pg_upgrade is always
welcome. Some minor thoughts:

[PATCH v1 3/8] Introduce catalog-swap mode for pg_upgrade.
.. we don't really expect there to be directories within database
directories, so perhaps it would be better to either unconditionally rename
or to fail.

Failure seems the best option here, so we can cleanly handle any future
cases in which we decide to put dirs in this directory.

    if (RelFileNumberIsValid(rfn))
    {
        FileNameMap key;

        key.relfilenumber = (RelFileNumber) rfn;
        if (bsearch(&key, context->maps, context->size,
                    sizeof(FileNameMap), FileNameMapCmp))
            return 0;
    }

    snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
    if (rename(fname, dst) != 0)

I'm not quite clear what we are doing here with falling through
for InvalidOid entries, could you explain?

.. vm_must_add_frozenbit isn't handled yet. We could either disallow
using catalog-swap mode if the upgrade involves versions older than v9.6

Yes, this. No need for more code to handle super old versions when other
options exist.

with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Very cool approach!

Cheers,
Greg

#3 Bruce Momjian
bruce@momjian.us
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 04:07:35PM -0600, Nathan Bossart wrote:

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

That is certainly a creative idea. I am surprised the links take so
long. Obviously rollback would be hard, as you mentioned, whereas today you
can roll back a --link upgrade right up until you start the new cluster. I
think it clearly should be considered. The patch is smaller than I expected.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

When a patient asks the doctor, "Am I going to die?", he means
"Am I going to die soon?"

#4 Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Sabino Mullane (#2)
Re: optimize file transfer in pg_upgrade

On Sun, Nov 17, 2024 at 01:50:53PM -0500, Greg Sabino Mullane wrote:

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

Therefore, it can be much faster to instead move the entire data directory
from the old cluster to the new cluster and to then swap the catalog
relation files.

Thank you for breaking this up so clearly into separate commits. I think it
is a very interesting idea, and anything to speed up pg_upgrade is always
welcome. Some minor thoughts:

Thank you for reviewing!

.. we don't really expect there to be directories within database

directories,

so perhaps it would be better to either unconditionally rename or to fail.

Failure seems the best option here, so we can cleanly handle any future
cases in which we decide to put dirs in this directory.

Good point.

    if (RelFileNumberIsValid(rfn))
    {
        FileNameMap key;

        key.relfilenumber = (RelFileNumber) rfn;
        if (bsearch(&key, context->maps, context->size,
                    sizeof(FileNameMap), FileNameMapCmp))
            return 0;
    }

    snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
    if (rename(fname, dst) != 0)

I'm not quite clear what we are doing here with falling through
for InvalidOid entries, could you explain?

The idea is that if it looks like a data file that we might want to
transfer (i.e., it starts with a RelFileNumber), we should consult our map
to determine whether to move it. Otherwise, we want to unconditionally
transfer it so that we always use the files generated during pg_restore in
the new cluster (e.g., PG_VERSION and pg_filenode.map). In theory, this
should result in the same end state as what --link mode does today (for the
new cluster, at least).

.. vm_must_add_frozenbit isn't handled yet. We could either disallow
using catalog-swap mode if the upgrade involves versions older than v9.6

Yes, this. No need for more code to handle super old versions when other
options exist.

I'm inclined to agree.

with this problem is to introduce a special mode for "initdb --sync-only"
that calls fsync() for everything _except_ the actual data files. If we
fsync() the new catalog files as we move them into place, and if we assume
that the old catalog files will have been properly synchronized before
upgrading, there's no reason to synchronize them again at the end.

Very cool approach!

:)

--
nathan

#5 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bruce Momjian (#3)
Re: optimize file transfer in pg_upgrade

On Mon, Nov 18, 2024 at 10:34:00PM -0500, Bruce Momjian wrote:

On Wed, Nov 6, 2024 at 04:07:35PM -0600, Nathan Bossart wrote:

For clusters with many relations, the file transfer step of pg_upgrade can
take the longest. This step clones, copies, or links the user relation
files from the older cluster to the new cluster, so the amount of time it
takes is closely related to the number of relations. However, since v15,
we've preserved the relfilenodes during pg_upgrade, which means that all of
these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

That is certainly a creative idea. I am surprised the links take so
long. Obviously rollback would be hard, as you mentioned, whereas today you
can roll back a --link upgrade right up until you start the new cluster. I
think it clearly should be considered.

I've yet to try, but I'm cautiously optimistic that it will be possible to
generate simple scripts that can unwind things by just looking at the
directory entries, even if pg_upgrade crashed halfway through the linking
stage.

The patch is smaller than I expected.

I was surprised by this, too. Obviously, this one is a bit smaller than
the "real" patches will be because it's just a proof-of-concept, but it
should still be pretty manageable.

--
nathan

#6 Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#4)
8 attachment(s)
Re: optimize file transfer in pg_upgrade

Here is a rebased patch set for cfbot. I'm planning to spend some time
getting these patches into a more reviewable state in the near future.

--
nathan

Attachments:

v2-0001-Export-walkdir.patch (text/plain; charset=us-ascii)
From 81fe66e0f0aa4f958a8707df669f60756c89bb85 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 15:59:51 -0600
Subject: [PATCH v2 1/8] Export walkdir().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to swap catalog files
between database directories during pg_upgrade.
---
 src/common/file_utils.c         | 5 +----
 src/include/common/file_utils.h | 3 +++
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 398fe1c334..3f488bf5ec 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -48,9 +48,6 @@
 #ifdef PG_FLUSH_DATA_WORKS
 static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
-static void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
 
 #ifdef HAVE_SYNCFS
 
@@ -268,7 +265,7 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
  */
-static void
+void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
 		bool process_symlinks)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index e4339fb7b6..5a9519acfe 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -39,6 +39,9 @@ extern void sync_pgdata(const char *pg_data, int serverVersion,
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
+extern void walkdir(const char *path,
+					int (*action) (const char *fname, bool isdir),
+					bool process_symlinks);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v2-0002-Add-void-arg-parameter-to-walkdir-that-is-passed-.patch (text/plain; charset=us-ascii)
From 36d6a1aad5cbfeb05954886bb336cfa9ec01c5c3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 13:59:39 -0600
Subject: [PATCH v2 2/8] Add "void *arg" parameter to walkdir() that is passed
 to function.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This will be used in follow up commits to pass private state to the
functions called by walkdir().
---
 src/bin/pg_basebackup/walmethods.c |  8 +++----
 src/bin/pg_dump/pg_backup_custom.c |  2 +-
 src/bin/pg_dump/pg_backup_tar.c    |  2 +-
 src/bin/pg_dump/pg_dumpall.c       |  2 +-
 src/common/file_utils.c            | 38 +++++++++++++++---------------
 src/include/common/file_utils.h    |  6 ++---
 6 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/src/bin/pg_basebackup/walmethods.c b/src/bin/pg_basebackup/walmethods.c
index 215b24597f..51640cb493 100644
--- a/src/bin/pg_basebackup/walmethods.c
+++ b/src/bin/pg_basebackup/walmethods.c
@@ -251,7 +251,7 @@ dir_open_for_write(WalWriteMethod *wwmethod, const char *pathname,
 	 */
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tmppath, false) != 0 ||
+		if (fsync_fname(tmppath, false, NULL) != 0 ||
 			fsync_parent_path(tmppath) != 0)
 		{
 			wwmethod->lasterrno = errno;
@@ -486,7 +486,7 @@ dir_close(Walfile *f, WalCloseMethod method)
 			 */
 			if (f->wwmethod->sync)
 			{
-				r = fsync_fname(df->fullpath, false);
+				r = fsync_fname(df->fullpath, false, NULL);
 				if (r == 0)
 					r = fsync_parent_path(df->fullpath);
 			}
@@ -617,7 +617,7 @@ dir_finish(WalWriteMethod *wwmethod)
 		 * Files are fsynced when they are closed, but we need to fsync the
 		 * directory entry here as well.
 		 */
-		if (fsync_fname(dir_data->basedir, true) != 0)
+		if (fsync_fname(dir_data->basedir, true, NULL) != 0)
 		{
 			wwmethod->lasterrno = errno;
 			return false;
@@ -1321,7 +1321,7 @@ tar_finish(WalWriteMethod *wwmethod)
 
 	if (wwmethod->sync)
 	{
-		if (fsync_fname(tar_data->tarfilename, false) != 0 ||
+		if (fsync_fname(tar_data->tarfilename, false, NULL) != 0 ||
 			fsync_parent_path(tar_data->tarfilename) != 0)
 		{
 			wwmethod->lasterrno = errno;
diff --git a/src/bin/pg_dump/pg_backup_custom.c b/src/bin/pg_dump/pg_backup_custom.c
index e44b887eb2..51edf147d6 100644
--- a/src/bin/pg_dump/pg_backup_custom.c
+++ b/src/bin/pg_dump/pg_backup_custom.c
@@ -767,7 +767,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 	/* Sync the output file if one is defined */
 	if (AH->dosync && AH->mode == archModeWrite && AH->fSpec)
-		(void) fsync_fname(AH->fSpec, false);
+		(void) fsync_fname(AH->fSpec, false, NULL);
 
 	AH->FH = NULL;
 }
diff --git a/src/bin/pg_dump/pg_backup_tar.c b/src/bin/pg_dump/pg_backup_tar.c
index b5ba3b46dd..5ea6a472d4 100644
--- a/src/bin/pg_dump/pg_backup_tar.c
+++ b/src/bin/pg_dump/pg_backup_tar.c
@@ -847,7 +847,7 @@ _CloseArchive(ArchiveHandle *AH)
 
 		/* Sync the output file if one is defined */
 		if (AH->dosync && AH->fSpec)
-			(void) fsync_fname(AH->fSpec, false);
+			(void) fsync_fname(AH->fSpec, false, NULL);
 	}
 
 	AH->FH = NULL;
diff --git a/src/bin/pg_dump/pg_dumpall.c b/src/bin/pg_dump/pg_dumpall.c
index 9a04e51c81..58a9f6e748 100644
--- a/src/bin/pg_dump/pg_dumpall.c
+++ b/src/bin/pg_dump/pg_dumpall.c
@@ -621,7 +621,7 @@ main(int argc, char *argv[])
 
 		/* sync the resulting file, errors are not fatal */
 		if (dosync)
-			(void) fsync_fname(filename, false);
+			(void) fsync_fname(filename, false, NULL);
 	}
 
 	exit_nicely(0);
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 3f488bf5ec..dc90f35ae1 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -46,7 +46,7 @@
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
 #ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
+static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 #ifdef HAVE_SYNCFS
@@ -184,10 +184,10 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +200,10 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, NULL);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -242,10 +242,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -267,8 +267,8 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  */
 void
 walkdir(const char *path,
-		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		int (*action) (const char *fname, bool isdir, void *arg),
+		bool process_symlinks, void *arg)
 {
 	DIR		   *dir;
 	struct dirent *de;
@@ -293,10 +293,10 @@ walkdir(const char *path,
 		switch (get_dirent_type(subpath, de, process_symlinks, PG_LOG_ERROR))
 		{
 			case PGFILETYPE_REG:
-				(*action) (subpath, false);
+				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, arg);
 				break;
 			default:
 
@@ -320,7 +320,7 @@ walkdir(const char *path,
 	 * synced.  Recent versions of ext4 have made the window much wider but
 	 * it's been an issue for ext3 and other filesystems in the past.
 	 */
-	(*action) (path, true);
+	(*action) (path, true, arg);
 }
 
 /*
@@ -332,7 +332,7 @@ walkdir(const char *path,
 #ifdef PG_FLUSH_DATA_WORKS
 
 static int
-pre_sync_fname(const char *fname, bool isdir)
+pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 
@@ -373,7 +373,7 @@ pre_sync_fname(const char *fname, bool isdir)
  * are fatal.
  */
 int
-fsync_fname(const char *fname, bool isdir)
+fsync_fname(const char *fname, bool isdir, void *arg)
 {
 	int			fd;
 	int			flags;
@@ -444,7 +444,7 @@ fsync_parent_path(const char *fname)
 	if (strlen(parentpath) == 0)
 		strlcpy(parentpath, ".", MAXPGPATH);
 
-	if (fsync_fname(parentpath, true) != 0)
+	if (fsync_fname(parentpath, true, NULL) != 0)
 		return -1;
 
 	return 0;
@@ -467,7 +467,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * because it's then guaranteed that either source or target file exists
 	 * after a crash.
 	 */
-	if (fsync_fname(oldfile, false) != 0)
+	if (fsync_fname(oldfile, false, NULL) != 0)
 		return -1;
 
 	fd = open(newfile, PG_BINARY | O_RDWR, 0);
@@ -502,7 +502,7 @@ durable_rename(const char *oldfile, const char *newfile)
 	 * To guarantee renaming the file is persistent, fsync the file with its
 	 * new name, and its containing directory.
 	 */
-	if (fsync_fname(newfile, false) != 0)
+	if (fsync_fname(newfile, false, NULL) != 0)
 		return -1;
 
 	if (fsync_parent_path(newfile) != 0)
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 5a9519acfe..c328f56a85 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,15 +33,15 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
-extern int	fsync_fname(const char *fname, bool isdir);
+extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
-					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					int (*action) (const char *fname, bool isdir, void *arg),
+					bool process_symlinks, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v2-0003-Introduce-catalog-swap-mode-for-pg_upgrade.patch
From f5eca13b8b04760977ab41ef9cd023a47e5cbbbd Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:38:19 -0600
Subject: [PATCH v2 3/8] Introduce catalog-swap mode for pg_upgrade.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new mode moves the database directories from the old cluster
to the new cluster and then swaps the pg_restore-generated catalog
files in place.  This can significantly increase the length of the
following data synchronization step (due to the large number of
unsynchronized pg_restore-generated files), but this problem will
be handled in follow-up commits.
---
 src/bin/pg_upgrade/check.c         |   2 +
 src/bin/pg_upgrade/option.c        |   5 +
 src/bin/pg_upgrade/pg_upgrade.h    |   1 +
 src/bin/pg_upgrade/relfilenumber.c | 167 +++++++++++++++++++++++++++++
 src/tools/pgindent/typedefs.list   |   1 +
 5 files changed, 176 insertions(+)

diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 94164f0472..a4bb365718 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -711,6 +711,8 @@ check_new_cluster(void)
 		case TRANSFER_MODE_LINK:
 			check_hard_link();
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			break;
 	}
 
 	check_is_install_user(&new_cluster);
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 6f41d63eed..64091a54c4 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -60,6 +60,7 @@ parseCommandLine(int argc, char *argv[])
 		{"copy", no_argument, NULL, 2},
 		{"copy-file-range", no_argument, NULL, 3},
 		{"sync-method", required_argument, NULL, 4},
+		{"catalog-swap", no_argument, NULL, 5},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -212,6 +213,10 @@ parseCommandLine(int argc, char *argv[])
 				user_opts.sync_method = pg_strdup(optarg);
 				break;
 
+			case 5:
+				user_opts.transfer_mode = TRANSFER_MODE_CATALOG_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 53f693c2d4..19cb5a011e 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -256,6 +256,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_CATALOG_SWAP,
 } transferMode;
 
 /*
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 07baa49a02..9d8fce3c4a 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,21 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "fe_utils/option_utils.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+typedef struct move_catalog_file_context
+{
+	FileNameMap *maps;
+	int			size;
+	char	   *target;
+} move_catalog_file_context;
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +51,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_CATALOG_SWAP:
+			prep_status_progress("Swapping catalog files");
+			break;
 	}
 
 	/*
@@ -127,6 +140,144 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 	}
 }
 
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	return pg_cmp_u32(((const FileNameMap *) a)->relfilenumber,
+					  ((const FileNameMap *) b)->relfilenumber);
+}
+
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+static int
+move_catalog_file(const char *fname, bool isdir, void *arg)
+{
+	char		dst[MAXPGPATH];
+	const char *filename = last_dir_separator(fname) + 1;
+	RelFileNumber rfn = parse_relfilenumber(filename);
+	move_catalog_file_context *context = (move_catalog_file_context *) arg;
+
+	/*
+	 * XXX: Is this right?  AFAICT we don't really expect there to be
+	 * directories within database directories, so perhaps it would be better
+	 * to either unconditionally rename or to fail.  Further investigation is
+	 * required.
+	 */
+	if (isdir)
+		return 0;
+
+	if (RelFileNumberIsValid(rfn))
+	{
+		FileNameMap key;
+
+		key.relfilenumber = (RelFileNumber) rfn;
+		if (bsearch(&key, context->maps, context->size,
+					sizeof(FileNameMap), FileNameMapCmp))
+			return 0;
+	}
+
+	snprintf(dst, sizeof(dst), "%s/%s", context->target, filename);
+	if (rename(fname, dst) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
+
+	return 0;
+}
+
+/*
+ * XXX: This proof-of-concept patch doesn't yet handle non-default tablespaces.
+ */
+static void
+do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+	char		old_cat[MAXPGPATH];
+	move_catalog_file_context context;
+	DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+
+	parse_sync_method(user_opts.sync_method, &sync_method);
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s",
+			 maps[0].old_tablespace, maps[0].old_tablespace_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s",
+			 maps[0].new_tablespace, maps[0].new_tablespace_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, maps[0].db_oid);
+	snprintf(new_dat, sizeof(new_dat), "%s/%u", new_tblspc, maps[0].db_oid);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(moved_dat, sizeof(moved_dat), "%s/%u",
+			 moved_tblspc, maps[0].db_oid);
+	snprintf(old_cat, sizeof(old_cat), "%s/%u_old_cat",
+			 moved_tblspc, maps[0].db_oid);
+
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/* create dir for stuff that is moved aside */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\": %m", moved_tblspc);
+
+	/* move new cluster data dir aside */
+	if (rename(new_dat, moved_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", new_dat, moved_dat);
+
+	/* move old cluster data dir in place */
+	if (rename(old_dat, new_dat))
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", old_dat, new_dat);
+
+	/* create dir for old catalogs */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode))
+		pg_fatal("could not create directory \"%s\": %m", old_cat);
+
+	/* move catalogs in new data dir aside */
+	context.maps = maps;
+	context.size = size;
+	context.target = old_cat;
+	walkdir(new_dat, move_catalog_file, false, &context);
+
+	/* move catalogs in moved-aside data dir in place */
+	context.target = new_dat;
+	walkdir(moved_dat, move_catalog_file, false, &context);
+
+	/* no need to sync things individually if we are going to syncfs() later */
+	if (sync_method == DATA_DIR_SYNC_METHOD_SYNCFS)
+		return;
+
+	/* fsync directory entries */
+	if (fsync_fname(moved_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
+	if (fsync_fname(old_cat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+
+	/*
+	 * XXX: We could instead fsync() these directories once at the end instead
+	 * of once per-database, but it doesn't affect performance meaningfully,
+	 * and this is just a proof-of-concept patch, so I haven't bothered doing
+	 * the required refactoring yet.
+	 */
+	if (fsync_fname(old_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
+	if (fsync_fname(moved_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+}
+
 /*
  * transfer_single_new_db()
  *
@@ -145,6 +296,18 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/*
+	 * XXX: In catalog-swap mode, vm_must_add_frozenbit isn't handled yet.  We
+	 * could either disallow using catalog-swap mode if the upgrade involves
+	 * versions older than v9.6, or we could add code to handle rewriting the
+	 * visibility maps in this mode (like the other modes do).
+	 */
+	if (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP)
+	{
+		do_catalog_transfer(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +422,10 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_CATALOG_SWAP:
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 2d4c870423..f721f934c0 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -3655,6 +3655,7 @@ mix_data_t
 mixedStruct
 mode_t
 movedb_failure_params
+move_catalog_file_context
 multirange_bsearch_comparison
 multirange_unnest_fctx
 mxact
-- 
2.39.5 (Apple Git-154)

v2-0004-Add-no-sync-data-files-flag-to-initdb.patch
From 44d566652775b2a88789e181d95a15b92ac913c6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 5 Nov 2024 16:47:42 -0600
Subject: [PATCH v2 4/8] Add --no-sync-data-files flag to initdb.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This new flag causes 'initdb --sync-only' to synchronize everything
except for the database directories.  It will be used in a
follow-up commit that aims to reduce the duration of the data
synchronization step in pg_upgrade's catalog-swap mode.
---
 src/bin/initdb/initdb.c                     |  9 ++++--
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 35 ++++++++++++++++-----
 src/include/common/file_utils.h             |  3 +-
 7 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 9a91830783..53c6e86a80 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -3183,6 +3184,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3377,6 +3379,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3428,7 +3433,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3491,7 +3496,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index e41a6cfbda..43526e3246 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index b86bc417c9..06ccaacfda 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5f1f62f1db..80a137be4e 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 67a86bb4c5..ceb1c3ac6d 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index dc90f35ae1..65cdf07ae7 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -94,7 +94,8 @@ do_syncfs(const char *path)
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -184,10 +185,11 @@ sync_pgdata(const char *pg_data,
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false, NULL);
+				walkdir(pg_data, pre_sync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false, NULL);
-				walkdir(pg_tblspc, pre_sync_fname, true, NULL);
+					walkdir(pg_wal, pre_sync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -200,10 +202,11 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false, NULL);
+				walkdir(pg_data, fsync_fname, false, &sync_data_files);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false, NULL);
-				walkdir(pg_tblspc, fsync_fname, true, NULL);
+					walkdir(pg_wal, fsync_fname, false, &sync_data_files);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
 			}
 			break;
 	}
@@ -296,7 +299,23 @@ walkdir(const char *path,
 				(*action) (subpath, false, arg);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false, arg);
+
+				/*
+				 * XXX: Checking here for the "sync_data_files" case is quite
+				 * hacky, but it's not clear how to do better.  Another option
+				 * would be to send "de" down to the function, but that would
+				 * introduce a huge number of function pointer calls and
+				 * directory reads that we are trying to avoid.
+				 */
+#ifdef PG_FLUSH_DATA_WORKS
+				if ((action != pre_sync_fname && action != fsync_fname) ||
+#else
+				if (action != fsync_fname ||
+#endif
+					!arg || *((bool *) arg) ||
+					strcmp(de->d_name, "base") != 0)
+					walkdir(subpath, action, false, arg);
+
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index c328f56a85..3743caa63e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,8 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir, void *arg);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method,
+						bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v2-0005-Export-pre_sync_fname.patch
From c911cee97239542e14a489ed84ab3b46a047659e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 09:52:19 -0600
Subject: [PATCH v2 5/8] Export pre_sync_fname().

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

A follow-up commit will use this function to hint to the file system
that we want a file's data on disk soon, so that subsequent calls to
fsync() are faster.
---
 src/common/file_utils.c         | 18 +++++-------------
 src/include/common/file_utils.h |  1 +
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 65cdf07ae7..5c201ec6e8 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,10 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir, void *arg);
-#endif
-
 #ifdef HAVE_SYNCFS
 
 /*
@@ -307,11 +303,7 @@ walkdir(const char *path,
 				 * introduce a huge number of function pointer calls and
 				 * directory reads that we are trying to avoid.
 				 */
-#ifdef PG_FLUSH_DATA_WORKS
 				if ((action != pre_sync_fname && action != fsync_fname) ||
-#else
-				if (action != fsync_fname ||
-#endif
 					!arg || *((bool *) arg) ||
 					strcmp(de->d_name, "base") != 0)
 					walkdir(subpath, action, false, arg);
@@ -348,11 +340,12 @@ walkdir(const char *path,
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir, void *arg)
 {
+#ifndef PG_FLUSH_DATA_WORKS
+	return 0;
+#else
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -380,9 +373,8 @@ pre_sync_fname(const char *fname, bool isdir, void *arg)
 
 	(void) close(fd);
 	return 0;
-}
-
 #endif							/* PG_FLUSH_DATA_WORKS */
+}
 
 /*
  * fsync_fname -- Try to fsync a file or directory
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 3743caa63e..e7a34d4c4e 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -43,6 +43,7 @@ extern int	fsync_parent_path(const char *fname);
 extern void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir, void *arg),
 					bool process_symlinks, void *arg);
+extern int	pre_sync_fname(const char *fname, bool isdir, void *arg);
 #endif
 
 extern PGFileType get_dirent_type(const char *path,
-- 
2.39.5 (Apple Git-154)

v2-0006-In-pg_upgrade-s-catalog-swap-mode-only-sync-files.patch
From 6565fea925c2bb51a03428fa0a40728588220eed Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:40:43 -0600
Subject: [PATCH v2 6/8] In pg_upgrade's catalog-swap mode, only sync files as
 necessary.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

In this mode, it can be much faster to use "--sync-method fsync",
which now skips synchronizing data files moved from the old cluster
(those are assumed to have been synchronized before pg_upgrade ran).
---
 src/bin/pg_upgrade/pg_upgrade.c    |  6 ++--
 src/bin/pg_upgrade/relfilenumber.c | 52 ++++++++++++++++++++++++++++++
 2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 663235816f..f5946ac89a 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -210,10 +210,12 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s %s",
 				  new_cluster.bindir,
 				  new_cluster.pgdata,
-				  user_opts.sync_method);
+				  user_opts.sync_method,
+				  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+				  "--no-sync-data-files" : "");
 		check_ok();
 	}
 
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 9d8fce3c4a..dcca4bb2e7 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -25,8 +25,49 @@ typedef struct move_catalog_file_context
 	FileNameMap *maps;
 	int			size;
 	char	   *target;
+	bool		sync_moved;
 } move_catalog_file_context;
 
+#define SYNC_QUEUE_MAX_LEN (1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+		fsync_fname(sync_queue[i], false, NULL);
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false, NULL);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
 /*
  * transfer_all_new_tablespaces()
  *
@@ -138,6 +179,8 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	sync_queue_sync_all();
 }
 
 static int
@@ -195,6 +238,9 @@ move_catalog_file(const char *fname, bool isdir, void *arg)
 	if (rename(fname, dst) != 0)
 		pg_fatal("could not rename \"%s\" to \"%s\": %m", fname, dst);
 
+	if (context->sync_moved)
+		sync_queue_push(dst);
+
 	return 0;
 }
 
@@ -250,10 +296,12 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 	context.maps = maps;
 	context.size = size;
 	context.target = old_cat;
+	context.sync_moved = false;
 	walkdir(new_dat, move_catalog_file, false, &context);
 
 	/* move catalogs in moved-aside data dir in place */
 	context.target = new_dat;
+	context.sync_moved = (sync_method != DATA_DIR_SYNC_METHOD_SYNCFS);
 	walkdir(moved_dat, move_catalog_file, false, &context);
 
 	/* no need to sync things individually if we are going to syncfs() later */
@@ -265,6 +313,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
 	if (fsync_fname(old_cat, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+	if (fsync_fname(new_dat, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
 
 	/*
 	 * XXX: We could instead fsync() these directories once at the end instead
@@ -276,6 +326,8 @@ do_catalog_transfer(FileNameMap *maps, int size, char *old_tablespace)
 		pg_fatal("could not synchronize directory \"%s\": %m", old_tblspc);
 	if (fsync_fname(moved_tblspc, true, NULL) != 0)
 		pg_fatal("could not synchronize directory \"%s\": %m", moved_tblspc);
+	if (fsync_fname(new_tblspc, true, NULL) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_tblspc);
 }
 
 /*
-- 
2.39.5 (Apple Git-154)

v2-0007-Add-sequence-data-flag-to-pg_dump.patch
From d60d05908793a1850c6e34979924d1502449bbfc Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:46:11 -0600
Subject: [PATCH v2 7/8] Add --sequence-data flag to pg_dump.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

This flag can be used to optionally dump the sequence data even
when --schema-only is used.  It is primarily intended for use in a
follow-up commit that will cause sequence data files to be carried
over from the old cluster in pg_upgrade's new catalog-swap mode.
---
 src/bin/pg_dump/pg_dump.c                   | 9 +--------
 src/bin/pg_upgrade/dump.c                   | 2 +-
 src/test/modules/test_pg_dump/t/001_base.pl | 2 +-
 3 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index add7f16c90..a0810aaefd 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -507,6 +507,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -774,14 +775,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8345f55be8..8453722833 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -53,7 +53,7 @@ generate_old_dump(void)
 
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index e2579e29cd..46231c93f1 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--binary-upgrade', '--sequence-data', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v2-0008-Avoid-copying-sequence-files-in-pg_upgrade-s-cata.patch
From 0d7af5d0333b71d3f583077d3c2bd6cdb06fbf79 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 6 Nov 2024 10:53:40 -0600
Subject: [PATCH v2 8/8] Avoid copying sequence files in pg_upgrade's
 catalog-swap mode.

THIS IS A PROOF OF CONCEPT AND IS NOT READY FOR SERIOUS REVIEW.

On clusters with many sequences, this can further reduce the amount
of time required to wire up the data files in the new cluster.  If
the sequence data file format changes, this optimization cannot be
used, but that seems rare enough.
---
 src/bin/pg_upgrade/dump.c | 8 +++++++-
 src/bin/pg_upgrade/info.c | 6 +++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 8453722833..d5a81cc29c 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -51,10 +51,16 @@ generate_old_dump(void)
 		snprintf(sql_file_name, sizeof(sql_file_name), DB_DUMP_FILE_MASK, old_db->db_oid);
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
+		/*
+		 * XXX: We need to be sure that the sequence data format hasn't
+		 * changed.
+		 */
 		parallel_exec_prog(log_file_name, NULL,
 						   "\"%s/pg_dump\" %s --schema-only --quote-all-identifiers "
-						   "--binary-upgrade --sequence-data --format=custom %s --no-sync --file=\"%s/%s\" %s",
+						   "--binary-upgrade %s --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   log_opts.dumpdir,
 						   sql_file_name, escaped_connstr.data);
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index f83ded89cb..786d17e32f 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -483,6 +483,8 @@ get_rel_infos_query(void)
 	 * pg_largeobject contains user data that does not appear in pg_dump
 	 * output, so we have to copy that system table.  It's easiest to do that
 	 * by treating it as a user table.
+	 *
+	 * XXX: We need to be sure that the sequence data format hasn't changed.
 	 */
 	appendPQExpBuffer(&query,
 					  "WITH regular_heap (reloid, indtable, toastheap) AS ( "
@@ -490,7 +492,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +501,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_CATALOG_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
-- 
2.39.5 (Apple Git-154)

#7Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#6)
4 attachment(s)
Re: optimize file transfer in pg_upgrade

I've spent quite a bit of time recently trying to get this patch set into a
reasonable state. It's still a little rough around the edges, and the code
for the generated scripts is incomplete, but I figured I'd at least get
some CI testing going.

--
nathan

Attachments:

v3-0001-initdb-Add-no-sync-data-files.patch
From 0af23114cfe5d00ab0b69ff804bb92d58d485adb Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v3 1/4] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 20 +++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..14c401b9a99 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,26 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories and the
+        database directories themselves, i.e., everything in the
+        <filename>base</filename> subdirectory and any other tablespace
+        directories.  Other files, such as those in <literal>pg_wal</literal>
+        and <literal>pg_xact</literal>, will still be synchronized unless the
+        <option>--no-sync</option> option is also specified.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index dc0c805137a..bc94c114d27 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index e1acb6e933d..3bbd8f616cf 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5864ec574fb..c0ec09485c3 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)
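
A minimal sketch of the intended sync-only invocation (assuming this patch
is applied; "$PGDATA" is just a placeholder for the target data directory,
and the pg_upgrade side lands later in this series):

    initdb --sync-only --sync-method fsync --no-sync-data-files "$PGDATA"

This syncs pg_wal/, pg_xact/, and the rest of the data directory while
leaving base/ and any other tablespace directories for the caller to handle.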

v3-0002-pg_dump-Add-sequence-data.patch
From d344dfcc9b96253702025e551ee3e8dd720bb0d6 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v3 2/4] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 1975054d7bf..b05f16995c3 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,6 +1289,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4f4ad2ee150..f63215eb3f9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -517,6 +517,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,14 +804,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index c7bffc1b045..8ae6c5374fc 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index 9b2a90b0469..27c6c2ab0f3 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--sequence-data', '--binary-upgrade', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v3-0003-Add-new-frontend-functions-for-durable-file-opera.patch
From 04063a995759c9f32bd87b0155c68a2c5fb346ed Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 26 Feb 2025 11:44:36 -0600
Subject: [PATCH v3 3/4] Add new frontend functions for durable file
 operations.

This commit exports the existing pre_sync_fname() function and adds
durable_mkdir_p() and durable_rename_dir() for use in frontend
programs.  A follow-up commit will use this to help optimize
pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/common/file_utils.c         | 55 +++++++++++++++++++++++++++------
 src/include/common/file_utils.h |  3 ++
 2 files changed, 49 insertions(+), 9 deletions(-)

diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..a5a03abd7ca 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -26,6 +26,7 @@
 
 #include "common/file_utils.h"
 #ifdef FRONTEND
+#include "common/file_perm.h"
 #include "common/logging.h"
 #endif
 #include "common/relpath.h"
@@ -45,9 +46,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +350,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +386,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
@@ -539,6 +536,46 @@ durable_rename(const char *oldfile, const char *newfile)
 	return 0;
 }
 
+/*
+ * durable_rename_dir: rename(2) wrapper for directories, issuing fsyncs
+ * required for durability.
+ */
+int
+durable_rename_dir(const char *olddir, const char *newdir)
+{
+	if (fsync_fname(olddir, true) != 0 ||
+		fsync_parent_path(olddir) != 0 ||
+		fsync_parent_path(newdir) != 0)
+		return -1;
+
+	if (rename(olddir, newdir) != 0)
+		return -1;
+
+	if (fsync_fname(newdir, true) != 0 ||
+		fsync_parent_path(olddir) != 0 ||
+		fsync_parent_path(newdir) != 0)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * durable_mkdir_p: pg_mkdir_p() wrapper, issuing fsyncs required for
+ * durability.
+ */
+int
+durable_mkdir_p(char *newdir)
+{
+	if (pg_mkdir_p(newdir, pg_dir_create_mode) && errno != EEXIST)
+		return -1;
+
+	if (fsync_fname(newdir, true) != 0 ||
+		fsync_parent_path(newdir) != 0)
+		return -1;
+
+	return 0;
+}
+
 #endif							/* FRONTEND */
 
 /*
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..7d253a4cb51 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,11 +33,14 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
+extern int	durable_rename_dir(const char *olddir, const char *newdir);
+extern int	durable_mkdir_p(char *newdir);
 extern int	fsync_parent_path(const char *fname);
 #endif
 
-- 
2.39.5 (Apple Git-154)

v3-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From c8540d235c0dc6cac817a3b9f3336c3336af5886 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Fri, 28 Feb 2025 13:00:50 -0600
Subject: [PATCH v3 4/4] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster.  For this reason,
pg_upgrade generates a script to perform the necessary steps.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  69 +++++-
 src/bin/pg_upgrade/.gitignore      |   2 +
 src/bin/pg_upgrade/Makefile        |   2 +-
 src/bin/pg_upgrade/check.c         |  82 ++++++-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  16 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |   4 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   4 +-
 src/bin/pg_upgrade/relfilenumber.c | 374 +++++++++++++++++++++++++++++
 11 files changed, 558 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7bdd85c5cff..6ca20f19ec2 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, this mode complicates reverting to the old cluster.  For
+        this reason, <application>pg_upgrade</application> generates a script
+        to perform the necessary steps.  See
+        <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +558,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but like link
+     mode, you will not be able to access your old cluster once you start the
+     new cluster after the upgrade.  Swap mode also requires that the old and
+     new cluster data directories be in the same file system.
     </para>
 
     <para>
@@ -889,6 +921,41 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the data directories
+        and their files might be moved between the old and new clusters:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborted before moving any data
+           directories or their files, the old cluster was unmodified; it can
+           be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If you did <emphasis>not</emphasis> start the new cluster, the
+           content of the database files was unmodified, but the data
+           directories and their files were moved between the old and new
+           clusters.  To reuse the old cluster, run the script that
+           <command>pg_upgrade</command> reported before the file transfer step.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If you did start the new cluster, it has written to the files, and
+           it is unsafe to use the old cluster.  The old cluster will need to be
+           restored from backup in this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/.gitignore b/src/bin/pg_upgrade/.gitignore
index a66166ea0fa..ea3a0046e51 100644
--- a/src/bin/pg_upgrade/.gitignore
+++ b/src/bin/pg_upgrade/.gitignore
@@ -3,6 +3,8 @@
 /delete_old_cluster.sh
 /delete_old_cluster.bat
 /reindex_hash.sql
+/revert_to_old_cluster.sh
+/revert_to_old_cluster.bat
 # Generated by test suite
 /log/
 /tmp_check/
diff --git a/src/bin/pg_upgrade/Makefile b/src/bin/pg_upgrade/Makefile
index f83d2b5d309..67ac34443af 100644
--- a/src/bin/pg_upgrade/Makefile
+++ b/src/bin/pg_upgrade/Makefile
@@ -53,7 +53,7 @@ uninstall:
 clean distclean:
 	rm -f pg_upgrade$(X) $(OBJS)
 	rm -rf delete_old_cluster.sh log/ tmp_check/ \
-	       reindex_hash.sql
+	       reindex_hash.sql revert_to_old_cluster.sh
 
 export with_icu
 
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88db8869b6e..9d27097ad94 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
@@ -928,6 +955,8 @@ check_for_new_tablespace_dir(void)
  * create_script_for_old_cluster_deletion()
  *
  *	This is particularly useful for tablespace deletion.
+ *
+ * XXX: DO WE NEED TO MODIFY THIS FOR SWAP MODE?
  */
 void
 create_script_for_old_cluster_deletion(char **deletion_script_file_name)
@@ -1046,6 +1075,57 @@ create_script_for_old_cluster_deletion(char **deletion_script_file_name)
 }
 
 
+/*
+ * create_script_for_swap_revert()
+ *
+ * Reverting to the old cluster when --swap is used is complicated, so we
+ * generate a script to make it easy.
+ */
+void
+create_script_for_swap_revert(void)
+{
+	char	   *script;
+	FILE	   *fd;
+
+	script = psprintf("%srevert_to_old_cluster.%s", SCRIPT_PREFIX, SCRIPT_EXT);
+
+	prep_status("Creating script to revert to old cluster");
+
+	if ((fd = fopen_priv(script, "w")) == NULL)
+		pg_fatal("could not open file \"%s\": %m", script);
+
+#ifndef WIN32
+	/* add shebang header */
+	fprintf(fd, "#!/bin/sh\n\n");
+#endif
+
+	/* handle default tablespace */
+	/* TODO */
+
+	/* handle alternate tablespaces */
+	for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+	{
+		/* TODO */
+	}
+
+	fclose(fd);
+
+#ifndef WIN32
+	if (chmod(script, S_IRWXU) != 0)
+		pg_fatal("could not add execute permission to file \"%s\": %m", script);
+#endif
+
+	check_ok();
+
+	/* report location of script to user */
+	pg_log(PG_REPORT, "\n"
+		   "    To revert to the old cluster, run this script before\n"
+		   "    starting the new cluster:\n"
+		   "        %s",
+		   script);
+}
+
+
 /*
  *	check_is_install_user()
  *
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..4fe784e8b94 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,18 +434,28 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
 
+	/* only used for --link and --swap */
+	Assert(transfer_mode == TRANSFER_MODE_LINK ||
+		   transfer_mode == TRANSFER_MODE_SWAP);
+
 	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
 	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.linktest", new_cluster.pgdata);
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..a538d407f74 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -212,8 +212,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f4e375d27c7..9403c0ac78f 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -385,6 +386,7 @@ void		output_completion_banner(char *deletion_script_file_name);
 void		check_cluster_versions(void);
 void		check_cluster_compatibility(void);
 void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
+void		create_script_for_swap_revert(void);
 
 
 /* controldata.c */
@@ -423,7 +425,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..059ef98350f 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,91 @@
 
 #include <sys/stat.h>
 
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +121,17 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We generate the revert script for this mode before starting
+			 * file transfer so that it can be used in the case of a crash
+			 * halfway through.
+			 */
+			create_script_for_swap_revert();
+
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +216,271 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directories from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_cat", moved_tblspc, db_oid);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (durable_mkdir_p(moved_tblspc) != 0)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (durable_mkdir_p(old_cat) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (durable_rename_dir(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (durable_rename_dir(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/*
+	 * Move the old catalog files aside.
+	 */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key;
+
+			key.relfilenumber = (RelFileNumber) rfn;
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/*
+	 * Move the new catalog files into place.
+	 */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key;
+
+			key.relfilenumber = (RelFileNumber) rfn;
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster was
+		 * last shut down.
+		 */
+		sync_queue_push(dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/*
+	 * Ensure the directory entries are persisted to disk.
+	 */
+	if (fsync_fname(old_cat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", old_cat);
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_fname(moved_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", moved_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +501,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +629,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
-- 
2.39.5 (Apple Git-154)

#8Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#1)
Re: optimize file transfer in pg_upgrade

On Wed, Nov 6, 2024 at 5:07 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

these user relation files will have the same name. Therefore, it can be
much faster to instead move the entire data directory from the old cluster
to the new cluster and to then swap the catalog relation files.

This is a cool idea.

Another interesting problem is that pg_upgrade currently doesn't transfer
the sequence data files. Since v10, we've restored these via pg_restore.
I believe this was originally done for the introduction of the pg_sequence
catalog, which changed the format of sequence tuples. In the new
catalog-swap mode I am proposing, this means we need to transfer all the
pg_restore-generated sequence data files. If there are many sequences, it
can be difficult to determine which transfer mode and synchronization
method will be faster. Since sequence tuple modifications are very rare, I
think the new catalog-swap mode should just use the sequence data files
from the old cluster whenever possible.

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#8)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 2:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

Sorry, I meant: pg_dump --binary-upgrade --binary-upgrade-even-for-sequences

i.e. pg_upgrade could decide which way to ask pg_dump to do it,
depending on versions and flags.

--
Robert Haas
EDB: http://www.enterprisedb.com

#10Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#9)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 02:41:22PM -0500, Robert Haas wrote:

On Fri, Feb 28, 2025 at 2:40 PM Robert Haas <robertmhaas@gmail.com> wrote:

Maybe we should rethink the decision not to transfer relfilenodes for
sequences. Or have more than one way to do it. pg_upgrade
--binary-upgrade --binary-upgrade-even-for-sequences, or whatever.

Sorry, I meant: pg_dump --binary-upgrade --binary-upgrade-even-for-sequences

i.e. pg_upgrade could decide which way to ask pg_dump to do it,
depending on versions and flags.

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

--
nathan

#11Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#10)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 3:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

Ah. Perhaps I should have read the thread more carefully before
commenting. Sounds good, at any rate.

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

I think it's fine. If somebody comes along and says "hey, when v23
came out Nathan's feature only sped up pg_upgrade by 2x instead of 3x
like it did for v22, so Nathan is a bad person," I think we can fairly
reply "thanks for sharing your opinion, feel free not to use the
feature and run at 1x speed". There's no rule saying that every
optimization must always produce the maximum possible benefit in every
scenario. We're just concerned about regressions, and "only delivers
some of the speedup if the sequence format has changed on disk" is not
a regression.

--
Robert Haas
EDB: http://www.enterprisedb.com

#12Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#11)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 03:37:49PM -0500, Robert Haas wrote:

On Fri, Feb 28, 2025 at 3:01 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

That's exactly where I landed (see v3-0002). I haven't measured whether
transferring relfilenodes or dumping the sequence data is faster for the
existing modes, but for now I've left those alone, i.e., they still dump
sequence data. The new "swap" mode just uses the old cluster's sequence
files, and I've disallowed using swap mode for upgrades from <v10 to avoid
the sequence tuple format change (along with other incompatible changes).

Ah. Perhaps I should have read the thread more carefully before
commenting. Sounds good, at any rate.

On the contrary, I'm glad you independently came to the same conclusion.

I'll admit I'm a bit concerned that this will cause problems if and when
someone wants to change the sequence tuple format again. But that hasn't
happened for a while, AFAIK nobody's planning to change it, and even if it
does happen, we just need to have my proposed new mode transfer the
sequence files like it transfers the catalog files. That will make this
mode slower, especially if you have a ton of sequences, but maybe it'll
still be a win in most cases. Of course, we probably will need to have
pg_upgrade handle other kinds of format changes, too, but IMHO it's still
worth trying to speed up pg_upgrade despite the potential future
complexities.

I think it's fine. If somebody comes along and says "hey, when v23
came out Nathan's feature only sped up pg_upgrade by 2x instead of 3x
like it did for v22, so Nathan is a bad person," I think we can fairly
reply "thanks for sharing your opinion, feel free not to use the
feature and run at 1x speed". There's no rule saying that every
optimization must always produce the maximum possible benefit in every
scenario. We're just concerned about regressions, and "only delivers
some of the speedup if the sequence format has changed on disk" is not
a regression.

Cool. I appreciate the design feedback.

--
nathan

#13Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#12)
Re: optimize file transfer in pg_upgrade

On Fri, Feb 28, 2025 at 02:51:27PM -0600, Nathan Bossart wrote:

Cool. I appreciate the design feedback.

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode. In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup. I believe that's worth considering for the
following reasons:

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail during
or after file transfer, and I'm hoping to get some real data about that
in the near future. Has anyone else dealt with such a failure? I
suspect that failures during file transfer are typically due to OS
crashes, power losses, etc., and hopefully those are rare.

* I've spent quite some time trying to generate a portable script, but it's
quite complicated and difficult to reason about its correctness. And I
haven't even started on the Windows version. Leaving this part out would
simplify the patch set quite a bit.

* If we give up the idea of reverting to the old cluster, we can also avoid
a bunch of intermediate fsync() calls, which I only included to help
reason about the state of the files in case of a failure halfway through.
This might not add up to much, but it's at least another area of
simplification.

Of course, rollback would still be possible, but you'd really need to
understand what "swap" mode does behind the scenes to do so safely. In any
case, I'm growing skeptical that a probably-not-super-well-tested script
that extremely few people will need and fewer will use is worth the
complexity.

--
nathan

#14Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#13)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 2:42 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

Of course, rollback would still be possible, but you'd really need to
understand what "swap" mode does behind the scenes to do so safely. In any
case, I'm growing skeptical that a probably-not-super-well-tested script
that extremely few people will need and fewer will use is worth the
complexity.

I don't have a super-strong view on what the right thing to do is
here, but I'm definitely in favor of not doing something half-baked.
If you think the revert script is going to suck, then let's not have
it at all and just be clear about what the requirements for using this
mode are.

One serious danger that you didn't mention here is that, if pg_upgrade
does fail, you may well try several times. And if you forget the
revert script at some point, or run it more than once, or run the
wrong version, you will be in trouble. I feel like this is something
someone could very easily mess up even if in theory it works
perfectly. Upgrades are often stressful times.

--
Robert Haas
EDB: http://www.enterprisedb.com

#15Greg Sabino Mullane
htamfids@gmail.com
In reply to: Nathan Bossart (#13)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 2:43 PM Nathan Bossart <nathandbossart@gmail.com>
wrote:

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode. In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup.

I think that's a fair requirement. And like Robert, revert scripts make me
nervous.

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail
during or after file transfer, and I'm hoping to get some real data about
that in the near future. Has anyone else dealt with such a failure?

I've seen various failures, but they always get caught quite early.
Certainly early enough to easily abort, fix perms/mounts/etc., then retry.
I think your instinct is correct that this reversion is more trouble than
it's worth. I don't think the pg_upgrade docs mention taking a backup, but
that's always step 0 in my playbook, and that's the rollback plan in the
unlikely event of failure.
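
For what it's worth, "step 0" in that playbook is nothing fancy; something
along these lines, with the paths below being examples only:

    # plain copy of the stopped old cluster
    pg_ctl -D /var/lib/postgresql/17/main stop
    cp -a /var/lib/postgresql/17/main /backups/pg17-pre-upgrade

    # or a base backup taken while the old cluster is still running
    pg_basebackup -D /backups/pg17-basebackup -Ft -X stream -P

Either way, that copy is the rollback plan, not anything pg_upgrade itself
produces.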

Cheers,
Greg

--
Crunchy Data - https://www.crunchydata.com
Enterprise Postgres Software Products & Tech Support

#16Nathan Bossart
nathandbossart@gmail.com
In reply to: Greg Sabino Mullane (#15)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 05, 2025 at 03:40:52PM -0500, Greg Sabino Mullane wrote:

I've seen various failures, but they always get caught quite early.
Certainly early enough to easily abort, fix perms/mounts/etc., then retry.
I think your instinct is correct that this reversion is more trouble than
its worth. I don't think the pg_upgrade docs mention taking a backup, but
that's always step 0 in my playbook, and that's the rollback plan in the
unlikely event of failure.

Thank you, Greg and Robert, for sharing your thoughts. With that, here's
what I'm considering to be a reasonably complete patch set for this
feature. This leaves about a month for rigorous testing and editing, so
I'm hopeful it'll be ready for v18.
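
For anyone who wants to kick the tires: with the patches applied, it's just
the new --swap flag on top of a normal pg_upgrade run, e.g. (the binaries,
data directories, and job count below are placeholders):

    pg_upgrade \
      --old-bindir  /usr/lib/postgresql/17/bin \
      --new-bindir  /usr/lib/postgresql/18/bin \
      --old-datadir /var/lib/postgresql/17/main \
      --new-datadir /var/lib/postgresql/18/main \
      --jobs 8 --sync-method fsync --swap

Note that the docs in the swap patch recommend --sync-method=fsync with
--swap, since syncfs would otherwise spend time on the garbage files left
behind in the old cluster.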

--
nathan

Attachments:

v4-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From 6716e7b16a795911f55432dfd6d3c246aa8fd9fe Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v4 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 20 +++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 89 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..14c401b9a99 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,26 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories and the
+        database directories themselves, i.e., everything in the
+        <filename>base</filename> subdirectory and any other tablespace
+        directories.  Other files, such as those in <literal>pg_wal</literal>
+        and <literal>pg_xact</literal>, will still be synchronized unless the
+        <option>--no-sync</option> option is also specified.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index dc0c805137a..bc94c114d27 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index 5864ec574fb..c0ec09485c3 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -420,7 +420,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v4-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From e6dc183b8a80a32f6ca52b0d21a173d1b291deea Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v4 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 1975054d7bf..b05f16995c3 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1289,6 +1289,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 4f4ad2ee150..f63215eb3f9 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -517,6 +517,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -803,14 +804,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index c7bffc1b045..8ae6c5374fc 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index 9b2a90b0469..27c6c2ab0f3 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			"--file=$tempdir/binary_upgrade.sql", '--schema-only',
-			'--binary-upgrade', '--dbname=postgres',
+			'--sequence-data', '--binary-upgrade', '--dbname=postgres',
 		],
 	},
 	clean => {
-- 
2.39.5 (Apple Git-154)

v4-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From ecd5b53daefd3187195e5a8fbf47e0a8a278bf30 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v4 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  23 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  16 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 358 +++++++++++++++++++++++++++++
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 12 files changed, 505 insertions(+), 31 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 7bdd85c5cff..08278232e71 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88db8869b6e..5405ed7bc8f 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..391ed3e1085 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,11 +751,15 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
 
+	/* only used for --link and --swap */
+	Assert(transfer_mode == TRANSFER_MODE_LINK ||
+		   transfer_mode == TRANSFER_MODE_SWAP);
+
 	/* rename pg_control so old server cannot be accidentally started */
 	prep_status("Adding \".old\" suffix to old global/pg_control");
 
@@ -766,10 +770,15 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..4fe784e8b94 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,18 +434,28 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
 
+	/* only used for --link and --swap */
+	Assert(transfer_mode == TRANSFER_MODE_LINK ||
+		   transfer_mode == TRANSFER_MODE_SWAP);
+
 	snprintf(existing_file, sizeof(existing_file), "%s/PG_VERSION", old_cluster.pgdata);
 	snprintf(new_link_file, sizeof(new_link_file), "%s/PG_VERSION.linktest", new_cluster.pgdata);
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index f4e375d27c7..120c38929d4 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..2abe90dd239 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,261 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directories from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s_moved", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_cat", moved_tblspc, db_oid);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
+			continue;
+
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster
+		 * was last shut down.
+		 */
+		sync_queue_push(dest);
+	}
+
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +484,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +612,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#17Bruce Momjian
bruce@momjian.us
In reply to: Greg Sabino Mullane (#15)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 5, 2025 at 03:40:52PM -0500, Greg Sabino Mullane wrote:

On Wed, Mar 5, 2025 at 2:43 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

One other design point I wanted to bring up is whether we should bother
generating a rollback script for the new "swap" mode.  In short, I'm
wondering if it would be unreasonable to say that, just for this mode, once
pg_upgrade enters the file transfer step, reverting to the old cluster
requires restoring a backup.

I think that's a fair requirement. And, like Robert, I'm nervous about revert
scripts.

* Anecdotally, I'm not sure I've ever actually seen pg_upgrade fail
during or after file transfer, and I'm hoping to get some real data about
that in the near future.  Has anyone else dealt with such a failure?

I've seen various failures, but they always get caught quite early. Certainly
early enough to easily abort, fix perms/mounts/etc., then retry. I think your
instinct is correct that this reversion is more trouble than it's worth. I don't
think the pg_upgrade docs mention taking a backup, but that's always step 0 in
my playbook, and that's the rollback plan in the unlikely event of failure.

I avoided many optimizations in pg_upgrade for fear that they would lead
to hard-to-detect bugs or breakage from major release changes.
pg_upgrade is probably old enough now (15 years) that we can risk these
optimizations.

--
Bruce Momjian <bruce@momjian.us> https://momjian.us
EDB https://enterprisedb.com

Do not let urgent matters crowd out time for investment in the future.

#18Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#16)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 05, 2025 at 08:34:37PM -0600, Nathan Bossart wrote:

Thank you, Greg and Robert, for sharing your thoughts. With that, here's
what I'm considering to be a reasonably complete patch set for this
feature. This leaves about a month for rigorous testing and editing, so
I'm hopeful it'll be ready for v18.

Here are my notes after a round of self-review.

0001:
* The documentation does not adequately describe the interaction between
--no-sync-data-files and --sync-method=syncfs.
* I really don't like the exclude_dir hack for skipping the main tablespace
directory, but I haven't thought of anything that seems better.
* I should verify that there are no path separator issues on Windows for the
  exclude_dir hack (see the sketch below). From some quick code analysis, I
  think it should work fine, but I probably ought to test it out to be sure.
* The documentation needs to mention that the tablespace directories
themselves are not synchronized.
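
For the Windows path-separator question above, the kind of check I have in
mind would treat '/' and '\' as equivalent when comparing the excluded
directory against the path being walked.  A minimal sketch (hypothetical
helper, not part of the attached patches):

    static bool
    paths_equal_ignore_sep(const char *a, const char *b)
    {
        for (;; a++, b++)
        {
            /* normalize both separators before comparing */
            char        ca = (*a == '\\') ? '/' : *a;
            char        cb = (*b == '\\') ? '/' : *b;

            if (ca != cb)
                return false;
            if (ca == '\0')
                return true;
        }
    }

walkdir() could then call paths_equal_ignore_sep(exclude_dir, path) instead
of strcmp() for the exclusion test, if testing shows that's actually needed.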

0002:
* The documentation changes are subject to update based on ongoing stats
import/export work.
* Does --statistics-only --sequence-data make any sense? It seems like it
ought to function as expected, but it's hard to see a use-case.

0003:
* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.
* For check_hard_link() and disable_old_cluster(), move the Assert() to an
  "else" block with a pg_fatal() call for sturdiness (sketched at the end of
  these notes).
* I need to do a thorough pass-through on all comments. Many are not
sufficiently detailed.
* The "." and ".." checks in the catalog swap code are redundant and can be
removed.
* The directory for "moved-aside" stuff should be placed within the old
cluster's corresponding tablespace directory so that no changes need to
be made to delete_old_cluster.{sh,bat}.
* Manual testing with non-default tablespaces!
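
To illustrate the sturdier check mentioned above for check_hard_link() and
disable_old_cluster(), I'm thinking of roughly the following shape (a sketch
only; the reports are elided and the exact wording is not final):

    void
    disable_old_cluster(transferMode transfer_mode)
    {
        ...

        if (transfer_mode == TRANSFER_MODE_LINK)
        {
            /* report how to re-enable the old cluster, as today */
        }
        else if (transfer_mode == TRANSFER_MODE_SWAP)
        {
            /* report that the old cluster is no longer safe to start */
        }
        else
            pg_fatal("unrecognized transfer mode");
    }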

Updated patches based on these notes are attached.

--
nathan

Attachments:

v5-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From c0180c868c0e08c088ba40dbba071ce100a67b44 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v5 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v5-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From 7db4f4ff914d454c28ef9439e76fd16a0cd094a4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v5 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

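To make the new flag's effect concrete, here is an illustrative TAP-style
check (a hedged sketch, not part of the posted patches; it assumes the
standard PostgreSQL::Test::Cluster and PostgreSQL::Test::Utils helpers): a
plain --schema-only dump omits sequence values, while adding
--sequence-data brings the usual setval() calls back.

use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

my $tempdir = PostgreSQL::Test::Utils::tempdir;

# One sequence whose current value should appear in the dump.
my $node = PostgreSQL::Test::Cluster->new('seqdata');
$node->init;
$node->start;
$node->safe_psql('postgres', "CREATE SEQUENCE s; SELECT nextval('s');");

# Schema-only dump, but ask for sequence data as well.
command_ok(
	[
		'pg_dump', '--schema-only', '--sequence-data',
		'--file' => "$tempdir/schema_with_seq.sql",
		'-d' => $node->connstr('postgres'),
	],
	'pg_dump --schema-only --sequence-data');

# The plain-format dump should now contain the setval() call.
my $dump = slurp_file("$tempdir/schema_with_seq.sql");
like($dump, qr/setval/, 'sequence value included despite --schema-only');

$node->stop;
done_testing();
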
v5-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 26799878d4d4e0ad12b4e42c6d8e6b11296126a1 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v5 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 364 +++++++++++++++++++++++++++++
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 13 files changed, 510 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 9ef7a84eed0..6deee1607ec 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -888,6 +919,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index d32fc3d88ec..81c91fc2912 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..a87e6156911 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,267 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directory from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\": %m", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\": %m", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\": %m", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster
+		 * was last shut down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +490,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +618,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#19Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#18)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this as part of the normal test run?

--
Robert Haas
EDB: http://www.enterprisedb.com

#20Nathan Bossart
nathandbossart@gmail.com
In reply to: Robert Haas (#19)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 04:04:45PM -0400, Robert Haas wrote:

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this which is part of the normal test run?

That's what I set out to do before I discovered PG_TEST_PG_UPGRADE_MODE.
The commit message for b059a24 seemed to indicate that we don't want to
automatically test all supported modes, but I agree that it would be nice
to have some basic coverage for --swap on CI/buildfarm regardless of
PG_TEST_PG_UPGRADE_MODE. How about we add a simple TAP test (attached)?
I still plan on switching a buildfarm animal to --swap for more in-depth
testing.
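
For the archives, roughly what I have in mind looks like the following
(an abbreviated, illustrative sketch only; the attached 006_swap.pl is
the authoritative version, and the setup details below are approximate):

use strict;
use warnings FATAL => 'all';
use PostgreSQL::Test::Cluster;
use PostgreSQL::Test::Utils;
use Test::More;

# Old cluster with a little data to carry across the upgrade.
my $old = PostgreSQL::Test::Cluster->new('old');
$old->init;
$old->start;
$old->safe_psql('postgres',
	'CREATE TABLE t (a int); INSERT INTO t VALUES (1), (2), (3);');
$old->stop;

# The new cluster starts out empty; pg_upgrade fills it in.
my $new = PostgreSQL::Test::Cluster->new('new');
$new->init;

command_ok(
	[
		'pg_upgrade', '--no-sync',
		'-d' => $old->data_dir,
		'-D' => $new->data_dir,
		'-b' => $old->config_data('--bindir'),
		'-B' => $new->config_data('--bindir'),
		'-s' => $new->host,
		'-p' => $old->port,
		'-P' => $new->port,
		'--swap',
	],
	'pg_upgrade --swap');

$new->start;
is( $new->safe_psql('postgres', 'SELECT count(*) FROM t'),
	'3', 'data survived the swap');
$new->stop;

done_testing();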

--
nathan

Attachments:

v6-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From 0a228f150d101ef2ffe38c88fe290e313142a2d9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v6 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v6-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From e4c414ce5202efcb86f74f8d41c926575656a527 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v6 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

v6-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 267867927687279840742b76d58580ac5efb45ea Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v6 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/meson.build     |   1 +
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 364 +++++++++++++++++++++++++++++
 src/bin/pg_upgrade/t/006_swap.pl   |  42 ++++
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 15 files changed, 553 insertions(+), 34 deletions(-)
 create mode 100644 src/bin/pg_upgrade/t/006_swap.pl

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 9ef7a84eed0..6deee1607ec 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -888,6 +919,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index d32fc3d88ec..81c91fc2912 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..a4a5eb82690 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_swap.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..a87e6156911 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,267 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directory from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the old cluster was
+		 * last shut down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +490,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +618,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_swap.pl b/src/bin/pg_upgrade/t/006_swap.pl
new file mode 100644
index 00000000000..5ab0cc1dc00
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_swap.pl
@@ -0,0 +1,42 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for --swap
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Initialize old and new clusters
+my $old = PostgreSQL::Test::Cluster->new('old');
+my $new = PostgreSQL::Test::Cluster->new('new');
+$old->init();
+$new->init();
+
+$old->start;
+$old->safe_psql('postgres', "CREATE TABLE test AS SELECT generate_series(1, 5432)");
+$old->stop;
+
+# pg_upgrade should be successful.
+command_ok(
+	[
+		'pg_upgrade', '--no-sync',
+		'--old-datadir' => $old->data_dir,
+		'--new-datadir' => $new->data_dir,
+		'--old-bindir' => $old->config_data('--bindir'),
+		'--new-bindir' => $new->config_data('--bindir'),
+		'--socketdir' => $new->host,
+		'--old-port' => $old->port,
+		'--new-port' => $new->port,
+		'--swap'
+	],
+	'run of pg_upgrade --swap');
+
+$new->start;
+my $result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test");
+is($result, '5432', 'table data after pg_upgrade --swap');
+$new->stop;
+
+done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)
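To illustrate the flush-hint-then-fsync pattern that the sync_queue_* helpers
above rely on (each file is first handed to the kernel as a cheap writeback
hint via pre_sync_fname(), and the blocking fsync() calls are issued later in
a batch), here is a minimal standalone sketch.  This is not the patch's code:
it assumes Linux's sync_file_range() for the hint, and the relation file
names are hypothetical.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Ask the kernel to start writing the file back, without waiting. */
static void
hint_writeback(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd < 0)
		return;				/* ignore unreadable files */
#if defined(__linux__)
	(void) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
#endif
	close(fd);
}

/* Block until the file is durably on disk. */
static void
fsync_file(const char *path)
{
	int		fd = open(path, O_RDONLY);

	if (fd < 0 || fsync(fd) != 0)
	{
		perror(path);
		exit(1);
	}
	close(fd);
}

int
main(void)
{
	/* hypothetical relation file names */
	const char *files[] = {"16384", "16385", "16386"};
	int		nfiles = sizeof(files) / sizeof(files[0]);

	/* cheap, non-blocking hints first ... */
	for (int i = 0; i < nfiles; i++)
		hint_writeback(files[i]);

	/* ... then the blocking syncs, which should find most pages written */
	for (int i = 0; i < nfiles; i++)
		fsync_file(files[i]);

	return 0;
}

Issuing fsync() immediately after each file instead would serialize the
waits; batching lets the kernel overlap the writeback across files.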

#21Robert Haas
robertmhaas@gmail.com
In reply to: Nathan Bossart (#20)
Re: optimize file transfer in pg_upgrade

On Mon, Mar 17, 2025 at 4:30 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Mon, Mar 17, 2025 at 04:04:45PM -0400, Robert Haas wrote:

On Mon, Mar 17, 2025 at 3:34 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

* Once committed, I should update one of my buildfarm animals to use
PG_TEST_PG_UPGRADE_MODE=--swap.

It would be better if we didn't need a separate buildfarm animal to
test this, because that means you won't get notified by local testing
OR by CI if you break this. Can we instead have one test that checks
this which is part of the normal test run?

That's what I set out to do before I discovered PG_TEST_PG_UPGRADE_MODE.
The commit message for b059a24 seemed to indicate that we don't want to
automatically test all supported modes, but I agree that it would be nice
to have some basic coverage for --swap on CI/buildfarm regardless of
PG_TEST_PG_UPGRADE_MODE. How about we add a simple TAP test (attached),
and I still plan on switching a buildfarm animal to --swap for more
in-depth testing?

The background here is that I'm kind of on the warpath against weird
configurations that we only test on certain buildfarm animals at the
moment, because the result of that is that CI is clean and then the
buildfarm turns red when you commit. That's an unenjoyable experience
for the committer and for everyone who looks at the buildfarm results.
The way to fix it is to stop relying on "rerun all the tests with this
weird mode flag" and rely more on tests that are designed to test that
specific flag and, ideally, that get run by in local testing or at
least by CI.

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

--
Robert Haas
EDB: http://www.enterprisedb.com

#22Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#21)
Re: optimize file transfer in pg_upgrade

Robert Haas <robertmhaas@gmail.com> writes:

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

+1

I'd be particularly allergic to running 002_pg_upgrade.pl multiple
times, as that's one of our most expensive tests, and I flat out
don't believe that expending that many cycles could be justified.
Surely we can test these modes sufficiently in some much cheaper and
more targeted way.

regards, tom lane

#23Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#22)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-18 10:04:41 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

+1

I'd be particularly allergic to running 002_pg_upgrade.pl multiple
times, as that's one of our most expensive tests, and I flat out
don't believe that expending that many cycles could be justified.
Surely we can test these modes sufficiently in some much cheaper and
more targeted way.

+1

It's useful to have coverage of as many object types as possible in pg_upgrade
- hence 002_pg_upgrade.pl. It helps us to find problems in new code that
didn't think about pg_upgrade.

But that doesn't mean that it's a good idea to run all other pg_upgrade tests
the same way, to the contrary - the cost is too high.

Even leaving runtime aside, I have a hard time believing that --link, --clone,
--swap benefit from running the same way as 002_pg_upgrade.pl - the
implementation of those flags is on a lower level and works the same across
e.g. different index AMs.

I'd go so far as to say that 002_pg_upgrade.pl-style testing actually makes it
*harder* to diagnose problems related to things like --link, because there are
no targeted tests, just a huge set of things that might let you infer the bug
if you spend a lot of time.

Greetings,

Andres Freund

#24Álvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Robert Haas (#21)
Re: optimize file transfer in pg_upgrade

On 2025-Mar-18, Robert Haas wrote:

The background here is that I'm kind of on the warpath against weird
configurations that we only test on certain buildfarm animals at the
moment, because the result of that is that CI is clean and then the
buildfarm turns red when you commit. That's an unenjoyable experience
for the committer and for everyone who looks at the buildfarm results.
The way to fix it is to stop relying on "rerun all the tests with this
weird mode flag" and rely more on tests that are designed to test that
specific flag and, ideally, that get run by in local testing or at
least by CI.

FWIW this is exactly the rationale that got me writing an email on
Ashutosh's thread for a new pg_dump/restore test under
002_pg_upgrade.pl, whereby I was saying that we should not hide it
behind PG_TEST_EXTRA which almost nobody would remember to use. But I
discarded that draft, because that had actually been Ashutosh's idea at
some point in the thread and had been discarded because of the runtime
increase it'd cause. But, somehow, I still don't believe the theory
that it's such a bad idea to add a few seconds so that we have such a
comprehensive pg_dump test, with much less programmer overhead than
pg_dump's own weird enormous test script.

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Cada quien es cada cual y baja las escaleras como quiere" (JMSerrat)

#25Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#23)
1 attachment(s)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 10:12:51AM -0400, Andres Freund wrote:

On 2025-03-18 10:04:41 -0400, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

I'm not quite sure what the best thing to do is for the pg_upgrade
tests in particular, and it may well be best to do as you propose for
now and figure that out later. But I question whether just rerunning
all of those tests with several different mode flags is the right
thing to do. Why, for example, does 005_char_signedness.pl need to be
checked under both --link and --clone? I would guess that there are
one or maybe two tests in src/bin/pg_upgrade/t that need to test
--link and --clone, and they should grow internal loops to do that
(when supported by the local platform) and PG_TEST_PG_UPGRADE_MODE
should go in the garbage.

+1

I'd be particularly allergic to running 002_pg_upgrade.pl multiple
times, as that's one of our most expensive tests, and I flat out
don't believe that expending that many cycles could be justified.
Surely we can test these modes sufficiently in some much cheaper and
more targeted way.

+1

Here is a first sketch at a test that cycles through all the transfer modes
and makes sure they succeed or fail with an error along the lines of "not
supported on this platform." Each test verifies that some very simple
objects make it to the new version, which we could of course expand on.
Would something like this suffice?

--
nathan

Attachments:

0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From 89b27b68194c2be0e1aebdee871e556028cdd5b5 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 12:21:03 -0500
Subject: [PATCH 1/1] Add test for pg_upgrade file transfer modes.

---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 63 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 ++++++++++
 4 files changed, 108 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..77ddf042ce0
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,63 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old');
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	$old->init();
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE test");
+	$old->safe_psql('test', "CREATE SEQUENCE test START 5432");
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test");
+		is($result, '100', "table data after pg_upgrade $mode");
+		$result = $new->safe_psql('test', "SELECT nextval('test')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index bab3f3d2dbe..bda19bbcee2 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

#26Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#25)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-18 12:29:02 -0500, Nathan Bossart wrote:

Here is a first sketch at a test that cycles through all the transfer modes
and makes sure they succeed or fail with an error along the lines of "not
supported on this platform." Each test verifies that some very simple
objects make it to the new version, which we could of course expand on.
Would something like this suffice?

I'd add a few more complications:

- Create and test a relation that was rewritten, to ensure we test the
relfilenode != oid case and one that isn't rewritten.

- Perhaps create a tablespace?

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

Greetings,

Andres Freund

#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#26)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 01:37:02PM -0400, Andres Freund wrote:

I'd add a few more complications:

- Create and test a relation that was rewritten, to ensure we test the
relfilenode != oid case and one that isn't rewritten.

+1

- Perhaps create a tablespace?

+1, I don't think we have much, if any, coverage of pg_upgrade with
non-default tablespaces.

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

It'll also be needed for --swap. We could optionally save the old cluster
for a couple of modes if we really wanted to. *shrug*

I'll work on the first two...

--
nathan

#28Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#27)
Re: optimize file transfer in pg_upgrade

On 2025-03-18 12:47:01 -0500, Nathan Bossart wrote:

On Tue, Mar 18, 2025 at 01:37:02PM -0400, Andres Freund wrote:

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

It'll also be needed for --swap. We could optionally save the old cluster
for a couple of modes if we really wanted to. *shrug*

Don't worry about it, I think the template initdb stuff should make it cheap enough...

#29Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#28)
1 attachment(s)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 01:50:10PM -0400, Andres Freund wrote:

On 2025-03-18 12:47:01 -0500, Nathan Bossart wrote:

On Tue, Mar 18, 2025 at 01:37:02PM -0400, Andres Freund wrote:

- Do we need a new old cluster for each of the modes? That seems like wasted
time? I guess it's required for --link...

It'll also be needed for --swap. We could optionally save the old cluster
for a couple of modes if we really wanted to. *shrug*

Don't worry about it, I think the template initdb stuff should make it cheap enough...

Cool. I realize now why there's poor coverage for pg_upgrade with
tablespaces: you can't upgrade between the same version with tablespaces
(presumably due to the version-specific subdirectory conflict). I don't
know if the regression tests leave around any tablespaces for the
cross-version pg_upgrade tests, but that's probably the best we can do at
the moment.

For now, here's a new version of the test with a rewritten table. I also
tried to fix the expected error regex to handle some of the other error
messages for unsupported modes (as revealed by cfbot).

--
nathan

Attachments:

v2-0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From c4b7816955cfb5d331851e14f9a93cbb182f4d1e Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 12:21:03 -0500
Subject: [PATCH v2 1/1] Add test for pg_upgrade file transfer modes.

---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 67 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 +++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..468591fc486
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old');
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	$old->init();
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb");
+	$old->safe_psql('testdb', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb', "VACUUM FULL test2");
+	$old->safe_psql('testdb', "CREATE SEQUENCE testseq START 5432");
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index bab3f3d2dbe..bda19bbcee2 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#29)
4 attachment(s)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 02:08:42PM -0500, Nathan Bossart wrote:

For now, here's a new version of the test with a rewritten table. I also
tried to fix the expected error regex to handle some of the other error
messages for unsupported modes (as revealed by cfbot).

And here is a new version of the full patch set.

--
nathan

Attachments:

v7-0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From 8b6a5e0148c2f7a663f5003f12ae9461d2b06a5c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 20:58:07 -0500
Subject: [PATCH v7 1/4] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes.  For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 67 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 +++++++++
 4 files changed, 112 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..468591fc486
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old');
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	$old->init();
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb");
+	$old->safe_psql('testdb', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb', "VACUUM FULL test2");
+	$old->safe_psql('testdb', "CREATE SEQUENCE testseq START 5432");
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 05bd94609d4..8759ed2cbba 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

v7-0002-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From 70770a018ef4d800ce5fccc302d164522ff5c278 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v7 2/4] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v7-0003-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From 887ca6a0cb221016a9a366d45f6d3b60c538fe3a Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v7 3/4] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

v7-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 73f9af4c0068c9003e229762ee91efe88949db3d Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v7 4/4] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 364 +++++++++++++++++++++++++++++
 src/bin/pg_upgrade/t/006_modes.pl  |   1 +
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 14 files changed, 511 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 5db761d1ff1..da261619043 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88daa808035..564a9116ca5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..a87e6156911 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,267 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function durably moves the database directory from the old cluster to
+ * the new cluster in preparation for moving the pg_restore-generated catalog
+ * files into place.  Returns false if the database with the given OID does not
+ * have a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		new_tablespace = old_tablespace;
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the user last
+		 * shut it down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +490,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +618,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
index 468591fc486..34f362c3fea 100644
--- a/src/bin/pg_upgrade/t/006_modes.pl
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -63,5 +63,6 @@ test_mode('--clone');
 test_mode('--copy');
 test_mode('--copy-file-range');
 test_mode('--link');
+test_mode('--swap');
 
 done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#31Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#30)
Re: optimize file transfer in pg_upgrade

On Tue, Mar 18, 2025 at 09:14:22PM -0500, Nathan Bossart wrote:

And here is a new version of the full patch set.

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

--
nathan

#32Andres Freund
andres@anarazel.de
In reply to: Nathan Bossart (#30)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-18 21:14:22 -0500, Nathan Bossart wrote:

From 8b6a5e0148c2f7a663f5003f12ae9461d2b06a5c Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 20:58:07 -0500
Subject: [PATCH v7 1/4] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes. For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.

LGTM. I'm sure we could do more than the test does today, but I think it's a
good improvement.

Greetings,

Andres Freund

#33Tom Lane
tgl@sss.pgh.pa.us
In reply to: Nathan Bossart (#31)
Re: optimize file transfer in pg_upgrade

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

Agreed that that shouldn't block this, but we need some kind of
plan for testing it better.

regards, tom lane

#34Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#33)
Re: optimize file transfer in pg_upgrade

Hi,

On 2025-03-19 12:28:33 -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

Agreed that that shouldn't block this, but we need some kind of
plan for testing it better.

Yea, this is really suboptimal.

Shouldn't allow_in_place_tablespaces be sufficient to deal with that scenario?
Or at least it should make it reasonably easy to cope if it doesn't already
suffice?

Greetings,

Andres Freund

#35Nathan Bossart
nathandbossart@gmail.com
In reply to: Andres Freund (#34)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 12:44:38PM -0400, Andres Freund wrote:

On 2025-03-19 12:28:33 -0400, Tom Lane wrote:

Nathan Bossart <nathandbossart@gmail.com> writes:

I'm currently planning to commit this sometime early-ish next week. One
notable loose end is the lack of a pg_upgrade test with a non-default
tablespace, but that is an existing problem that IMHO is best handled
separately (since we can only test it in cross-version upgrades).

Agreed that that shouldn't block this, but we need some kind of
plan for testing it better.

Yea, this is really suboptimal.

Shouldn't allow_in_place_tablespaces be sufficient to deal with that scenario?
Or at least it should make it reasonably easy to cope if it doesn't already
suffice?

Unfortunately, pg_upgrade can't yet handle in-place tablespaces. One
reason is that pg_tablespace_location() returns a relative path for those
(e.g., "pg_tblspc/123456"). We'd also need to adjust init_tablespaces() to
not fail if all the tablespaces are in-place. There may be other reasons,
too. I'm confident we could get it working, but I'm not too excited about
trying to sneak this into v18.
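
For illustration, a minimal SQL sketch of that scenario (the tablespace
name and OID below are made up):

    SET allow_in_place_tablespaces = on;
    CREATE TABLESPACE regress_inplace LOCATION '';
    SELECT pg_tablespace_location(oid)
      FROM pg_tablespace WHERE spcname = 'regress_inplace';
    -- reports a relative path such as 'pg_tblspc/16385' rather than an
    -- absolute directory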

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Does this seem like a reasonable plan for v19?

--
nathan

#36Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#35)
1 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 02:32:01PM -0500, Nathan Bossart wrote:

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Actually, this one was pretty easy to do.

--
nathan

Attachments:

0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain; charset=us-ascii)
From cc9bcd456f0d7d6ab19a89813755a3e76993cfb9 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Tue, 18 Mar 2025 20:58:07 -0500
Subject: [PATCH 1/1] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes.  For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/bin/pg_upgrade/meson.build           |  1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 89 ++++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm | 19 +++++
 src/test/perl/PostgreSQL/Test/Utils.pm   | 25 +++++++
 4 files changed, 134 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..85098e86245
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,89 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	if (defined($ENV{oldinstall}))
+	{
+		$old->init(force_initdb => 1, extra => ['-k']);
+	}
+	else
+	{
+		$old->init();
+	}
+	$new->init();
+
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb1");
+	$old->safe_psql('testdb1', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb1', "VACUUM FULL test2");
+	$old->safe_psql('testdb1', "CREATE SEQUENCE testseq START 5432");
+	if (defined($ENV{oldinstall}))
+	{
+		my $tblspc = PostgreSQL::Test::Utils::tempdir_short();
+		$old->safe_psql('postgres', "CREATE TABLESPACE test_tblspc LOCATION '$tblspc'");
+		$old->safe_psql('postgres', "CREATE DATABASE testdb2 TABLESPACE test_tblspc");
+		$old->safe_psql('postgres', "CREATE TABLE test3 TABLESPACE test_tblspc AS SELECT generate_series(300, 401)");
+		$old->safe_psql('testdb2', "CREATE TABLE test4 AS SELECT generate_series(400, 502)");
+	}
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+		if (defined($ENV{oldinstall}))
+		{
+			$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test3");
+			is($result, '102', "test3 data after pg_upgrade $mode");
+			$result = $new->safe_psql('testdb2', "SELECT COUNT(*) FROM test4");
+			is($result, '103', "test4 data after pg_upgrade $mode");
+		}
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 05bd94609d4..8759ed2cbba 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

#37Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#36)
4 attachment(s)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 04:28:23PM -0500, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 02:32:01PM -0500, Nathan Bossart wrote:

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Actually, this one was pretty easy to do.

And here is yet another new version of the full patch set. I'm planning to
commit 0001 (the new pg_upgrade transfer mode test) tomorrow so that I can
deal with any buildfarm indigestion before committing swap mode. I did run
the test locally for upgrades from v9.6, v13, and v17, but who knows what
unique configurations I've failed to anticipate...
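
(For anyone who wants to reproduce the cross-version runs, here's a rough
sketch with a placeholder install path; the details are in
src/bin/pg_upgrade/TESTING:

	# point the TAP tests at an existing old-version installation
	export oldinstall=/path/to/old/version/install
	cd src/bin/pg_upgrade
	make check PROVE_TESTS='t/006_modes.pl'

Without oldinstall set, the test upgrades between two clusters built from the
current sources and skips the tablespace checks.)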

--
nathan

Attachments:

v8-0001-Add-test-for-pg_upgrade-file-transfer-modes.patch (text/plain)
From 5b5fbd87faac7041ad5dd2defacd29cf1eaf6397 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Mar 2025 20:24:41 -0500
Subject: [PATCH v8 1/4] Add test for pg_upgrade file transfer modes.

This new test checks all of pg_upgrade's file transfer modes.  For
each mode, we verify that pg_upgrade either succeeds (and some test
objects successfully reach the new version) or fails with an error
that indicates the mode is not supported on the current platform.
For cross-version tests, we also check that pg_upgrade transfers
non-default tablespaces correctly.  (Tablespaces can't be tested on
same-version upgrades because of the version-specific subdirectory
conflict, but we might be able to enable such tests once we teach
pg_upgrade how to handle in-place tablespaces.)

Suggested-by: Robert Haas <robertmhaas@gmail.com>
Reviewed-by: Andres Freund <andres@anarazel.de>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 src/bin/pg_upgrade/meson.build           |   1 +
 src/bin/pg_upgrade/t/006_modes.pl        | 101 +++++++++++++++++++++++
 src/test/perl/PostgreSQL/Test/Cluster.pm |  19 +++++
 src/test/perl/PostgreSQL/Test/Utils.pm   |  25 ++++++
 4 files changed, 146 insertions(+)
 create mode 100644 src/bin/pg_upgrade/t/006_modes.pl

diff --git a/src/bin/pg_upgrade/meson.build b/src/bin/pg_upgrade/meson.build
index da84344966a..16cd9247e76 100644
--- a/src/bin/pg_upgrade/meson.build
+++ b/src/bin/pg_upgrade/meson.build
@@ -46,6 +46,7 @@ tests += {
       't/003_logical_slots.pl',
       't/004_subscription.pl',
       't/005_char_signedness.pl',
+      't/006_modes.pl',
     ],
     'test_kwargs': {'priority': 40}, # pg_upgrade tests are slow
   },
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
new file mode 100644
index 00000000000..518e0994145
--- /dev/null
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -0,0 +1,101 @@
+# Copyright (c) 2025, PostgreSQL Global Development Group
+
+# Tests for file transfer modes
+
+use strict;
+use warnings FATAL => 'all';
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+sub test_mode
+{
+	my ($mode) = @_;
+
+	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
+	my $new = PostgreSQL::Test::Cluster->new('new');
+
+	if (defined($ENV{oldinstall}))
+	{
+		# Checksums are now enabled by default, but weren't before 18, so pass
+		# '-k' to initdb on older versions so that upgrades work.
+		$old->init(extra => ['-k']);
+	}
+	else
+	{
+		$old->init();
+	}
+	$new->init();
+
+	# Create a small variety of simple test objects on the old cluster.  We'll
+	# check that these reach the new version after upgrading.
+	$old->start;
+	$old->safe_psql('postgres', "CREATE TABLE test1 AS SELECT generate_series(1, 100)");
+	$old->safe_psql('postgres', "CREATE DATABASE testdb1");
+	$old->safe_psql('testdb1', "CREATE TABLE test2 AS SELECT generate_series(200, 300)");
+	$old->safe_psql('testdb1', "VACUUM FULL test2");
+	$old->safe_psql('testdb1', "CREATE SEQUENCE testseq START 5432");
+
+	# For cross-version tests, we can also check that pg_upgrade handles
+	# tablespaces.
+	if (defined($ENV{oldinstall}))
+	{
+		my $tblspc = PostgreSQL::Test::Utils::tempdir_short();
+		$old->safe_psql('postgres', "CREATE TABLESPACE test_tblspc LOCATION '$tblspc'");
+		$old->safe_psql('postgres', "CREATE DATABASE testdb2 TABLESPACE test_tblspc");
+		$old->safe_psql('postgres', "CREATE TABLE test3 TABLESPACE test_tblspc AS SELECT generate_series(300, 401)");
+		$old->safe_psql('testdb2', "CREATE TABLE test4 AS SELECT generate_series(400, 502)");
+	}
+	$old->stop;
+
+	my $result = command_ok_or_fails_like(
+		[
+			'pg_upgrade', '--no-sync',
+			'--old-datadir' => $old->data_dir,
+			'--new-datadir' => $new->data_dir,
+			'--old-bindir' => $old->config_data('--bindir'),
+			'--new-bindir' => $new->config_data('--bindir'),
+			'--socketdir' => $new->host,
+			'--old-port' => $old->port,
+			'--new-port' => $new->port,
+			$mode
+		],
+		qr/.* not supported on this platform|could not .* between old and new data directories: .*/,
+		qr/^$/,
+		"pg_upgrade with transfer mode $mode");
+
+	# If pg_upgrade was successful, check that all of our test objects reached
+	# the new version.
+	if ($result)
+	{
+		$new->start;
+		$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test1");
+		is($result, '100', "test1 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT COUNT(*) FROM test2");
+		is($result, '101', "test2 data after pg_upgrade $mode");
+		$result = $new->safe_psql('testdb1', "SELECT nextval('testseq')");
+		is($result, '5432', "sequence data after pg_upgrade $mode");
+
+		# For cross-version tests, we should have some objects in a non-default
+		# tablespace.
+		if (defined($ENV{oldinstall}))
+		{
+			$result = $new->safe_psql('postgres', "SELECT COUNT(*) FROM test3");
+			is($result, '102', "test3 data after pg_upgrade $mode");
+			$result = $new->safe_psql('testdb2', "SELECT COUNT(*) FROM test4");
+			is($result, '103', "test4 data after pg_upgrade $mode");
+		}
+		$new->stop;
+	}
+
+	$old->clean_node();
+	$new->clean_node();
+}
+
+test_mode('--clone');
+test_mode('--copy');
+test_mode('--copy-file-range');
+test_mode('--link');
+
+done_testing();
diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 05bd94609d4..8759ed2cbba 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -2801,6 +2801,25 @@ sub command_fails_like
 
 =pod
 
+=item $node->command_ok_or_fails_like(...)
+
+PostgreSQL::Test::Utils::command_ok_or_fails_like with our connection parameters. See command_ok(...)
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+
+	my $self = shift;
+
+	local %ENV = $self->_get_env();
+
+	return PostgreSQL::Test::Utils::command_ok_or_fails_like(@_);
+}
+
+=pod
+
 =item $node->command_checks_all(...)
 
 PostgreSQL::Test::Utils::command_checks_all with our connection parameters. See
diff --git a/src/test/perl/PostgreSQL/Test/Utils.pm b/src/test/perl/PostgreSQL/Test/Utils.pm
index d1ad131eadf..7d7ca83495f 100644
--- a/src/test/perl/PostgreSQL/Test/Utils.pm
+++ b/src/test/perl/PostgreSQL/Test/Utils.pm
@@ -89,6 +89,7 @@ our @EXPORT = qw(
   command_like
   command_like_safe
   command_fails_like
+  command_ok_or_fails_like
   command_checks_all
 
   $windows_os
@@ -1067,6 +1068,30 @@ sub command_fails_like
 
 =pod
 
+=item command_ok_or_fails_like(cmd, expected_stdout, expected_stderr, test_name)
+
+Check that the command either succeeds or fails with an error that matches the
+given regular expressions.
+
+=cut
+
+sub command_ok_or_fails_like
+{
+	local $Test::Builder::Level = $Test::Builder::Level + 1;
+	my ($cmd, $expected_stdout, $expected_stderr, $test_name) = @_;
+	my ($stdout, $stderr);
+	print("# Running: " . join(" ", @{$cmd}) . "\n");
+	my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
+	if (!$result)
+	{
+		like($stdout, $expected_stdout, "$test_name: stdout matches");
+		like($stderr, $expected_stderr, "$test_name: stderr matches");
+	}
+	return $result;
+}
+
+=pod
+
 =item command_checks_all(cmd, ret, out, err, test_name)
 
 Run a command and check its status and outputs.
-- 
2.39.5 (Apple Git-154)

v8-0002-initdb-Add-no-sync-data-files.patch (text/plain)
From 1afc1225ce3e49b1da3d97ada50fa01444bdafc4 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v8 2/4] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjuction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 0e3cfede935..78e272916f5 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

v8-0003-pg_dump-Add-sequence-data.patch (text/plain)
From 4325f2786554c79480993284117bb583298127a3 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v8 3/4] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)

v8-0004-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain)
From 0bb275bea08d724a32d3f5154cd5d583b9c87ace Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 5 Mar 2025 17:36:54 -0600
Subject: [PATCH v8 4/4] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.  This mode also
complicates reverting to the old cluster, so we recommend restoring
from backup upon failure during or after file transfer.

The new mode is limited to clusters located in the same file system
and to upgrades from version 10 and newer.

Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml    |  59 ++++-
 src/bin/pg_upgrade/TESTING         |   6 +-
 src/bin/pg_upgrade/check.c         |  29 ++-
 src/bin/pg_upgrade/controldata.c   |  21 +-
 src/bin/pg_upgrade/dump.c          |   4 +-
 src/bin/pg_upgrade/file.c          |  14 +-
 src/bin/pg_upgrade/info.c          |   4 +-
 src/bin/pg_upgrade/option.c        |   7 +
 src/bin/pg_upgrade/pg_upgrade.c    |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h    |   5 +-
 src/bin/pg_upgrade/relfilenumber.c | 371 +++++++++++++++++++++++++++++
 src/bin/pg_upgrade/t/006_modes.pl  |  10 +
 src/common/file_utils.c            |  14 +-
 src/include/common/file_utils.h    |   1 +
 14 files changed, 527 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 5db761d1ff1..da261619043 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88daa808035..564a9116ca5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..b07f3330fee 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,274 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function moves the database directory from the old cluster to the new
+ * cluster in preparation for moving the pg_restore-generated catalog files
+ * into place.  Returns false if the database with the given OID does not have
+ * a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		/*
+		 * XXX: The below line is a hack to deal with the fact that we
+		 * presently don't have an easy way to find the corresponding new
+		 * tablespace's path.  This will need to be fixed if/when we add
+		 * pg_upgrade support for in-place tablespaces.
+		 */
+		new_tablespace = old_tablespace;
+
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the user last
+		 * shut it down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +497,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +625,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_modes.pl b/src/bin/pg_upgrade/t/006_modes.pl
index 518e0994145..34fddbcdab5 100644
--- a/src/bin/pg_upgrade/t/006_modes.pl
+++ b/src/bin/pg_upgrade/t/006_modes.pl
@@ -16,6 +16,15 @@ sub test_mode
 	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
 	my $new = PostgreSQL::Test::Cluster->new('new');
 
+	# --swap can't be used to upgrade from versions older than 10, so just skip
+	# the test if the old cluster version is too old.
+	if ($old->pg_version < 10 && $mode eq "--swap")
+	{
+		$old->clean_node();
+		$new->clean_node();
+		return;
+	}
+
 	if (defined($ENV{oldinstall}))
 	{
 		# Checksums are now enabled by default, but weren't before 18, so pass
@@ -97,5 +106,6 @@ test_mode('--clone');
 test_mode('--copy');
 test_mode('--copy-file-range');
 test_mode('--link');
+test_mode('--swap');
 
 done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 78e272916f5..4405ef8b425 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)

#38Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#37)
Re: optimize file transfer in pg_upgrade

On Wed, Mar 19, 2025 at 09:02:42PM -0500, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 04:28:23PM -0500, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 02:32:01PM -0500, Nathan Bossart wrote:

In addition to testing with in-place tablespaces, we might also want to
teach the transfer modes test to do cross-version testing when possible.
In that case, we can test normal (non-in-place) tablespaces. However, that
would be limited to the buildfarm.

Actually, this one was pretty easy to do.

And here is yet another new version of the full patch set. I'm planning to
commit 0001 (the new pg_upgrade transfer mode test) tomorrow so that I can
deal with any buildfarm indigestion before committing swap mode. I did run
the test locally for upgrades from v9.6, v13, and v17, but who knows what
unique configurations I've failed to anticipate...

As promised, I've committed just 0001 for now. I'll watch closely for any
issues in the buildfarm.

--
nathan

#39Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#38)
3 attachment(s)
Re: optimize file transfer in pg_upgrade

On Thu, Mar 20, 2025 at 11:11:46AM -0500, Nathan Bossart wrote:

As promised, I've committed just 0001 for now. I'll watch closely for any
issues in the buildfarm.

Seeing none, here is a rebased patch set without 0001. The only changes
are some fleshed-out comments and commit messages. I'm still aiming to
commit this sometime early next week.
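
For anyone who wants to try it out before then, the existing transfer-mode
test should exercise the new mode once the patches are applied, e.g. (run
from src/bin/pg_upgrade, per TESTING; set oldinstall first if you want a
cross-version upgrade):

	make check PG_TEST_PG_UPGRADE_MODE=--swap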

--
nathan

Attachments:

v9-0002-pg_dump-Add-sequence-data.patch (text/plain; charset=us-ascii)
From b085d8e10e8ce5fd7213a27d294bf7556e4d7430 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 11:25:28 -0600
Subject: [PATCH v9 2/3] pg_dump: Add --sequence-data.

This new option instructs pg_dump to dump sequence data when the
--no-data, --schema-only, or --statistics-only option is specified.
This was originally considered for commit a7e5457db8, but it was
left out at that time because there was no known use-case.  A
follow-up commit will use this to optimize pg_upgrade's file
transfer step.

Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pg_dump.sgml               | 11 +++++++++++
 src/bin/pg_dump/pg_dump.c                   | 10 ++--------
 src/bin/pg_dump/t/002_pg_dump.pl            |  1 +
 src/bin/pg_upgrade/dump.c                   |  2 +-
 src/test/modules/test_pg_dump/t/001_base.pl |  2 +-
 5 files changed, 16 insertions(+), 10 deletions(-)

diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
index 0ae40f9be58..63cca18711a 100644
--- a/doc/src/sgml/ref/pg_dump.sgml
+++ b/doc/src/sgml/ref/pg_dump.sgml
@@ -1298,6 +1298,17 @@ PostgreSQL documentation
        </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--sequence-data</option></term>
+      <listitem>
+       <para>
+        Include sequence data in the dump.  This is the default behavior except
+        when <option>--no-data</option>, <option>--schema-only</option>, or
+        <option>--statistics-only</option> is specified.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--serializable-deferrable</option></term>
       <listitem>
diff --git a/src/bin/pg_dump/pg_dump.c b/src/bin/pg_dump/pg_dump.c
index 428ed2d60fc..e6253331e27 100644
--- a/src/bin/pg_dump/pg_dump.c
+++ b/src/bin/pg_dump/pg_dump.c
@@ -518,6 +518,7 @@ main(int argc, char **argv)
 		{"sync-method", required_argument, NULL, 15},
 		{"filter", required_argument, NULL, 16},
 		{"exclude-extension", required_argument, NULL, 17},
+		{"sequence-data", no_argument, &dopt.sequence_data, 1},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -801,14 +802,6 @@ main(int argc, char **argv)
 	if (dopt.column_inserts && dopt.dump_inserts == 0)
 		dopt.dump_inserts = DUMP_DEFAULT_ROWS_PER_INSERT;
 
-	/*
-	 * Binary upgrade mode implies dumping sequence data even in schema-only
-	 * mode.  This is not exposed as a separate option, but kept separate
-	 * internally for clarity.
-	 */
-	if (dopt.binary_upgrade)
-		dopt.sequence_data = 1;
-
 	if (data_only && schema_only)
 		pg_fatal("options -s/--schema-only and -a/--data-only cannot be used together");
 	if (schema_only && statistics_only)
@@ -1275,6 +1268,7 @@ help(const char *progname)
 	printf(_("  --quote-all-identifiers      quote all identifiers, even if not key words\n"));
 	printf(_("  --rows-per-insert=NROWS      number of rows per INSERT; implies --inserts\n"));
 	printf(_("  --section=SECTION            dump named section (pre-data, data, or post-data)\n"));
+	printf(_("  --sequence-data              include sequence data in dump\n"));
 	printf(_("  --serializable-deferrable    wait until the dump can run without anomalies\n"));
 	printf(_("  --snapshot=SNAPSHOT          use given snapshot for the dump\n"));
 	printf(_("  --statistics-only            dump only the statistics, not schema or data\n"));
diff --git a/src/bin/pg_dump/t/002_pg_dump.pl b/src/bin/pg_dump/t/002_pg_dump.pl
index d281e27aa67..ed379033da7 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -66,6 +66,7 @@ my %pgdump_runs = (
 			'--file' => "$tempdir/binary_upgrade.dump",
 			'--no-password',
 			'--no-data',
+			'--sequence-data',
 			'--binary-upgrade',
 			'--dbname' => 'postgres',    # alternative way to specify database
 		],
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index 23fe7280a16..b8fd0d0acee 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,7 +52,7 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
 						   log_opts.verbose ? "--verbose" : "",
diff --git a/src/test/modules/test_pg_dump/t/001_base.pl b/src/test/modules/test_pg_dump/t/001_base.pl
index a9bcac4169d..adcaa419616 100644
--- a/src/test/modules/test_pg_dump/t/001_base.pl
+++ b/src/test/modules/test_pg_dump/t/001_base.pl
@@ -48,7 +48,7 @@ my %pgdump_runs = (
 		dump_cmd => [
 			'pg_dump', '--no-sync',
 			'--file' => "$tempdir/binary_upgrade.sql",
-			'--schema-only', '--binary-upgrade',
+			'--schema-only', '--sequence-data', '--binary-upgrade',
 			'--dbname' => 'postgres',
 		],
 	},
-- 
2.39.5 (Apple Git-154)
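
For readers skimming the patch, a minimal usage sketch of the new switch
(paths and database name are hypothetical) that mirrors the pg_upgrade
invocation updated in dump.c above:

	pg_dump --no-data --sequence-data --binary-upgrade \
		--quote-all-identifiers --format=custom --no-sync \
		--file=/tmp/db1.dump db1

With this patch, --binary-upgrade alone no longer implies dumping sequence
data, so a --no-data or --schema-only dump skips sequence values unless
--sequence-data is given explicitly; that is why the pg_upgrade call site in
dump.c now passes it.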

v9-0003-pg_upgrade-Add-swap-for-faster-file-transfer.patch (text/plain; charset=us-ascii)
From 209a6fc03ddc94ca879809ae0794fd1ace841224 Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Thu, 20 Mar 2025 14:22:27 -0500
Subject: [PATCH v9 3/3] pg_upgrade: Add --swap for faster file transfer.

This new option instructs pg_upgrade to move the data directories
from the old cluster to the new cluster and then to replace the
catalog files with those generated for the new cluster.  This mode
can outperform --link, --clone, --copy, and --copy-file-range,
especially on clusters with many relations.

However, this mode creates many garbage files in the old cluster,
which can prolong the file synchronization step.  To handle that,
we use "initdb --sync-only --no-sync-data-files" for file
synchronization, and we synchronize the catalog files as they are
transferred.  We assume that the database files transferred from
the old cluster were synchronized prior to upgrade.

This mode also complicates reverting to the old cluster, so we
recommend restoring from backup upon failure during or after file
transfer.  We did consider teaching pg_upgrade how to generate a
revert script for such failures, but we decided against it due to
the rarity of failing during file transfer, the complexity of
generating the script, and the potential for misusing the script.

The new mode is limited to clusters located in the same file
system.  With some effort, we could probably support upgrades
between different file systems, but this mode is unlikely to offer
much benefit if we have to copy the files across file system
boundaries.

It is also limited to upgrades from version 10 or newer.  There are
a few known obstacles for using swap mode to upgrade from older
versions.  For example, the visibility map format changed in v9.6,
and the sequence tuple format changed in v10.  In fact, swap mode
omits the --sequence-data option in its uses of pg_dump and instead
reuses the old cluster's sequence data files.  While teaching swap
mode to deal with these kinds of changes is surely possible (and we
may have to deal with similar problems in the future, anyway), it
doesn't seem worth the effort to support upgrades from
long-unsupported versions.

Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/pgupgrade.sgml            |  59 +++-
 src/bin/pg_upgrade/TESTING                 |   6 +-
 src/bin/pg_upgrade/check.c                 |  29 +-
 src/bin/pg_upgrade/controldata.c           |  21 +-
 src/bin/pg_upgrade/dump.c                  |   4 +-
 src/bin/pg_upgrade/file.c                  |  14 +-
 src/bin/pg_upgrade/info.c                  |   4 +-
 src/bin/pg_upgrade/option.c                |   7 +
 src/bin/pg_upgrade/pg_upgrade.c            |  16 +-
 src/bin/pg_upgrade/pg_upgrade.h            |   5 +-
 src/bin/pg_upgrade/relfilenumber.c         | 375 +++++++++++++++++++++
 src/bin/pg_upgrade/t/006_transfer_modes.pl |  10 +
 src/common/file_utils.c                    |  14 +-
 src/include/common/file_utils.h            |   1 +
 14 files changed, 531 insertions(+), 34 deletions(-)

diff --git a/doc/src/sgml/ref/pgupgrade.sgml b/doc/src/sgml/ref/pgupgrade.sgml
index 5db761d1ff1..da261619043 100644
--- a/doc/src/sgml/ref/pgupgrade.sgml
+++ b/doc/src/sgml/ref/pgupgrade.sgml
@@ -244,7 +244,8 @@ PostgreSQL documentation
       <listitem>
        <para>
         Copy files to the new cluster.  This is the default.  (See also
-        <option>--link</option> and <option>--clone</option>.)
+        <option>--link</option>, <option>--clone</option>,
+        <option>--copy-file-range</option>, and <option>--swap</option>.)
        </para>
       </listitem>
      </varlistentry>
@@ -262,6 +263,32 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry>
+      <term><option>--swap</option></term>
+      <listitem>
+       <para>
+        Move the data directories from the old cluster to the new cluster.
+        Then, replace the catalog files with those generated for the new
+        cluster.  This mode can outperform <option>--link</option>,
+        <option>--clone</option>, <option>--copy</option>, and
+        <option>--copy-file-range</option>, especially on clusters with many
+        relations.
+       </para>
+       <para>
+        However, this mode creates many garbage files in the old cluster, which
+        can prolong the file synchronization step if
+        <option>--sync-method=syncfs</option> is used.  Therefore, it is
+        recommended to use <option>--sync-method=fsync</option> with
+        <option>--swap</option>.
+       </para>
+       <para>
+        Additionally, once the file transfer step begins, the old cluster will
+        be destructively modified and therefore will no longer be safe to
+        start.  See <xref linkend="pgupgrade-step-revert"/> for details.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry>
       <term><option>--sync-method=</option><replaceable>method</replaceable></term>
       <listitem>
@@ -530,6 +557,10 @@ NET STOP postgresql-&majorversion;
      is started.  Clone mode also requires that the old and new data
      directories be in the same file system.  This mode is only available
      on certain operating systems and file systems.
+     Swap mode may be the fastest if there are many relations, but you will not
+     be able to access your old cluster once the file transfer step begins.
+     Swap mode also requires that the old and new cluster data directories be
+     in the same file system.
     </para>
 
     <para>
@@ -889,6 +920,32 @@ psql --username=postgres --file=script.sql postgres
 
         </itemizedlist></para>
       </listitem>
+
+      <listitem>
+       <para>
+        If the <option>--swap</option> option was used, the old cluster might
+        be destructively modified:
+
+        <itemizedlist>
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> aborts before reporting that the
+           old cluster is no longer safe to start, the old cluster was
+           unmodified; it can be restarted.
+          </para>
+         </listitem>
+
+         <listitem>
+          <para>
+           If <command>pg_upgrade</command> has reported that the old cluster
+           is no longer safe to start, the old cluster was destructively
+           modified.  The old cluster will need to be restored from backup in
+           this case.
+          </para>
+         </listitem>
+        </itemizedlist>
+       </para>
+      </listitem>
      </itemizedlist></para>
    </step>
   </procedure>
diff --git a/src/bin/pg_upgrade/TESTING b/src/bin/pg_upgrade/TESTING
index 00842ac6ec3..c3d463c9c29 100644
--- a/src/bin/pg_upgrade/TESTING
+++ b/src/bin/pg_upgrade/TESTING
@@ -20,13 +20,13 @@ export oldinstall=...otherversion/	(old version's install base path)
 See DETAILS below for more information about creation of the dump.
 
 You can also test the different transfer modes (--copy, --link,
---clone, --copy-file-range) by setting the environment variable
+--clone, --copy-file-range, --swap) by setting the environment variable
 PG_TEST_PG_UPGRADE_MODE to the respective command-line option, like
 
 	make check PG_TEST_PG_UPGRADE_MODE=--link
 
-The default is --copy.  Note that the other modes are not supported on
-all operating systems.
+The default is --copy.  Note that not all modes are supported on all
+operating systems.
 
 DETAILS
 -------
diff --git a/src/bin/pg_upgrade/check.c b/src/bin/pg_upgrade/check.c
index 88daa808035..564a9116ca5 100644
--- a/src/bin/pg_upgrade/check.c
+++ b/src/bin/pg_upgrade/check.c
@@ -709,7 +709,34 @@ check_new_cluster(void)
 			check_copy_file_range();
 			break;
 		case TRANSFER_MODE_LINK:
-			check_hard_link();
+			check_hard_link(TRANSFER_MODE_LINK);
+			break;
+		case TRANSFER_MODE_SWAP:
+
+			/*
+			 * We do the hard link check for --swap, too, since it's an easy
+			 * way to verify the clusters are in the same file system.  This
+			 * allows us to take some shortcuts in the file synchronization
+			 * step.  With some more effort, we could probably support the
+			 * separate-file-system use case, but this mode is unlikely to
+			 * offer much benefit if we have to copy the files across file
+			 * system boundaries.
+			 */
+			check_hard_link(TRANSFER_MODE_SWAP);
+
+			/*
+			 * There are a few known issues with using --swap to upgrade from
+			 * versions older than 10.  For example, the sequence tuple format
+			 * changed in v10, and the visibility map format changed in 9.6.
+			 * While such problems are not insurmountable (and we may have to
+			 * deal with similar problems in the future, anyway), it doesn't
+			 * seem worth the effort to support swap mode for upgrades from
+			 * long-unsupported versions.
+			 */
+			if (GET_MAJOR_VERSION(old_cluster.major_version) < 1000)
+				pg_fatal("Swap mode can only upgrade clusters from PostgreSQL version %s and later.",
+						 "10");
+
 			break;
 	}
 
diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
index bd49ea867bf..47ee27ec835 100644
--- a/src/bin/pg_upgrade/controldata.c
+++ b/src/bin/pg_upgrade/controldata.c
@@ -751,7 +751,7 @@ check_control_data(ControlData *oldctrl,
 
 
 void
-disable_old_cluster(void)
+disable_old_cluster(transferMode transfer_mode)
 {
 	char		old_path[MAXPGPATH],
 				new_path[MAXPGPATH];
@@ -766,10 +766,17 @@ disable_old_cluster(void)
 				 old_path, new_path);
 	check_ok();
 
-	pg_log(PG_REPORT, "\n"
-		   "If you want to start the old cluster, you will need to remove\n"
-		   "the \".old\" suffix from %s/global/pg_control.old.\n"
-		   "Because \"link\" mode was used, the old cluster cannot be safely\n"
-		   "started once the new cluster has been started.",
-		   old_cluster.pgdata);
+	if (transfer_mode == TRANSFER_MODE_LINK)
+		pg_log(PG_REPORT, "\n"
+			   "If you want to start the old cluster, you will need to remove\n"
+			   "the \".old\" suffix from %s/global/pg_control.old.\n"
+			   "Because \"link\" mode was used, the old cluster cannot be safely\n"
+			   "started once the new cluster has been started.",
+			   old_cluster.pgdata);
+	else if (transfer_mode == TRANSFER_MODE_SWAP)
+		pg_log(PG_REPORT, "\n"
+			   "Because \"swap\" mode was used, the old cluster can no longer be\n"
+			   "safely started.");
+	else
+		pg_fatal("unrecognized transfer mode");
 }
diff --git a/src/bin/pg_upgrade/dump.c b/src/bin/pg_upgrade/dump.c
index b8fd0d0acee..23cb08e8347 100644
--- a/src/bin/pg_upgrade/dump.c
+++ b/src/bin/pg_upgrade/dump.c
@@ -52,9 +52,11 @@ generate_old_dump(void)
 		snprintf(log_file_name, sizeof(log_file_name), DB_DUMP_LOG_FILE_MASK, old_db->db_oid);
 
 		parallel_exec_prog(log_file_name, NULL,
-						   "\"%s/pg_dump\" %s --no-data %s --sequence-data --quote-all-identifiers "
+						   "\"%s/pg_dump\" %s --no-data %s %s --quote-all-identifiers "
 						   "--binary-upgrade --format=custom %s --no-sync --file=\"%s/%s\" %s",
 						   new_cluster.bindir, cluster_conn_opts(&old_cluster),
+						   (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+						   "" : "--sequence-data",
 						   log_opts.verbose ? "--verbose" : "",
 						   user_opts.do_statistics ? "" : "--no-statistics",
 						   log_opts.dumpdir,
diff --git a/src/bin/pg_upgrade/file.c b/src/bin/pg_upgrade/file.c
index 7fd1991204a..91ed16acb08 100644
--- a/src/bin/pg_upgrade/file.c
+++ b/src/bin/pg_upgrade/file.c
@@ -434,7 +434,7 @@ check_copy_file_range(void)
 }
 
 void
-check_hard_link(void)
+check_hard_link(transferMode transfer_mode)
 {
 	char		existing_file[MAXPGPATH];
 	char		new_link_file[MAXPGPATH];
@@ -444,8 +444,16 @@ check_hard_link(void)
 	unlink(new_link_file);		/* might fail */
 
 	if (link(existing_file, new_link_file) < 0)
-		pg_fatal("could not create hard link between old and new data directories: %m\n"
-				 "In link mode the old and new data directories must be on the same file system.");
+	{
+		if (transfer_mode == TRANSFER_MODE_LINK)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In link mode the old and new data directories must be on the same file system.");
+		else if (transfer_mode == TRANSFER_MODE_SWAP)
+			pg_fatal("could not create hard link between old and new data directories: %m\n"
+					 "In swap mode the old and new data directories must be on the same file system.");
+		else
+			pg_fatal("unrecognized transfer mode");
+	}
 
 	unlink(new_link_file);
 }
diff --git a/src/bin/pg_upgrade/info.c b/src/bin/pg_upgrade/info.c
index ad52de8b607..4b7a56f5b3b 100644
--- a/src/bin/pg_upgrade/info.c
+++ b/src/bin/pg_upgrade/info.c
@@ -490,7 +490,7 @@ get_rel_infos_query(void)
 					  "  FROM pg_catalog.pg_class c JOIN pg_catalog.pg_namespace n "
 					  "         ON c.relnamespace = n.oid "
 					  "  WHERE relkind IN (" CppAsString2(RELKIND_RELATION) ", "
-					  CppAsString2(RELKIND_MATVIEW) ") AND "
+					  CppAsString2(RELKIND_MATVIEW) "%s) AND "
 	/* exclude possible orphaned temp tables */
 					  "    ((n.nspname !~ '^pg_temp_' AND "
 					  "      n.nspname !~ '^pg_toast_temp_' AND "
@@ -499,6 +499,8 @@ get_rel_infos_query(void)
 					  "      c.oid >= %u::pg_catalog.oid) OR "
 					  "     (n.nspname = 'pg_catalog' AND "
 					  "      relname IN ('pg_largeobject') ))), ",
+					  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+					  ", " CppAsString2(RELKIND_SEQUENCE) : "",
 					  FirstNormalObjectId);
 
 	/*
diff --git a/src/bin/pg_upgrade/option.c b/src/bin/pg_upgrade/option.c
index 188dd8d8a8b..7fd7f1d33fc 100644
--- a/src/bin/pg_upgrade/option.c
+++ b/src/bin/pg_upgrade/option.c
@@ -62,6 +62,7 @@ parseCommandLine(int argc, char *argv[])
 		{"sync-method", required_argument, NULL, 4},
 		{"no-statistics", no_argument, NULL, 5},
 		{"set-char-signedness", required_argument, NULL, 6},
+		{"swap", no_argument, NULL, 7},
 
 		{NULL, 0, NULL, 0}
 	};
@@ -228,6 +229,11 @@ parseCommandLine(int argc, char *argv[])
 				else
 					pg_fatal("invalid argument for option %s", "--set-char-signedness");
 				break;
+
+			case 7:
+				user_opts.transfer_mode = TRANSFER_MODE_SWAP;
+				break;
+
 			default:
 				fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 						os_info.progname);
@@ -325,6 +331,7 @@ usage(void)
 	printf(_("  --no-statistics               do not import statistics from old cluster\n"));
 	printf(_("  --set-char-signedness=OPTION  set new cluster char signedness to \"signed\" or\n"
 			 "                                \"unsigned\"\n"));
+	printf(_("  --swap                        move data directories to new cluster\n"));
 	printf(_("  --sync-method=METHOD          set method for syncing files to disk\n"));
 	printf(_("  -?, --help                    show this help, then exit\n"));
 	printf(_("\n"
diff --git a/src/bin/pg_upgrade/pg_upgrade.c b/src/bin/pg_upgrade/pg_upgrade.c
index 174cd920840..9295e46aed3 100644
--- a/src/bin/pg_upgrade/pg_upgrade.c
+++ b/src/bin/pg_upgrade/pg_upgrade.c
@@ -170,12 +170,14 @@ main(int argc, char **argv)
 
 	/*
 	 * Most failures happen in create_new_objects(), which has completed at
-	 * this point.  We do this here because it is just before linking, which
-	 * will link the old and new cluster data files, preventing the old
-	 * cluster from being safely started once the new cluster is started.
+	 * this point.  We do this here because it is just before file transfer,
+	 * which for --link will make it unsafe to start the old cluster once the
+	 * new cluster is started, and for --swap will make it unsafe to start the
+	 * old cluster at all.
 	 */
-	if (user_opts.transfer_mode == TRANSFER_MODE_LINK)
-		disable_old_cluster();
+	if (user_opts.transfer_mode == TRANSFER_MODE_LINK ||
+		user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+		disable_old_cluster(user_opts.transfer_mode);
 
 	transfer_all_new_tablespaces(&old_cluster.dbarr, &new_cluster.dbarr,
 								 old_cluster.pgdata, new_cluster.pgdata);
@@ -212,8 +214,10 @@ main(int argc, char **argv)
 	{
 		prep_status("Sync data directory to disk");
 		exec_prog(UTILITY_LOG_FILE, NULL, true, true,
-				  "\"%s/initdb\" --sync-only \"%s\" --sync-method %s",
+				  "\"%s/initdb\" --sync-only %s \"%s\" --sync-method %s",
 				  new_cluster.bindir,
+				  (user_opts.transfer_mode == TRANSFER_MODE_SWAP) ?
+				  "--no-sync-data-files" : "",
 				  new_cluster.pgdata,
 				  user_opts.sync_method);
 		check_ok();
diff --git a/src/bin/pg_upgrade/pg_upgrade.h b/src/bin/pg_upgrade/pg_upgrade.h
index 4c9d0172149..69c965bb7d0 100644
--- a/src/bin/pg_upgrade/pg_upgrade.h
+++ b/src/bin/pg_upgrade/pg_upgrade.h
@@ -262,6 +262,7 @@ typedef enum
 	TRANSFER_MODE_COPY,
 	TRANSFER_MODE_COPY_FILE_RANGE,
 	TRANSFER_MODE_LINK,
+	TRANSFER_MODE_SWAP,
 } transferMode;
 
 /*
@@ -391,7 +392,7 @@ void		create_script_for_old_cluster_deletion(char **deletion_script_file_name);
 
 void		get_control_data(ClusterInfo *cluster);
 void		check_control_data(ControlData *oldctrl, ControlData *newctrl);
-void		disable_old_cluster(void);
+void		disable_old_cluster(transferMode transfer_mode);
 
 
 /* dump.c */
@@ -423,7 +424,7 @@ void		rewriteVisibilityMap(const char *fromfile, const char *tofile,
 								 const char *schemaName, const char *relName);
 void		check_file_clone(void);
 void		check_copy_file_range(void);
-void		check_hard_link(void);
+void		check_hard_link(transferMode transfer_mode);
 
 /* fopen_priv() is no longer different from fopen() */
 #define fopen_priv(path, mode)	fopen(path, mode)
diff --git a/src/bin/pg_upgrade/relfilenumber.c b/src/bin/pg_upgrade/relfilenumber.c
index 8c23c583172..c0affa5565c 100644
--- a/src/bin/pg_upgrade/relfilenumber.c
+++ b/src/bin/pg_upgrade/relfilenumber.c
@@ -11,11 +11,92 @@
 
 #include <sys/stat.h>
 
+#include "common/file_perm.h"
+#include "common/file_utils.h"
+#include "common/int.h"
+#include "common/logging.h"
 #include "pg_upgrade.h"
 
 static void transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace);
 static void transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_frozenbit);
 
+/*
+ * The following set of sync_queue_* functions are used for --swap to reduce
+ * the amount of time spent synchronizing the swapped catalog files.  When a
+ * file is added to the queue, we also alert the file system that we'd like it
+ * to be persisted to disk in the near future (if that operation is supported
+ * by the current platform).  Once the queue is full, all of the files are
+ * synchronized to disk.  This strategy should generally be much faster than
+ * simply calling fsync() on the files right away.
+ *
+ * The general usage pattern should be something like:
+ *
+ *     for (int i = 0; i < num_files; i++)
+ *         sync_queue_push(files[i]);
+ *
+ *     // be sure to sync any remaining files in the queue
+ *     sync_queue_sync_all();
+ *     sync_queue_destroy();
+ */
+
+#define SYNC_QUEUE_MAX_LEN	(1024)
+
+static char *sync_queue[SYNC_QUEUE_MAX_LEN];
+static bool sync_queue_inited;
+static int	sync_queue_len;
+
+static inline void
+sync_queue_init(void)
+{
+	if (sync_queue_inited)
+		return;
+
+	sync_queue_inited = true;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+		sync_queue[i] = palloc(MAXPGPATH);
+}
+
+static inline void
+sync_queue_sync_all(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	for (int i = 0; i < sync_queue_len; i++)
+	{
+		if (fsync_fname(sync_queue[i], false) != 0)
+			pg_fatal("could not synchronize file \"%s\": %m", sync_queue[i]);
+	}
+
+	sync_queue_len = 0;
+}
+
+static inline void
+sync_queue_push(const char *fname)
+{
+	sync_queue_init();
+
+	pre_sync_fname(fname, false);
+
+	strncpy(sync_queue[sync_queue_len++], fname, MAXPGPATH);
+	if (sync_queue_len >= SYNC_QUEUE_MAX_LEN)
+		sync_queue_sync_all();
+}
+
+static inline void
+sync_queue_destroy(void)
+{
+	if (!sync_queue_inited)
+		return;
+
+	sync_queue_inited = false;
+	sync_queue_len = 0;
+	for (int i = 0; i < SYNC_QUEUE_MAX_LEN; i++)
+	{
+		pfree(sync_queue[i]);
+		sync_queue[i] = NULL;
+	}
+}
 
 /*
  * transfer_all_new_tablespaces()
@@ -41,6 +122,9 @@ transfer_all_new_tablespaces(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		case TRANSFER_MODE_LINK:
 			prep_status_progress("Linking user relation files");
 			break;
+		case TRANSFER_MODE_SWAP:
+			prep_status_progress("Swapping data directories");
+			break;
 	}
 
 	/*
@@ -125,6 +209,278 @@ transfer_all_new_dbs(DbInfoArr *old_db_arr, DbInfoArr *new_db_arr,
 		/* We allocate something even for n_maps == 0 */
 		pg_free(mappings);
 	}
+
+	/*
+	 * Make sure anything pending synchronization in swap mode is fully
+	 * persisted to disk.  This is a no-op for other transfer modes.
+	 */
+	sync_queue_sync_all();
+	sync_queue_destroy();
+}
+
+/*
+ * prepare_for_swap()
+ *
+ * This function moves the database directory from the old cluster to the new
+ * cluster in preparation for moving the pg_restore-generated catalog files
+ * into place.  Returns false if the database with the given OID does not have
+ * a directory in the given tablespace, otherwise returns true.
+ *
+ * old_cat (the directory for the old catalog files), new_dat (the database
+ * directory in the new cluster), and moved_dat (the destination for the
+ * pg_restore-generated database directory) should be sized to MAXPGPATH bytes.
+ * This function will return the appropriate paths in those variables.
+ */
+static bool
+prepare_for_swap(const char *old_tablespace, Oid db_oid,
+				 char *old_cat, char *new_dat, char *moved_dat)
+{
+	const char *new_tablespace;
+	const char *old_tblspc_suffix;
+	const char *new_tblspc_suffix;
+	char		old_tblspc[MAXPGPATH];
+	char		new_tblspc[MAXPGPATH];
+	char		moved_tblspc[MAXPGPATH];
+	char		old_dat[MAXPGPATH];
+	struct stat st;
+
+	if (strcmp(old_tablespace, old_cluster.pgdata) == 0)
+	{
+		new_tablespace = new_cluster.pgdata;
+		new_tblspc_suffix = "/base";
+		old_tblspc_suffix = "/base";
+	}
+	else
+	{
+		/*
+		 * XXX: The below line is a hack to deal with the fact that we
+		 * presently don't have an easy way to find the corresponding new
+		 * tablespace's path.  This will need to be fixed if/when we add
+		 * pg_upgrade support for in-place tablespaces.
+		 */
+		new_tablespace = old_tablespace;
+
+		new_tblspc_suffix = new_cluster.tablespace_suffix;
+		old_tblspc_suffix = old_cluster.tablespace_suffix;
+	}
+
+	/* Old and new cluster paths. */
+	snprintf(old_tblspc, sizeof(old_tblspc), "%s%s", old_tablespace, old_tblspc_suffix);
+	snprintf(new_tblspc, sizeof(new_tblspc), "%s%s", new_tablespace, new_tblspc_suffix);
+	snprintf(old_dat, sizeof(old_dat), "%s/%u", old_tblspc, db_oid);
+	snprintf(new_dat, MAXPGPATH, "%s/%u", new_tblspc, db_oid);
+
+	/*
+	 * Paths for "moved aside" stuff.  We intentionally put these in the old
+	 * cluster so that the delete_old_cluster.{sh,bat} script handles them.
+	 */
+	snprintf(moved_tblspc, sizeof(moved_tblspc), "%s/moved_for_upgrade", old_tblspc);
+	snprintf(old_cat, MAXPGPATH, "%s/%u_old_catalogs", moved_tblspc, db_oid);
+	snprintf(moved_dat, MAXPGPATH, "%s/%u", moved_tblspc, db_oid);
+
+	/* Check that the database directory exists in the given tablespace. */
+	if (stat(old_dat, &st) != 0)
+	{
+		if (errno != ENOENT)
+			pg_fatal("could not stat file \"%s\": %m", old_dat);
+		return false;
+	}
+
+	/* Create directory for stuff that is moved aside. */
+	if (pg_mkdir_p(moved_tblspc, pg_dir_create_mode) != 0 && errno != EEXIST)
+		pg_fatal("could not create directory \"%s\"", moved_tblspc);
+
+	/* Create directory for old catalog files. */
+	if (pg_mkdir_p(old_cat, pg_dir_create_mode) != 0)
+		pg_fatal("could not create directory \"%s\"", old_cat);
+
+	/* Move the new cluster's database directory aside. */
+	if (rename(new_dat, moved_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", new_dat, moved_dat);
+
+	/* Move the old cluster's database directory into place. */
+	if (rename(old_dat, new_dat) != 0)
+		pg_fatal("could not rename \"%s\" to \"%s\"", old_dat, new_dat);
+
+	return true;
+}
+
+/*
+ * FileNameMapCmp()
+ *
+ * qsort() comparator for FileNameMap that sorts by RelFileNumber.
+ */
+static int
+FileNameMapCmp(const void *a, const void *b)
+{
+	const FileNameMap *map1 = (const FileNameMap *) a;
+	const FileNameMap *map2 = (const FileNameMap *) b;
+
+	return pg_cmp_u32(map1->relfilenumber, map2->relfilenumber);
+}
+
+/*
+ * parse_relfilenumber()
+ *
+ * Attempt to parse the RelFileNumber of the given file name.  If we can't,
+ * return InvalidRelFileNumber.  Note that this code snippet is lifted from
+ * parse_filename_for_nontemp_relation().
+ */
+static RelFileNumber
+parse_relfilenumber(const char *filename)
+{
+	char	   *endp;
+	unsigned long n;
+
+	if (filename[0] < '1' || filename[0] > '9')
+		return InvalidRelFileNumber;
+
+	errno = 0;
+	n = strtoul(filename, &endp, 10);
+	if (errno || filename == endp || n <= 0 || n > PG_UINT32_MAX)
+		return InvalidRelFileNumber;
+
+	return (RelFileNumber) n;
+}
+
+/*
+ * swap_catalog_files()
+ *
+ * Moves the old catalog files aside, and moves the new catalog files into
+ * place.  prepare_for_swap() should have already been called (and returned
+ * true) for the tablespace being transferred.  old_cat (the directory for the
+ * old catalog files), new_dat (the database directory in the new cluster), and
+ * moved_dat (the location of the moved-aside pg_restore-generated database
+ * directory) should be the variables returned by prepare_for_swap().
+ */
+static void
+swap_catalog_files(FileNameMap *maps, int size, const char *old_cat,
+				   const char *new_dat, const char *moved_dat)
+{
+	DIR		   *dir;
+	struct dirent *de;
+	char		path[MAXPGPATH];
+	char		dest[MAXPGPATH];
+	RelFileNumber rfn;
+
+	/* Move the old catalog files aside. */
+	dir = opendir(new_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", new_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", new_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", old_cat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", new_dat);
+	(void) closedir(dir);
+
+	/* Move the new catalog files into place. */
+	dir = opendir(moved_dat);
+	if (dir == NULL)
+		pg_fatal("could not open directory \"%s\": %m", moved_dat);
+	while (errno = 0, (de = readdir(dir)) != NULL)
+	{
+		snprintf(path, sizeof(path), "%s/%s", moved_dat, de->d_name);
+		if (get_dirent_type(path, de, false, PG_LOG_ERROR) != PGFILETYPE_REG)
+			continue;
+
+		rfn = parse_relfilenumber(de->d_name);
+		if (RelFileNumberIsValid(rfn))
+		{
+			FileNameMap key = {.relfilenumber = rfn};
+
+			if (bsearch(&key, maps, size, sizeof(FileNameMap), FileNameMapCmp))
+				continue;
+		}
+
+		snprintf(dest, sizeof(dest), "%s/%s", new_dat, de->d_name);
+		if (rename(path, dest) != 0)
+			pg_fatal("could not rename \"%s\" to \"%s\": %m", path, dest);
+
+		/*
+		 * We don't fsync() the database files in the file synchronization
+		 * stage of pg_upgrade in swap mode, so we need to synchronize them
+		 * ourselves.  We only do this for the catalog files because they were
+		 * created during pg_restore with fsync=off.  We assume that the user
+		 * data files were properly persisted to disk when the user last
+		 * shut it down.
+		 */
+		if (user_opts.do_sync)
+			sync_queue_push(dest);
+	}
+	if (errno)
+		pg_fatal("could not read directory \"%s\": %m", moved_dat);
+	(void) closedir(dir);
+
+	/* Ensure the directory entries are persisted to disk. */
+	if (fsync_fname(new_dat, true) != 0)
+		pg_fatal("could not synchronize directory \"%s\": %m", new_dat);
+	if (fsync_parent_path(new_dat) != 0)
+		pg_fatal("could not synchronize parent directory of \"%s\": %m", new_dat);
+}
+
+/*
+ * do_swap()
+ *
+ * Perform the required steps for --swap for a single database.  In short this
+ * moves the old cluster's database directory into the new cluster and then
+ * replaces any files for system catalogs with the ones that were generated
+ * during pg_restore.
+ */
+static void
+do_swap(FileNameMap *maps, int size, char *old_tablespace)
+{
+	char		old_cat[MAXPGPATH];
+	char		new_dat[MAXPGPATH];
+	char		moved_dat[MAXPGPATH];
+
+	/*
+	 * We perform many lookups on maps by relfilenumber in swap mode, so make
+	 * sure it's sorted by relfilenumber.  maps should already be sorted by
+	 * OID, so in general this shouldn't have much work to do.
+	 */
+	qsort(maps, size, sizeof(FileNameMap), FileNameMapCmp);
+
+	/*
+	 * If an old tablespace is given, we only need to process that one.  If no
+	 * old tablespace is specified, we need to process all the tablespaces on
+	 * the system.
+	 */
+	if (old_tablespace)
+	{
+		if (prepare_for_swap(old_tablespace, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+	}
+	else
+	{
+		if (prepare_for_swap(old_cluster.pgdata, maps[0].db_oid,
+							 old_cat, new_dat, moved_dat))
+			swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+
+		for (int tblnum = 0; tblnum < os_info.num_old_tablespaces; tblnum++)
+		{
+			if (prepare_for_swap(os_info.old_tablespaces[tblnum], maps[0].db_oid,
+								 old_cat, new_dat, moved_dat))
+				swap_catalog_files(maps, size, old_cat, new_dat, moved_dat);
+		}
+	}
 }
 
 /*
@@ -145,6 +501,20 @@ transfer_single_new_db(FileNameMap *maps, int size, char *old_tablespace)
 		new_cluster.controldata.cat_ver >= VISIBILITY_MAP_FROZEN_BIT_CAT_VER)
 		vm_must_add_frozenbit = true;
 
+	/* --swap has its own subroutine */
+	if (user_opts.transfer_mode == TRANSFER_MODE_SWAP)
+	{
+		/*
+		 * We don't support --swap to upgrade from versions that require
+		 * rewriting the visibility map.  We should've failed already if
+		 * someone tries to do that.
+		 */
+		Assert(!vm_must_add_frozenbit);
+
+		do_swap(maps, size, old_tablespace);
+		return;
+	}
+
 	for (mapnum = 0; mapnum < size; mapnum++)
 	{
 		if (old_tablespace == NULL ||
@@ -259,6 +629,11 @@ transfer_relfile(FileNameMap *map, const char *type_suffix, bool vm_must_add_fro
 					pg_log(PG_VERBOSE, "linking \"%s\" to \"%s\"",
 						   old_file, new_file);
 					linkFile(old_file, new_file, map->nspname, map->relname);
+					break;
+				case TRANSFER_MODE_SWAP:
+					/* swap mode is handled in its own code path */
+					pg_fatal("should never happen");
+					break;
 			}
 	}
 }
diff --git a/src/bin/pg_upgrade/t/006_transfer_modes.pl b/src/bin/pg_upgrade/t/006_transfer_modes.pl
index 518e0994145..34fddbcdab5 100644
--- a/src/bin/pg_upgrade/t/006_transfer_modes.pl
+++ b/src/bin/pg_upgrade/t/006_transfer_modes.pl
@@ -16,6 +16,15 @@ sub test_mode
 	my $old = PostgreSQL::Test::Cluster->new('old', install_path => $ENV{oldinstall});
 	my $new = PostgreSQL::Test::Cluster->new('new');
 
+	# --swap can't be used to upgrade from versions older than 10, so just skip
+	# the test if the old cluster version is too old.
+	if ($old->pg_version < 10 && $mode eq "--swap")
+	{
+		$old->clean_node();
+		$new->clean_node();
+		return;
+	}
+
 	if (defined($ENV{oldinstall}))
 	{
 		# Checksums are now enabled by default, but weren't before 18, so pass
@@ -97,5 +106,6 @@ test_mode('--clone');
 test_mode('--copy');
 test_mode('--copy-file-range');
 test_mode('--link');
+test_mode('--swap');
 
 done_testing();
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index 1e6250cc190..7b62687a2aa 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -45,9 +45,6 @@
  */
 #define MINIMUM_VERSION_FOR_PG_WAL	100000
 
-#ifdef PG_FLUSH_DATA_WORKS
-static int	pre_sync_fname(const char *fname, bool isdir);
-#endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
 					bool process_symlinks,
@@ -352,16 +349,16 @@ walkdir(const char *path,
 }
 
 /*
- * Hint to the OS that it should get ready to fsync() this file.
+ * Hint to the OS that it should get ready to fsync() this file, if supported
+ * by the platform.
  *
  * Ignores errors trying to open unreadable files, and reports other errors
  * non-fatally.
  */
-#ifdef PG_FLUSH_DATA_WORKS
-
-static int
+int
 pre_sync_fname(const char *fname, bool isdir)
 {
+#ifdef PG_FLUSH_DATA_WORKS
 	int			fd;
 
 	fd = open(fname, O_RDONLY | PG_BINARY, 0);
@@ -388,11 +385,10 @@ pre_sync_fname(const char *fname, bool isdir)
 #endif
 
 	(void) close(fd);
+#endif							/* PG_FLUSH_DATA_WORKS */
 	return 0;
 }
 
-#endif							/* PG_FLUSH_DATA_WORKS */
-
 /*
  * fsync_fname -- Try to fsync a file or directory
  *
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index 8274bc877ab..9fd88953e43 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -33,6 +33,7 @@ typedef enum DataDirSyncMethod
 struct iovec;					/* avoid including port/pg_iovec.h here */
 
 #ifdef FRONTEND
+extern int	pre_sync_fname(const char *fname, bool isdir);
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
 						DataDirSyncMethod sync_method, bool sync_data_files);
-- 
2.39.5 (Apple Git-154)
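
To make the intended invocation concrete, here is a rough sketch with
hypothetical paths (the bin and data directories are placeholders; --swap
and the advice to pair it with --sync-method=fsync come from the patch and
its documentation changes):

	pg_upgrade --swap --sync-method=fsync \
		-b /usr/pgsql-17/bin -B /usr/pgsql-18/bin \
		-d /srv/pg17/data -D /srv/pg18/data -j 8

fsync is suggested because syncfs processes whole file systems and would
also sweep up the many garbage files left behind in the old cluster,
prolonging the sync step.  Internally, the sync step becomes
"initdb --sync-only --no-sync-data-files" per the pg_upgrade.c change, and,
as the docs note, once file transfer begins the old cluster is no longer
safe to start.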

v9-0001-initdb-Add-no-sync-data-files.patch (text/plain; charset=us-ascii)
From bace00bcbf3d96bf5cc9f8865e9d1650f2d48ace Mon Sep 17 00:00:00 2001
From: Nathan Bossart <nathan@postgresql.org>
Date: Wed, 19 Feb 2025 09:14:51 -0600
Subject: [PATCH v9 1/3] initdb: Add --no-sync-data-files.

This new option instructs initdb to skip synchronizing any files
in database directories and the database directories themselves,
i.e., everything in the base/ subdirectory and any other
tablespace directories.  Other files, such as those in pg_wal/ and
pg_xact/, will still be synchronized unless --no-sync is also
specified.  --no-sync-data-files is primarily intended for internal
use by tools that separately ensure the skipped files are
synchronized to disk.  A follow-up commit will use this to help
optimize pg_upgrade's file transfer step.

Reviewed-by: Greg Sabino Mullane <htamfids@gmail.com>
Reviewed-by: Bruce Momjian <bruce@momjian.us>
Reviewed-by: Robert Haas <robertmhaas@gmail.com>
Discussion: https://postgr.es/m/Zyvop-LxLXBLrZil%40nathan
---
 doc/src/sgml/ref/initdb.sgml                | 27 +++++++
 src/bin/initdb/initdb.c                     | 10 ++-
 src/bin/initdb/t/001_initdb.pl              |  1 +
 src/bin/pg_basebackup/pg_basebackup.c       |  2 +-
 src/bin/pg_checksums/pg_checksums.c         |  2 +-
 src/bin/pg_combinebackup/pg_combinebackup.c |  2 +-
 src/bin/pg_rewind/file_ops.c                |  2 +-
 src/common/file_utils.c                     | 85 +++++++++++++--------
 src/include/common/file_utils.h             |  2 +-
 9 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/doc/src/sgml/ref/initdb.sgml b/doc/src/sgml/ref/initdb.sgml
index 0026318485a..2f1f9a42f90 100644
--- a/doc/src/sgml/ref/initdb.sgml
+++ b/doc/src/sgml/ref/initdb.sgml
@@ -527,6 +527,33 @@ PostgreSQL documentation
       </listitem>
      </varlistentry>
 
+     <varlistentry id="app-initdb-option-no-sync-data-files">
+      <term><option>--no-sync-data-files</option></term>
+      <listitem>
+       <para>
+        By default, <command>initdb</command> safely writes all database files
+        to disk.  This option instructs <command>initdb</command> to skip
+        synchronizing all files in the individual database directories, the
+        database directories themselves, and the tablespace directories, i.e.,
+        everything in the <filename>base</filename> subdirectory and any other
+        tablespace directories.  Other files, such as those in
+        <literal>pg_wal</literal> and <literal>pg_xact</literal>, will still be
+        synchronized unless the <option>--no-sync</option> option is also
+        specified.
+       </para>
+       <para>
+        Note that if <option>--no-sync-data-files</option> is used in
+        conjunction with <option>--sync-method=syncfs</option>, some or all of
+        the aforementioned files and directories will be synchronized because
+        <literal>syncfs</literal> processes entire file systems.
+       </para>
+       <para>
+        This option is primarily intended for internal use by tools that
+        separately ensure the skipped files are synchronized to disk.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="app-initdb-option-no-instructions">
       <term><option>--no-instructions</option></term>
       <listitem>
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 21a0fe3ecd9..22b7d31b165 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -168,6 +168,7 @@ static bool data_checksums = true;
 static char *xlog_dir = NULL;
 static int	wal_segment_size_mb = (DEFAULT_XLOG_SEG_SIZE) / (1024 * 1024);
 static DataDirSyncMethod sync_method = DATA_DIR_SYNC_METHOD_FSYNC;
+static bool sync_data_files = true;
 
 
 /* internal vars */
@@ -2566,6 +2567,7 @@ usage(const char *progname)
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
 	printf(_("  -n, --no-clean            do not clean up after errors\n"));
 	printf(_("  -N, --no-sync             do not wait for changes to be written safely to disk\n"));
+	printf(_("      --no-sync-data-files  do not sync files within database directories\n"));
 	printf(_("      --no-instructions     do not print instructions for next steps\n"));
 	printf(_("  -s, --show                show internal settings, then exit\n"));
 	printf(_("      --sync-method=METHOD  set method for syncing files to disk\n"));
@@ -3208,6 +3210,7 @@ main(int argc, char *argv[])
 		{"icu-rules", required_argument, NULL, 18},
 		{"sync-method", required_argument, NULL, 19},
 		{"no-data-checksums", no_argument, NULL, 20},
+		{"no-sync-data-files", no_argument, NULL, 21},
 		{NULL, 0, NULL, 0}
 	};
 
@@ -3402,6 +3405,9 @@ main(int argc, char *argv[])
 			case 20:
 				data_checksums = false;
 				break;
+			case 21:
+				sync_data_files = false;
+				break;
 			default:
 				/* getopt_long already emitted a complaint */
 				pg_log_error_hint("Try \"%s --help\" for more information.", progname);
@@ -3453,7 +3459,7 @@ main(int argc, char *argv[])
 
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 		return 0;
 	}
@@ -3516,7 +3522,7 @@ main(int argc, char *argv[])
 	{
 		fputs(_("syncing data to disk ... "), stdout);
 		fflush(stdout);
-		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method);
+		sync_pgdata(pg_data, PG_VERSION_NUM, sync_method, sync_data_files);
 		check_ok();
 	}
 	else
diff --git a/src/bin/initdb/t/001_initdb.pl b/src/bin/initdb/t/001_initdb.pl
index 01cc4a1602b..15dd10ce40a 100644
--- a/src/bin/initdb/t/001_initdb.pl
+++ b/src/bin/initdb/t/001_initdb.pl
@@ -76,6 +76,7 @@ command_like(
 	'checksums are enabled in control file');
 
 command_ok([ 'initdb', '--sync-only', $datadir ], 'sync only');
+command_ok([ 'initdb', '--sync-only', '--no-sync-data-files', $datadir ], '--no-sync-data-files');
 command_fails([ 'initdb', $datadir ], 'existing data directory');
 
 if ($supports_syncfs)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index d4b4e334014..1da4bfc2351 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -2310,7 +2310,7 @@ BaseBackup(char *compression_algorithm, char *compression_detail,
 		}
 		else
 		{
-			(void) sync_pgdata(basedir, serverVersion, sync_method);
+			(void) sync_pgdata(basedir, serverVersion, sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index 867aeddc601..f20be82862a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -633,7 +633,7 @@ main(int argc, char *argv[])
 		if (do_sync)
 		{
 			pg_log_info("syncing data directory");
-			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method);
+			sync_pgdata(DataDir, PG_VERSION_NUM, sync_method, true);
 		}
 
 		pg_log_info("updating control file");
diff --git a/src/bin/pg_combinebackup/pg_combinebackup.c b/src/bin/pg_combinebackup/pg_combinebackup.c
index d480dc74436..050260ee832 100644
--- a/src/bin/pg_combinebackup/pg_combinebackup.c
+++ b/src/bin/pg_combinebackup/pg_combinebackup.c
@@ -424,7 +424,7 @@ main(int argc, char *argv[])
 		else
 		{
 			pg_log_debug("recursively fsyncing \"%s\"", opt.output);
-			sync_pgdata(opt.output, version * 10000, opt.sync_method);
+			sync_pgdata(opt.output, version * 10000, opt.sync_method, true);
 		}
 	}
 
diff --git a/src/bin/pg_rewind/file_ops.c b/src/bin/pg_rewind/file_ops.c
index 467845419ed..55659ce201f 100644
--- a/src/bin/pg_rewind/file_ops.c
+++ b/src/bin/pg_rewind/file_ops.c
@@ -296,7 +296,7 @@ sync_target_dir(void)
 	if (!do_sync || dry_run)
 		return;
 
-	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method);
+	sync_pgdata(datadir_target, PG_VERSION_NUM, sync_method, true);
 }
 
 
diff --git a/src/common/file_utils.c b/src/common/file_utils.c
index eaa2e76f43f..1e6250cc190 100644
--- a/src/common/file_utils.c
+++ b/src/common/file_utils.c
@@ -50,7 +50,8 @@ static int	pre_sync_fname(const char *fname, bool isdir);
 #endif
 static void walkdir(const char *path,
 					int (*action) (const char *fname, bool isdir),
-					bool process_symlinks);
+					bool process_symlinks,
+					const char *exclude_dir);
 
 #ifdef HAVE_SYNCFS
 
@@ -93,11 +94,15 @@ do_syncfs(const char *path)
  * syncing, and might not have privileges to write at all.
  *
  * serverVersion indicates the version of the server to be sync'd.
+ *
+ * If sync_data_files is false, this function skips syncing "base/" and any
+ * other tablespace directories.
  */
 void
 sync_pgdata(const char *pg_data,
 			int serverVersion,
-			DataDirSyncMethod sync_method)
+			DataDirSyncMethod sync_method,
+			bool sync_data_files)
 {
 	bool		xlog_is_symlink;
 	char		pg_wal[MAXPGPATH];
@@ -147,30 +152,33 @@ sync_pgdata(const char *pg_data,
 				do_syncfs(pg_data);
 
 				/* If any tablespaces are configured, sync each of those. */
-				dir = opendir(pg_tblspc);
-				if (dir == NULL)
-					pg_log_error("could not open directory \"%s\": %m",
-								 pg_tblspc);
-				else
+				if (sync_data_files)
 				{
-					while (errno = 0, (de = readdir(dir)) != NULL)
+					dir = opendir(pg_tblspc);
+					if (dir == NULL)
+						pg_log_error("could not open directory \"%s\": %m",
+									 pg_tblspc);
+					else
 					{
-						char		subpath[MAXPGPATH * 2];
+						while (errno = 0, (de = readdir(dir)) != NULL)
+						{
+							char		subpath[MAXPGPATH * 2];
 
-						if (strcmp(de->d_name, ".") == 0 ||
-							strcmp(de->d_name, "..") == 0)
-							continue;
+							if (strcmp(de->d_name, ".") == 0 ||
+								strcmp(de->d_name, "..") == 0)
+								continue;
 
-						snprintf(subpath, sizeof(subpath), "%s/%s",
-								 pg_tblspc, de->d_name);
-						do_syncfs(subpath);
-					}
+							snprintf(subpath, sizeof(subpath), "%s/%s",
+									 pg_tblspc, de->d_name);
+							do_syncfs(subpath);
+						}
 
-					if (errno)
-						pg_log_error("could not read directory \"%s\": %m",
-									 pg_tblspc);
+						if (errno)
+							pg_log_error("could not read directory \"%s\": %m",
+										 pg_tblspc);
 
-					(void) closedir(dir);
+						(void) closedir(dir);
+					}
 				}
 
 				/* If pg_wal is a symlink, process that too. */
@@ -182,15 +190,21 @@ sync_pgdata(const char *pg_data,
 
 		case DATA_DIR_SYNC_METHOD_FSYNC:
 			{
+				char	   *exclude_dir = NULL;
+
+				if (!sync_data_files)
+					exclude_dir = psprintf("%s/base", pg_data);
+
 				/*
 				 * If possible, hint to the kernel that we're soon going to
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(pg_data, pre_sync_fname, false);
+				walkdir(pg_data, pre_sync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, pre_sync_fname, false);
-				walkdir(pg_tblspc, pre_sync_fname, true);
+					walkdir(pg_wal, pre_sync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, pre_sync_fname, true, NULL);
 #endif
 
 				/*
@@ -203,10 +217,14 @@ sync_pgdata(const char *pg_data,
 				 * get fsync'd twice. That's not an expected case so we don't
 				 * worry about optimizing it.
 				 */
-				walkdir(pg_data, fsync_fname, false);
+				walkdir(pg_data, fsync_fname, false, exclude_dir);
 				if (xlog_is_symlink)
-					walkdir(pg_wal, fsync_fname, false);
-				walkdir(pg_tblspc, fsync_fname, true);
+					walkdir(pg_wal, fsync_fname, false, NULL);
+				if (sync_data_files)
+					walkdir(pg_tblspc, fsync_fname, true, NULL);
+
+				if (exclude_dir)
+					pfree(exclude_dir);
 			}
 			break;
 	}
@@ -245,10 +263,10 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 				 * fsync the data directory and its contents.
 				 */
 #ifdef PG_FLUSH_DATA_WORKS
-				walkdir(dir, pre_sync_fname, false);
+				walkdir(dir, pre_sync_fname, false, NULL);
 #endif
 
-				walkdir(dir, fsync_fname, false);
+				walkdir(dir, fsync_fname, false, NULL);
 			}
 			break;
 	}
@@ -264,6 +282,9 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
  * ignored in subdirectories, ie we intentionally don't pass down the
  * process_symlinks flag to recursive calls.
  *
+ * If exclude_dir is not NULL, it specifies a directory path to skip
+ * processing.
+ *
  * Errors are reported but not considered fatal.
  *
  * See also walkdir in fd.c, which is a backend version of this logic.
@@ -271,11 +292,15 @@ sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method)
 static void
 walkdir(const char *path,
 		int (*action) (const char *fname, bool isdir),
-		bool process_symlinks)
+		bool process_symlinks,
+		const char *exclude_dir)
 {
 	DIR		   *dir;
 	struct dirent *de;
 
+	if (exclude_dir && strcmp(exclude_dir, path) == 0)
+		return;
+
 	dir = opendir(path);
 	if (dir == NULL)
 	{
@@ -299,7 +324,7 @@ walkdir(const char *path,
 				(*action) (subpath, false);
 				break;
 			case PGFILETYPE_DIR:
-				walkdir(subpath, action, false);
+				walkdir(subpath, action, false, exclude_dir);
 				break;
 			default:
 
diff --git a/src/include/common/file_utils.h b/src/include/common/file_utils.h
index a832210adc1..8274bc877ab 100644
--- a/src/include/common/file_utils.h
+++ b/src/include/common/file_utils.h
@@ -35,7 +35,7 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 #ifdef FRONTEND
 extern int	fsync_fname(const char *fname, bool isdir);
 extern void sync_pgdata(const char *pg_data, int serverVersion,
-						DataDirSyncMethod sync_method);
+						DataDirSyncMethod sync_method, bool sync_data_files);
 extern void sync_dir_recurse(const char *dir, DataDirSyncMethod sync_method);
 extern int	durable_rename(const char *oldfile, const char *newfile);
 extern int	fsync_parent_path(const char *fname);
-- 
2.39.5 (Apple Git-154)

#40Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#39)
Re: optimize file transfer in pg_upgrade

On Thu, Mar 20, 2025 at 03:23:13PM -0500, Nathan Bossart wrote:

I'm still aiming to commit this sometime early next week.

Committed. Thanks to everyone who chimed in on this thread.

While writing the attributions, I noticed that nobody seems to have
commented specifically on 0001. The closest thing to a review I see is
Greg's note upthread [0]. This patch is a little bigger than what I'd
ordinarily feel comfortable with committing unilaterally, but it's been
posted in its current form since February 28th, this thread has gotten a
decent amount of traffic since then, and it's not a huge change ("9 files
changed, 96 insertions(+), 37 deletions(-)"). I'm happy to address any
post-commit feedback that folks have. As noted earlier [1], I'm not wild
about how it's implemented, but this is the nicest approach I've thought of
thus far.

I also wanted to draw attention to this note in 0003:

/*
* XXX: The below line is a hack to deal with the fact that we
* presently don't have an easy way to find the corresponding new
* tablespace's path. This will need to be fixed if/when we add
* pg_upgrade support for in-place tablespaces.
*/
new_tablespace = old_tablespace;

I intend to address this in v19, primarily to enable same-version
pg_upgrade testing with non-default tablespaces. My current thinking is
that we should have pg_upgrade also gather the new cluster tablespace
information and map them to the corresponding tablespaces on the old
cluster. This might require some refactoring in pg_upgrade. In any case,
I didn't feel this should block the feature for v18.

[0]: /messages/by-id/CAKAnmm+i3Q1pZ05N_b8=S3B=rztQDn--HoW8BRKVtCg53r8NiQ@mail.gmail.com
[1]: /messages/by-id/Z9h5Spp76EBygyEL@nathan

--
nathan

#41Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#37)
Re: optimize file transfer in pg_upgrade

Hello Nathan,

20.03.2025 04:02, Nathan Bossart wrote:

On Wed, Mar 19, 2025 at 04:28:23PM -0500, Nathan Bossart wrote:
And here is yet another new version of the full patch set. I'm planning to
commit 0001 (the new pg_upgrade transfer mode test) tomorrow so that I can
deal with any buildfarm indigestion before committing swap mode. I did run
the test locally for upgrades from v9.6, v13, and v17, but who knows what
unique configurations I've failed to anticipate...

I found a couple of 006_transfer_modes failures during the past month:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-04-08%2004%3A18%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-04-21%2008%3A03%3A06

Both happened on Windows, but what's worse is that the failure logs
contain no information on the exact reason. We can see:
#   Failed test 'pg_upgrade with transfer mode --swap: stdout matches'
#   at C:/tools/xmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/006_transfer_modes.pl line 61.
...
# Restoring database schemas in the new cluster
# *failure*
#
# Consult the last few lines of
# "C:/tools/xmsys64/home/pgrunner/bf/root/HEAD/pgsql.build/testrun/pg_upgrade/006_transfer_modes/data/t_006_transfer_modes_new_data/pgdata/pg_upgrade_output.d/20250421T081115.575/log/pg_upgrade_dump_1.log" for
# the probable cause of the failure.
# Failure, exiting
# '
#     doesn't match '(?^:.* not supported on this platform|could not .* between old and new data directories: .*)'

there is a reference to pg_upgrade_dump_x.log, but no such files saved.

I tried to reproduce this failure locally, but failed. Still, I discovered
that when the test fails, the target directory containing pgdata/ gets
removed because of this coding:
    my $result = command_ok_or_fails_like(
...
    # If pg_upgrade was successful, check that all of our test objects reached
    # the new version.
    if ($result)
    {
...
    }

    $old->clean_node();
    $new->clean_node();

Moreover, even when pg_upgrade succeeds, IPC::Run::run inside
command_ok_or_fails_like() returns false, as we can see from a
successful test run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2025-04-27%2001%3A03%3A06&stg=misc-check

pgsql.build/testrun/pg_upgrade/006_transfer_modes/log/regress_log_006_transfer_modes
[01:18:38.210](21.036s) ok 1 - pg_upgrade with transfer mode --clone: stdout matches
[01:18:38.211](0.001s) ok 2 - pg_upgrade with transfer mode --clone: stderr matches

The corresponding code is:
    print("# Running: " . join(" ", @{$cmd}) . "\n");
    my $result = IPC::Run::run $cmd, '>' => \$stdout, '2>' => \$stderr;
    if (!$result)
    {
        like($stdout, $expected_stdout, "$test_name: stdout matches");
        like($stderr, $expected_stderr, "$test_name: stderr matches");
    }

So maybe it's worth adjusting the test somehow so that the interesting logs
are left behind after a failure?
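
For example, here is a rough, untested sketch of what I mean. It reuses the
$result/$old/$new variables from the snippet above and assumes File::Find
from core Perl plus the usual note()/slurp_file()/data_dir() helpers from
the TAP infrastructure are already available in the test file:

    # Rough sketch: if the pg_upgrade step failed, dump the server-side logs
    # before clean_node() removes the data directories.
    use File::Find;

    if (!$result)
    {
        my $outputdir = $new->data_dir . '/pg_upgrade_output.d';
        if (-d $outputdir)
        {
            find(
                sub {
                    # $_ is the basename inside the wanted sub
                    return unless -f $_ && /\.(?:log|txt)$/;
                    note("=== $File::Find::name ===");
                    note(slurp_file($File::Find::name));
                },
                $outputdir);
        }
    }

    $old->clean_node();
    $new->clean_node();

Of course, this would also fire for the expected "not supported on this
platform" failures, but that seems like acceptable noise in the regress log.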

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

#42Nathan Bossart
nathandbossart@gmail.com
In reply to: Alexander Lakhin (#41)
Re: optimize file transfer in pg_upgrade

On Sun, Apr 27, 2025 at 05:00:01PM +0300, Alexander Lakhin wrote:

Both happened on Windows, but what's worse is that the failure logs
contain no information on the exact reason. We can see:
#   Failed test 'pg_upgrade with transfer mode --swap: stdout matches'
#   at C:/tools/xmsys64/home/pgrunner/bf/root/HEAD/pgsql/src/bin/pg_upgrade/t/006_transfer_modes.pl line 61.
...
# Restoring database schemas in the new cluster
# *failure*

I see a couple of other pg_upgrade failures on drongo and fairywren that
look similar, although these are for different tests:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-03-30%2013%3A03%3A05

Moreover, even when pg_upgrade succeeds, IPC::Run::run inside
command_ok_or_fails_like() returns false, as we can see from a
successful test run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2025-04-27%2001%3A03%3A06&stg=misc-check

pgsql.build/testrun/pg_upgrade/006_transfer_modes/log/regress_log_006_transfer_modes
[01:18:38.210](21.036s) ok 1 - pg_upgrade with transfer mode --clone: stdout matches
[01:18:38.211](0.001s) ok 2 - pg_upgrade with transfer mode --clone: stderr matches

That's expected for platforms that don't support all of the modes. We
verify the output matches a known error message in that case.

So maybe it's worth adjusting the test somehow so that the interesting logs
are left behind after a failure?

I see some other discussion about failures with similar symptoms [0] [1].
Commit 6f97ef0 seems to have helped with one of the tests, and there is a
proposed patch in the latest thread [2] that AFAICT aims to fix the
underlying issue.

[0]: /messages/by-id/TYAPR01MB5866AB7FD922CE30A2565B8BF5A8A@TYAPR01MB5866.jpnprd01.prod.outlook.com
[1]: /messages/by-id/CALDaNm3tjY44HoSwY84=XGEbTg0ruVfD4hAMTm=TgBqVysH4Qw@mail.gmail.com
[2]: /messages/by-id/CALDaNm2y+nf-V9tjKwvbPprobZs1t_UrcCpJ0qYD5-KkOUFAyg@mail.gmail.com

--
nathan

#43Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#42)
Re: optimize file transfer in pg_upgrade

Hello Nathan,

28.04.2025 18:15, Nathan Bossart wrote:

I see a couple of other pg_upgrade failures on drongo and fairywren that
look similar, although these are for different tests:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-03-10%2019%3A26%3A35
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-03-30%2013%3A03%3A05

Yeah, I've categorized the first one as [1], but now I see that it's
something different, because "pg_upgrade_output.d/ removed after
successful pg_upgrade" is not the only (and not the first) failure there.
Will fix. As to the second one, yes, it's similar in the sense that the
failed test log doesn't contain information needed to understand the
cause.

Moreover, even when pg_upgrade succeeds, IPC::Run::run inside
command_ok_or_fails_like() returns false, as we can see from a
successful test run:
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=fairywren&dt=2025-04-27%2001%3A03%3A06&stg=misc-check

pgsql.build/testrun/pg_upgrade/006_transfer_modes/log/regress_log_006_transfer_modes
[01:18:38.210](21.036s) ok 1 - pg_upgrade with transfer mode --clone: stdout matches
[01:18:38.211](0.001s) ok 2 - pg_upgrade with transfer mode --clone: stderr matches

That's expected for platforms that don't support all of the modes. We
verify the output matches a known error message in that case.

Yes, I meant that in that case we can't determine, outside of
command_ok_or_fails_like(), whether to preserve the logs of the failed
pg_upgrade.

So maybe it's worth adjusting the test somehow so that the interesting logs
are left behind after a failure?

I see some other discussion about failures with similar symptoms [0] [1].
Commit 6f97ef0 seems to have helped with one of the tests, and there is a
proposed patch in the latest thread [2] that AFAICT aims to fix the
underlying issue.

Thank you for the references! Unfortunately I still can't see where the
lack of upgrade log files is discussed.

In other words, if we had logs like in the case [2], it could be helpful.

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Upgrade_tests_fail_on_Windows_due_to_pg_upgrade_output.d.2F_not_removed
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=copperhead&dt=2025-02-20%2017%3A01%3A23

Best regards,
Alexander Lakhin
Neon (https://neon.tech)

#44Nathan Bossart
nathandbossart@gmail.com
In reply to: Alexander Lakhin (#43)
Re: optimize file transfer in pg_upgrade

On Mon, Apr 28, 2025 at 09:00:01PM +0300, Alexander Lakhin wrote:

Thank you for the references! Unfortunately I still can't see where the
lack of upgrade log files is discussed.

That was briefly discussed here:

/messages/by-id/644cf995-e3a5-4f69-9398-7db500e2673d@dunslane.net

One other potential problem with this test is that we reuse the directory
names for each transfer mode. That seems easy enough to fix.
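
For example, something along these lines would keep the directories
separate. This is just a sketch; the create_nodes() helper, the
$suffix-based naming, and the oldinstall handling here are hypothetical,
not what the test currently does:

    # Hypothetical sketch: derive the node names (and therefore the data
    # directory names) from the transfer mode under test, so a later mode
    # doesn't clobber the previous mode's directories and logs.
    use PostgreSQL::Test::Cluster;

    sub create_nodes
    {
        my ($mode) = @_;

        # Make the mode name safe for use in a directory name, e.g.
        # "--copy-file-range" becomes "copyfilerange".
        (my $suffix = $mode) =~ s/[^A-Za-z0-9]+//g;

        # For cross-version upgrades, the old node comes from the
        # installation pointed to by $ENV{oldinstall}.
        my @old_opts =
          defined $ENV{oldinstall} ? (install_path => $ENV{oldinstall}) : ();

        my $old = PostgreSQL::Test::Cluster->new("old_$suffix", @old_opts);
        my $new = PostgreSQL::Test::Cluster->new("new_$suffix");

        return ($old, $new);
    }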

--
nathan

#45Alexander Lakhin
exclusion@gmail.com
In reply to: Nathan Bossart (#44)
Re: optimize file transfer in pg_upgrade

Hello Nathan,

28.04.2025 21:26, Nathan Bossart wrote:

One other potential problem with this test is that we reuse the directory
names for each transfer mode. That seems easy enough to fix.

FWIW, I've counted seven 006_transfer_modes failures that happened during
this year. Five are from Windows animals:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-04-08%2004%3A18%3A15
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-04-21%2008%3A03%3A06
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-07-21%2012%3A35%3A58
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2025-08-22%2000%3A04%3A05
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2025-12-28%2003%3A43%3A24

And two from culicidae, which tests EXEC_BACKEND and thus suffers from [1]
([2]):
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-11-22%2012%3A31%3A23
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae&dt=2025-12-14%2018%3A24%3A48

This number could probably justify improving the test so that we can
reliably identify the failure reason from the upgrade logs. As of now, we
can only guess at it, based on the animals' specifics...

[1]: https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#culicidae_failed_to_restart_server_due_to_incorrect_checksum_in_control_file
[2]: /messages/by-id/7ff9de7f-7203-cad9-76d9-45497cbedac7@gmail.com

Best regards,
Alexander